Showing 1–50 of 134 results for author: Baral, C

  1. arXiv:2406.17169  [pdf, other]

    cs.CL cs.AI

    Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

    Authors: Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, Chitta Baral

    Abstract: As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datas…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 23 Pages

  2. arXiv:2406.15444  [pdf, other]

    cs.CL

    Investigating the Robustness of LLMs on Math Word Problems

    Authors: Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra

    Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, ProbleMATHIC, containing both adversarial and non-adversarial MWPs. Our experim…

    Submitted 30 May, 2024; originally announced June 2024.

  3. arXiv:2406.05494  [pdf, other]

    cs.CL

    Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation

    Authors: Neeraj Varshney, Satyam Raj, Venkatesh Mishra, Agneet Chatterjee, Ritika Sarkar, Amir Saeidi, Chitta Baral

    Abstract: Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, they have been shown to suffer from a critical limitation pertinent to 'hallucination' in their output. Recent research has focused on investigating and addressing this problem for a variety of tasks such as biography generation, question answering, abstractive summarization,…

    Submitted 8 June, 2024; originally announced June 2024.

  4. arXiv:2406.04046  [pdf, other]

    cs.CC cs.AI

    ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints

    Authors: Divij Handa, Pavel Dolin, Shrinidhi Kumbhar, Chitta Baral, Tran Cao Son

    Abstract: Reasoning about actions and change (RAC) has historically driven the development of many early AI challenges, such as the frame problem, and many AI disciplines, including non-monotonic and commonsense reasoning. The role of RAC remains important even now, particularly for tasks involving dynamic environments, interactive scenarios, and commonsense reasoning. Despite the progress of Large Language…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 54 pages, 11 figures

  5. arXiv:2406.03827  [pdf, other]

    cs.CL

    Chaos with Keywords: Exposing Large Language Models Sycophancy to Misleading Keywords and Evaluating Defense Strategies

    Authors: Aswin RRV, Nemika Tyagi, Md Nayem Uddin, Neeraj Varshney, Chitta Baral

    Abstract: This study explores the sycophantic tendencies of Large Language Models (LLMs), where these models tend to provide answers that match what users want to hear, even if they are not entirely correct. The motivation behind this exploration stems from the common behavior observed in individuals searching the internet for facts with partial or misleading knowledge. Similar to using web search engines,…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: To be published in Findings of ACL 2024

  6. arXiv:2405.16681  [pdf, other]

    cs.CL

    Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization

    Authors: Amir Saeidi, Shivanshu Verma, Aswin RRV, Chitta Baral

    Abstract: Large Language Models (LLMs) perform well across diverse tasks, but aligning them with human demonstrations is challenging. Recently, Reinforcement Learning (RL)-free methods like Direct Preference Optimization (DPO) have emerged, offering improved stability and scalability while retaining competitive performance relative to RL-based methods. However, while RL-free methods deliver satisfactory per…

    Submitted 26 May, 2024; originally announced May 2024.

  7. arXiv:2405.15961  [pdf, other]

    cs.CV

    Grounding Stylistic Domain Generalization with Quantitative Domain Shift Measures and Synthetic Scene Images

    Authors: Yiran Luo, Joshua Feinglass, Tejas Gokhale, Kuan-Cheng Lee, Chitta Baral, Yezhou Yang

    Abstract: Domain Generalization (DG) is a challenging task in machine learning that requires a coherent ability to comprehend shifts across various domains through extraction of domain-invariant features. DG performance is typically evaluated by performing image classification in domains of various image styles. However, current methodology lacks quantitative understanding about shifts in stylistic domain,…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Accepted at the 3rd CVPR Workshop on Vision Datasets Understanding

  8. arXiv:2404.15522  [pdf, other]

    cs.CL cs.AI

    LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

    Authors: Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral

    Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logi…

    Submitted 6 June, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted at ACL(Main) 2024 | First version available @ https://openreview.net/forum?id=7NR2ZVzZxx

  9. arXiv:2404.14723  [pdf, other]

    cs.CL

    Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

    Authors: Amir Saeidi, Shivanshu Verma, Chitta Baral

    Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehen…

    Submitted 22 April, 2024; originally announced April 2024.

  10. arXiv:2404.08540  [pdf, other]

    cs.CV

    On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

    Authors: Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness acros…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project webpage: https://agneetchatterjee.com/robustness_depth_lang/

  11. arXiv:2404.01197  [pdf, other]

    cs.CV

    Getting it Right: Improving Spatial Consistency in Text-to-Image Models

    Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

    Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Project webpage: https://spright-t2i.github.io/

  12. arXiv:2403.11092  [pdf, other]

    cs.CL cs.AI cs.CV cs.CY eess.IV

    Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

    Authors: Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

    Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and co…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: NAACL 2024 Main Conference

  13. arXiv:2402.10601  [pdf, other]

    cs.CL cs.AI

    Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

    Authors: Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral

    Abstract: Large Language Models (LLMs) are aligned to moral and ethical guidelines but remain susceptible to creative prompts called Jailbreak that can bypass the alignment process. However, most jailbreaking prompts contain harmful questions in the natural language (mainly English), which can be detected by the LLMs themselves. In this paper, we present jailbreaking prompts encoded using cryptographic techn…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: 15 pages

  14. arXiv:2402.05195  [pdf, other]

    cs.CV cs.CL

    $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space

    Authors: Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang

    Abstract: Despite the recent advances in personalized text-to-image (P-T2I) generative models, it remains challenging to perform finetuning-free multi-subject-driven T2I in a resource-efficient manner. Predominantly, contemporary approaches, involving the training of Hypernetworks and Multimodal Large Language Models (MLLMs), require heavy computing resources that range from 600 to 12300 GPU hours of traini…

    Submitted 9 April, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: Project page: https://eclipse-t2i.github.io/Lambda-ECLIPSE/

  15. arXiv:2401.00287  [pdf, other]

    cs.CL

    The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness

    Authors: Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral

    Abstract: As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of NLP research. This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and ana…

    Submitted 30 December, 2023; originally announced January 2024.

  16. arXiv:2312.04655  [pdf, other]

    cs.CV

    ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

    Authors: Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, Yezhou Yang

    Abstract: Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks, at the cost of significant computational resources. The unCLIP stack comprises T2I prior and diffusion image decoder. The T2I prior model alone adds a billion parameters compared to the Latent Diffusion Models, which increases the co…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project Page: https://eclipse-t2i.vercel.app/

  17. arXiv:2311.09564  [pdf, other]

    cs.CL cs.AI

    LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks

    Authors: Mihir Parmar, Aakanksha Naik, Himanshu Gupta, Disha Agrawal, Chitta Baral

    Abstract: Many large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences such as a complete electronic health record (EHR) has not been systematically explored. Assessing these models on long sequences is crucial since prior work in the general domain has demonstrated performance degradation of LLMs on longer texts. Motivated by this,…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: 8 pages

  18. arXiv:2310.18581  [pdf, other]

    cs.CL

    Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

    Authors: Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral

    Abstract: Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive. Focusing on this problem, we propose to instruction tune LLMs with additional explicit losses from the intermediate layers (LITE) and show that it enables these layers to acquire 'good' generation abil…

    Submitted 7 November, 2023; v1 submitted 28 October, 2023; originally announced October 2023.

  19. arXiv:2310.17876  [pdf, other]

    cs.CL

    TarGEN: Targeted Data Generation with Large Language Models

    Authors: Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra

    Abstract: The rapid advancement of large language models (LLMs) has sparked interest in data synthesis techniques, aiming to generate diverse and high-quality synthetic datasets. However, these synthetic datasets often suffer from a lack of diversity and added noise. In this paper, we present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets utilizing an LLM. An advantage…

    Submitted 30 October, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: 10 pages, 6 tables, 5 figures, 5 pages references, 17 pages appendix

  20. arXiv:2310.14495  [pdf, other]

    cs.CL cs.AI

    InstructExcel: A Benchmark for Natural Language Instruction in Excel

    Authors: Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri

    Abstract: With the evolution of Large Language Models (LLMs) we can solve increasingly more complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel specific tasks provided via natural language user instructions. To do so we introduce a new large-scale be…

    Submitted 22 October, 2023; originally announced October 2023.

    Comments: Findings of EMNLP 2023, 18 pages

  21. arXiv:2310.00836  [pdf, other]

    cs.CL cs.AI

    Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models

    Authors: Man Luo, Shrinidhi Kumbhar, Ming Shen, Mihir Parmar, Neeraj Varshney, Pratyay Banerjee, Somak Aditya, Chitta Baral

    Abstract: Logical reasoning is fundamental for humans yet presents a substantial challenge in the domain of Artificial Intelligence. Initially, researchers used Knowledge Representation and Reasoning (KR) systems that did not scale and required non-trivial manual effort. Recently, the emergence of large language models (LLMs) has demonstrated the ability to overcome various limitations of formal Knowledge R…

    Submitted 30 March, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: Work in progress

  22. arXiv:2309.04635  [pdf, other]

    cs.CL

    Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?

    Authors: Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan Mallina, Aryan Bhavin Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, Chitta Baral

    Abstract: Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and a definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer. Incorrectly answering such questions certainly hampers a system's reliability and trustworthine…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: TrustNLP Workshop at ACL 2023

  23. arXiv:2309.00743  [pdf, other]

    cs.RO cs.AI cs.CL

    Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

    Authors: Divyanshu Raj, Chitta Baral, Nakul Gopalan

    Abstract: In this work, we present an approach to identify sub-tasks within a demonstrated robot trajectory using language instructions. We identify these sub-tasks using language provided during demonstrations as guidance to identify sub-segments of a longer robot trajectory. Given a sequence of natural language instructions and a long trajectory consisting of image frames and discrete actions, we want to…

    Submitted 1 September, 2023; originally announced September 2023.

    Comments: 9 Pages, 13 figures, Accepted paper at the RSS 2023 Workshop on Articulate Robots: Utilizing Language for Robot Learning

  24. arXiv:2308.08147  [pdf, other]

    cs.CL

    MDDial: A Multi-turn Differential Diagnosis Dialogue Dataset with Reliability Evaluation

    Authors: Srija Macherla, Man Luo, Mihir Parmar, Chitta Baral

    Abstract: Dialogue systems for Automatic Differential Diagnosis (ADD) have a wide range of real-life applications. These dialogue systems are promising for providing easy access and reducing medical costs. Building end-to-end ADD dialogue systems requires dialogue training datasets. However, to the best of our knowledge, there is no publicly available ADD dialogue dataset in English (although non-English da…

    Submitted 16 August, 2023; originally announced August 2023.

  25. arXiv:2306.05539  [pdf, other]

    cs.CL

    Instruction Tuned Models are Quick Learners

    Authors: Himanshu Gupta, Saurabh Arjun Sawant, Swaroop Mishra, Mutsumi Nakamura, Arindam Mitra, Santosh Mashetty, Chitta Baral

    Abstract: Instruction tuning of language models has demonstrated the ability to enhance model generalization to unseen tasks via in-context learning using a few examples. However, typical supervised learning still requires a plethora of downstream training data for finetuning. Often in real-world situations, there is a scarcity of data available for finetuning, falling somewhere between few shot inference a…

    Submitted 17 May, 2023; originally announced June 2023.

    Comments: 9 pages, 5 figures, 19 Tables (including appendix), 12 pages of Appendix

  26. arXiv:2306.04695  [pdf, other]

    cs.CV cs.CL cs.LG

    ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models

    Authors: Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high definition and realistic image quality generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualita…

    Submitted 22 February, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted at AAAI'24 | Project page: https://conceptbed.github.io

  27. arXiv:2306.00424  [pdf, other]

    cs.CL cs.CV cs.IR

    End-to-end Knowledge Retrieval with Multi-modal Queries

    Authors: Man Luo, Zhiyuan Fang, Tejas Gokhale, Yezhou Yang, Chitta Baral

    Abstract: We investigate knowledge retrieval with multi-modal queries, i.e. queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and imag…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  28. arXiv:2305.16357  [pdf, other]

    cs.CL

    EDM3: Event Detection as Multi-task Text Generation

    Authors: Ujjwala Anantheswaran, Himanshu Gupta, Mihir Parmar, Kuntal Kumar Pal, Chitta Baral

    Abstract: Event detection refers to identifying event occurrences in a text and comprises two subtasks: event identification and classification. We present EDM3, a novel approach for Event Detection that formulates three generative tasks: identification, classification, and combined detection. We show that EDM3 helps to learn transferable knowledge that can be leveraged to perform Event Detection and its…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: 9 pages, 4 figures, 10 tables, 5 Page appendix

  29. arXiv:2305.14128  [pdf, other]

    cs.CL cs.AI

    Dr.ICL: Demonstration-Retrieved In-context Learning

    Authors: Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, Vincent Y Zhao

    Abstract: In-context learning (ICL), teaching a large language model (LLM) to perform a task with few-shot demonstrations rather than adjusting the model parameters, has emerged as a strong paradigm for using LLMs. While early studies primarily used a fixed or random set of demonstrations for all test queries, recent research suggests that retrieving semantically similar demonstrations to the input from a p…

    Submitted 23 May, 2023; originally announced May 2023.

  30. arXiv:2305.12096  [pdf, other]

    cs.CL

    Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions?

    Authors: Neeraj Varshney, Mihir Parmar, Nisarg Patel, Divij Handa, Sayantan Sarkar, Man Luo, Chitta Baral

    Abstract: Pre-training on large corpora of text enables the language models to acquire a vast amount of factual and commonsense knowledge which allows them to achieve remarkable performance on a variety of language understanding tasks. They typically acquire this knowledge by learning from the pre-training text and capturing certain patterns from it. However, real-world settings often present scenarios that…

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: 6 pages

  31. arXiv:2305.05079  [pdf, other]

    cs.CL

    A Unified Evaluation Framework for Novelty Detection and Accommodation in NLP with an Instantiation in Authorship Attribution

    Authors: Neeraj Varshney, Himanshu Gupta, Eric Robertson, Bing Liu, Chitta Baral

    Abstract: State-of-the-art natural language processing models have been shown to achieve remarkable performance in 'closed-world' settings where all the labels in the evaluation set are known at training time. However, in real-world settings, 'novel' instances that do not belong to any known class are often observed. This renders the ability to deal with novelties crucial. To initiate a systematic research…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: Findings of ACL 2023

  32. arXiv:2305.01812  [pdf, other]

    cs.CL

    Post-Abstention: Towards Reliably Re-Attempting the Abstained Instances in QA

    Authors: Neeraj Varshney, Chitta Baral

    Abstract: Despite remarkable progress made in natural language processing, even the state-of-the-art models often make incorrect predictions. Such predictions hamper the reliability of systems and limit their widespread adoption in real-world applications. 'Selective prediction' partly addresses the above concern by enabling models to abstain from answering when their predictions are likely to be incorrect.…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  33. arXiv:2303.05400  [pdf, other]

    cs.CL cs.AI cs.CR

    Prompt-Based Learning for Thread Structure Prediction in Cybersecurity Forums

    Authors: Kazuaki Kashihara, Kuntal Kumar Pal, Chitta Baral, Robert P Trevino

    Abstract: With recent trends indicating cyber crimes increasing in both frequency and cost, it is imperative to develop new methods that leverage data-rich hacker forums to assist in combating ever evolving cyber threats. Defining interactions within these forums is critical as it facilitates identifying highly skilled users, which can improve prediction of novel threats and future cyber attacks. We propose…

    Submitted 4 March, 2023; originally announced March 2023.

    Comments: 16 pages, 7 figures, submitted to IntelliSys 2023

  34. arXiv:2302.14208  [pdf, other]

    cs.AI

    Methods and Mechanisms for Interactive Novelty Handling in Adversarial Environments

    Authors: Tung Thai, Ming Shen, Mayank Garg, Ayush Kalani, Nakul Vaidya, Utkarsh Soni, Mudit Verma, Sriram Gopalakrishnan, Neeraj Varshney, Chitta Baral, Subbarao Kambhampati, Jivko Sinapov, Matthias Scheutz

    Abstract: Learning to detect, characterize and accommodate novelties is a challenge that agents operating in open-world domains need to address to be able to guarantee satisfactory task performance. Certain novelties (e.g., changes in environment dynamics) can interfere with the performance or prevent agents from accomplishing task goals altogether. In this paper, we introduce general methods and architectu…

    Submitted 5 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

  35. arXiv:2302.10346  [pdf, other]

    cs.CL cs.AI cs.CR

    Exploring the Limits of Transfer Learning with Unified Model in the Cybersecurity Domain

    Authors: Kuntal Kumar Pal, Kazuaki Kashihara, Ujjwala Anantheswaran, Kirby C. Kuznia, Siddhesh Jagtap, Chitta Baral

    Abstract: With the increase in cybersecurity vulnerabilities of software systems, the ways to exploit them are also increasing. Besides these, malware threats, irregular network interactions, and discussions about exploits in public forums are also on the rise. To identify these threats faster, to detect potentially relevant entities from any texts, and to be aware of software vulnerabilities, automated app…

    Submitted 20 February, 2023; originally announced February 2023.

    Comments: 8 pages

  36. arXiv:2302.08624  [pdf, other]

    cs.CL cs.LG

    InstructABSA: Instruction Learning for Aspect Based Sentiment Analysis

    Authors: Kevin Scaria, Himanshu Gupta, Siddharth Goyal, Saurabh Arjun Sawant, Swaroop Mishra, Chitta Baral

    Abstract: We introduce InstructABSA, an instruction learning paradigm for Aspect-Based Sentiment Analysis (ABSA) subtasks. Our method introduces positive, negative, and neutral examples to each training sample, and instruction tunes the model (Tk-Instruct) for ABSA subtasks, yielding significant performance improvements. Experimental results on the SemEval 2014, 15, and 16 datasets demonstrate that Instruct…

    Submitted 13 November, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: 4 pages, 3 figures, 9 tables, 9 appendix pages

  37. arXiv:2302.04434  [pdf, other]

    cs.CL cs.AI cs.HC cs.LG

    Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow

    Authors: Anjana Arunkumar, Swaroop Mishra, Bhavdeep Sachdeva, Chitta Baral, Chris Bryan

    Abstract: Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP, that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample…

    Submitted 8 February, 2023; originally announced February 2023.

    Comments: EACL 2023

  38. arXiv:2301.10165  [pdf, other]

    cs.CL cs.AI

    Lexi: Self-Supervised Learning of the UI Language

    Authors: Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva

    Abstract: Humans can learn to operate the user interface (UI) of an application by reading an instruction manual or how-to guide. Along with text, these resources include visual content such as UI screenshots and images of application icons referenced in the text. We explore how to leverage this data to learn generic visio-linguistic representations of UI screens and their components. These representations…

    Submitted 23 January, 2023; originally announced January 2023.

    Comments: EMNLP (Findings) 2022

  39. arXiv:2212.10015  [pdf, other]

    cs.CV cs.AI cs.CL

    Benchmarking Spatial Relationships in Text-to-Image Generation

    Authors: Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang

    Abstract: Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of…

    Submitted 27 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: preprint; Code and Data at https://github.com/microsoft/VISOR and https://huggingface.co/datasets/tgokhale/sr2d_visor

  40. arXiv:2212.03866  [pdf, other]

    cs.CV

    Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task

    Authors: Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, Chitta Baral

    Abstract: 'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. The CLEVR_HYP (Sampat…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: 11 pages, 9 figures; Accepted at Findings of EMNLP 2022. arXiv admin note: substantial text overlap with arXiv:2212.03433

  41. arXiv:2212.03433  [pdf, other]

    cs.CV

    Learning Action-Effect Dynamics from Pairs of Scene-graphs

    Authors: Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, Chitta Baral

    Abstract: 'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). Recently, there has been growing interest in the study of RAC with visual and linguistic inputs. Graphs are often used to represent semantic structure of the visual content (i.e. objects, t…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: 5 pages, 6 figures; Accepted at 3rd Workshop on Graphs and more Complex structures for Learning and Reasoning (GCLR) workshop, AAAI 2023

  42. arXiv:2211.12707  [pdf, other]

    cs.CL cs.IR

    Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?

    Authors: Neeraj Varshney, Man Luo, Chitta Baral

    Abstract: Recent state-of-the-art open-domain QA models are typically based on a two stage retriever-reader approach in which the retriever first finds the relevant knowledge/passages and the reader then leverages that to predict the answer. Prior work has shown that the performance of the reader usually tends to improve with the increase in the number of these passages. Thus, state-of-the-art models use a…

    Submitted 23 November, 2022; originally announced November 2022.

    Comments: AAAI-23 Workshop on Knowledge Augmented Methods for NLP

  43. arXiv:2211.03779  [pdf, other]

    cs.CV cs.CL

    CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering

    Authors: Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang

    Abstract: Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: Accepted to EMNLP 2022; https://maitreyapatel.com/CRIPP-VQA/

  44. arXiv:2210.17517  [pdf, other]

    cs.CL cs.AI

    Lila: A Unified Benchmark for Mathematical Reasoning

    Authors: Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan

    Abstract: Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities e.g., arithmetic, calculus (ii) language format e.g., q…

    Submitted 8 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

    MSC Class: 68T50; ACM Class: I.2.7

  45. arXiv:2210.07663  [pdf, other]

    cs.CL cs.CV

    Pretrained Transformers Do not Always Improve Robustness

    Authors: Swaroop Mishra, Bhavdeep Singh Sachdeva, Chitta Baral

    Abstract: Pretrained Transformers (PT) have been shown to improve Out of Distribution (OOD) robustness than traditional models such as Bag of Words (BOW), LSTMs, Convolutional Neural Networks (CNN) powered by Word2Vec and Glove embeddings. How does the robustness comparison hold in a real world setting where some part of the dataset can be noisy? Do PT also provide more robust representation than traditiona…

    Submitted 14 October, 2022; originally announced October 2022.

  46. arXiv:2210.07631  [pdf, other]

    cs.CL cs.CV

    Hardness of Samples Need to be Quantified for a Reliable Evaluation System: Exploring Potential Opportunities with a New Task

    Authors: Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral

    Abstract: Evaluation of models on benchmarks is unreliable without knowing the degree of sample hardness; this subsequently overestimates the capability of AI systems and limits their adoption in real world applications. We propose a Data Scoring task that requires assignment of each unannotated sample in a benchmark a score between 0 to 1, where 0 signifies easy and 1 signifies hard. Use of unannotated sam…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2007.06898

  47. arXiv:2210.07566  [pdf, other]

    cs.CL cs.CV

    A Survey of Parameters Associated with the Quality of Benchmarks in NLP

    Authors: Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral

    Abstract: Several benchmarks have been built with heavy investment in resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the d…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2005.00816

  48. arXiv:2210.07471  [pdf, other]

    cs.CL

    "John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility

    Authors: Himanshu Gupta, Neeraj Varshney, Swaroop Mishra, Kuntal Kumar Pal, Saurabh Arjun Sawant, Kevin Scaria, Siddharth Goyal, Chitta Baral

    Abstract: In current NLP research, large-scale language models and their abilities are widely being discussed. Some recent works have also found notable failures of these models. Often these failure examples involve complex reasoning abilities. This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible. To this end, we introduce FeasibilityQA, a question-an…

    Submitted 2 February, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  49. arXiv:2210.05528  [pdf, other]

    cs.CL cs.AI

    Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems

    Authors: Neeraj Varshney, Chitta Baral

    Abstract: Do all instances need inference through the big models for a correct prediction? Perhaps not; some instances are easy and can be answered correctly by even small capacity models. This provides opportunities for improving the computational efficiency of systems. In this work, we present an explorative study on 'model cascading', a simple technique that utilizes a collection of models of varying cap…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  50. arXiv:2210.04466  [pdf, other]

    cs.CL cs.CV

    Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications

    Authors: Swaroop Mishra, Anjana Arunkumar, Chitta Baral

    Abstract: With the increasing importance of safety requirements associated with the use of black box models, evaluation of selective answering capability of models has been critical. Area under the curve (AUC) is used as a metric for this purpose. We find limitations in AUC; e.g., a model having higher AUC is not always better in performing selective answering. We propose three alternate metrics that fix th…

    Submitted 10 October, 2022; originally announced October 2022.