Search | arXiv e-print repository

CodeNav: Beyond tool-use to using real-world codebases with LLM agents

Authors: Tanmay Gupta, Luca Weihs, Aniruddha Kembhavi

Abstract: We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries. In contrast to tool-use LLM agents that require ``registration'' of all relevant tools via manual descriptions within the LLM context, CodeNav automatically indexes and searches over code blocks in the target codebase, finds relevant code snippets, imports them, and uses them to iteratively generate a solution with execution feedback. To highlight the core-capabilities of CodeNav, we first showcase three case studies where we use CodeNav for solving complex user queries using three diverse codebases. Next, on three benchmarks, we quantitatively compare the effectiveness of code-use (which only has access to the target codebase) to tool-use (which has privileged access to all tool names and descriptions). Finally, we study the effect of varying kinds of tool and library descriptions on code-use performance, as well as investigate the advantage of the agent seeing source code as opposed to natural descriptions of code. All code will be made open source under a permissive license.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11775 [pdf, other]

Task Me Anything

Authors: Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna

Abstract: Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine which produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals critical insights: open-source MLMs excel in object and attribute recognition but lack spatial and temporal understanding; each model exhibits unique strengths and weaknesses; larger models generally perform better, though exceptions exist; and GPT4o demonstrates challenges in recognizing rotating/moving objects and distinguishing colors.

Submitted 17 June, 2024; originally announced June 2024.

Comments: website: https://www.task-me-anything.org

arXiv:2404.05366 [pdf, other]

CDAD-Net: Bridging Domain Gaps in Generalized Category Discovery

Authors: Sai Bhargav Rongali, Sarthak Mehrotra, Ankit Jha, Mohamad Hassan N C, Shirsha Bose, Tanisha Gupta, Mainak Singha, Biplab Banerjee

Abstract: In Generalized Category Discovery (GCD), we cluster unlabeled samples of known and novel classes, leveraging a training dataset of known classes. A salient challenge arises due to domain shifts between these datasets. To address this, we present a novel setting: Across Domain Generalized Category Discovery (AD-GCD) and bring forth CDAD-NET (Class Discoverer Across Domains) as a remedy. CDAD-NET is architected to synchronize potential known class samples across both the labeled (source) and unlabeled (target) datasets, while emphasizing the distinct categorization of the target data. To facilitate this, we propose an entropy-driven adversarial learning strategy that accounts for the distance distributions of target samples relative to source-domain class prototypes. Parallelly, the discriminative nature of the shared space is upheld through a fusion of three metric learning objectives. In the source domain, our focus is on refining the proximity between samples and their affiliated class prototypes, while in the target domain, we integrate a neighborhood-centric contrastive learning mechanism, enriched with an adept neighbors mining approach. To further accentuate the nuanced feature interrelation among semantically aligned images, we champion the concept of conditional image inpainting, underscoring the premise that semantically analogous images prove more efficacious to the task than their disjointed counterparts. Experimentally, CDAD-NET eclipses existing literature with a performance increment of 8-15% on three AD-GCD benchmarks we present.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted in L3D-IVU, CVPR Workshop, 2024

arXiv:2404.01619 [pdf, other]

Making Privacy-preserving Federated Graph Analytics with Strong Guarantees Practical (for Certain Queries)

Authors: Kunlong Liu, Trinabh Gupta

Abstract: Privacy-preserving federated graph analytics is an emerging area of research. The goal is to run graph analytics queries over a set of devices that are organized as a graph while keeping the raw data on the devices rather than centralizing it. Further, no entity may learn any new information except for the final query result. For instance, a device may not learn a neighbor's data. The state-of-the-art prior work for this problem provides privacy guarantees for a broad set of queries in a strong threat model where the devices can be malicious. However, it imposes an impractical overhead: each device locally requires over 8.79 hours of cpu time and 5.73 GiBs of network transfers per query. This paper presents Colo, a new, low-cost system for privacy-preserving federated graph analytics that requires minutes of cpu time and a few MiBs in network transfers, for a particular subset of queries. At the heart of Colo is a new secure computation protocol that enables a device to securely and efficiently evaluate a graph query in its local neighborhood while hiding device data, edge data, and topology data. An implementation and evaluation of Colo shows that for running a variety of COVID-19 queries over a population of 1M devices, it requires less than 8.4 minutes of a device's CPU time and 4.93 MiBs in network transfers - improvements of up to three orders of magnitude.

Submitted 2 April, 2024; originally announced April 2024.

Comments: to be published in SACMAT 2024

arXiv:2404.01475 [pdf, other]

Are large language models superhuman chemists?

Authors: Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Benedict Emoekabu, Aswanth Krishnan, Mara Wilhelmi, Macjonathan Okereke, Juliane Eberhardt, Amir Mohammad Elahi, Maximilian Greiner, Caroline T. Holick, Tanya Gupta, Mehrdad Asgari, Christina Glaubitz, Lea C. Klepsch, Yannik Köster, Jakob Meyer, Santiago Miret, Tim Hoffmann, Fabian Alexander Kreth, Michael Ringleb, Nicole Roesner, Ulrich S. Schubert, Leanne M. Stafast, Dinga Wonanke , et al. (3 additional authors not shown)

Abstract: Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals' safety profiles. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.11085 [pdf, other]

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Authors: Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna

Abstract: Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).

Submitted 21 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2402.15610 [pdf, other]

Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

Authors: Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu

Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system without increasing the error rate of the system's predictions. When the VLM makes a low-confidence prediction, instead of abstaining ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM, collects high-confidence evidences, and if enough evidence confirms the prediction the system makes a prediction instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP, and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR.

Submitted 12 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: Accepted to ACL Findings 2024

arXiv:2402.06665 [pdf, other]

The Essential Role of Causality in Foundation World Models for Embodied AI

Authors: Tarun Gupta, Wenbo Gong, Chao Ma, Nick Pawlowski, Agrin Hilmkil, Meyer Scetbon, Marc Rigter, Ade Famoti, Ashley Juan Llorens, Jianfeng Gao, Stefan Bauer, Danica Kragic, Bernhard Schölkopf, Cheng Zhang

Abstract: Recent advances in foundation models, especially in large multi-modal models and conversational agents, have ignited interest in the potential of generally capable embodied agents. Such agents will require the ability to perform new tasks in many different real-world environments. However, current foundation models fail to accurately model physical interactions and are therefore insufficient for Embodied AI. The study of causality lends itself to the construction of veridical world models, which are crucial for accurately predicting the outcomes of possible interactions. This paper focuses on the prospects of building foundation world models for the upcoming generation of embodied agents and presents a novel viewpoint on the significance of causality within these. We posit that integrating causal considerations is vital to facilitating meaningful physical interactions with the world. Finally, we demystify misconceptions about causality in this context and present our outlook for future research.

Submitted 29 April, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.10601 [pdf, other]

Influential Slot and Tag Selection in Billboard Advertisement

Authors: Dildar Ali, Tejash Gupta, Suman Banerjee, Yamuna Prasad

Abstract: The selection of influential billboard slots remains an important problem in billboard advertisements. Existing studies on this problem have not considered the case of context-specific influence probability. To bridge this gap, in this paper, we introduce the Context Dependent Influential Billboard Slot Selection Problem. First, we show that the problem is NP-hard. We also show that the influence function holds the bi-monotonicity, bi-submodularity, and non-negativity properties. We propose an orthant-wise Stochastic Greedy approach to solve this problem. We show that this method leads to a constant factor approximation guarantee. Subsequently, we propose an orthant-wise Incremental and Lazy Greedy approach. In a generic sense, this is a method for maximizing a bi-submodular function under the cardinality constraint, which may also be of independent interest. We analyze the performance guarantee of this algorithm as well as time and space complexity. The proposed solution approaches have been implemented with real-world billboard and trajectory datasets. We compare the performance of our method with many baseline methods, and the results are reported. Our proposed orthant-wise stochastic greedy approach leads to significant results when the parameters are set properly with reasonable computational overhead.

Submitted 19 January, 2024; originally announced January 2024.

Comments: 15 pages

arXiv:2312.10789 [pdf, other]

doi 10.5220/0012322100003648

Federated learning with differential privacy and an untrusted aggregator

Authors: Kunlong Liu, Trinabh Gupta

Abstract: Federated learning for training models over mobile devices is gaining popularity. Current systems for this task exhibit significant trade-offs between model accuracy, privacy guarantee, and device efficiency. For instance, Oort (OSDI 2021) provides excellent accuracy and efficiency but requires a trusted central server. On the other hand, Orchard (OSDI 2020) provides good accuracy and the rigorous guarantee of differential privacy over an untrusted server, but creates huge overhead for the devices. This paper describes Aero, a new federated learning system that significantly improves this trade-off. Aero guarantees good accuracy, differential privacy over an untrusted server, and keeps the device overhead low. The key idea of Aero is to tune system architecture and design to a specific set of popular, federated learning algorithms. This tuning requires novel optimizations and techniques, e.g., a new protocol to securely aggregate updates from devices. An evaluation of Aero demonstrates that it provides comparable accuracy to plain federated learning (without differential privacy), and it improves efficiency (CPU and network) over Orchard by up to $10^5\times$.

Submitted 17 December, 2023; originally announced December 2023.

Comments: 22 pages, 10 figures, to be published in ICISSP 2024

Journal ref: Proceedings of the 10th International Conference on Information Systems Security and Privacy ICISSP - Volume 1, 379-389, 2024

arXiv:2312.07979 [pdf]

SLJP: Semantic Extraction based Legal Judgment Prediction

Authors: Prameela Madambakam, Shathanaa Rajmohan, Himangshu Sharma, Tummepalli Anka Chandrahas Purushotham Gupta

Abstract: Legal Judgment Prediction (LJP) is a judicial assistance system that recommends legal components such as applicable statutes, prison term, and penalty term by analyzing the given input case document. The Indian legal system needs technical assistance such as artificial intelligence to clear the crores of cases that have been pending in various courts for years, a backlog that grows day by day. Most of the existing Indian models did not adequately concentrate on the semantics embedded in the fact description (FD) that impacts the decision. The proposed semantic extraction based LJP (SLJP) model uses pretrained transformers to understand complex, unstructured legal case documents and to generate embeddings. The model draws the in-depth semantics of the given FD at multiple levels, i.e., the chunk level and the case document level, by following a divide-and-conquer approach. It creates a concise view of the given fact description using the extracted semantics, following the original court case document structure, and predicts the judgment using an attention mechanism. We tested the model's performance on two available Indian datasets, the Indian Legal Documents Corpus (ILDC) and Indian Legal Statute Identification (ILSI), and obtained promising results. The model also showed the highest performance, and less performance degradation with increased epochs, compared with the base models on the ILDC dataset.

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.02976 [pdf, other]

Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

Authors: Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, Ranjay Krishna, Dustin Schwenk, Eli VanderBilt, Aniruddha Kembhavi

Abstract: Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely expensive. In this work, we show that imitating shortest-path planners in simulation produces agents that, given a language instruction, can proficiently navigate, explore, and manipulate objects in both simulation and in the real world using only RGB sensors (no depth map or GPS coordinates). This surprising result is enabled by our end-to-end, transformer-based, SPOC architecture, powerful visual encoders paired with extensive image augmentation, and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets. Our models, data, training code, and newly proposed 10-task benchmarking suite CHORES will be open-sourced.

Submitted 5 December, 2023; originally announced December 2023.

Comments: First six authors contributed equally. Project page: https://spoc-robot.github.io/

arXiv:2311.09760 [pdf, other]

Eventually Lattice-Linear Algorithms

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: Lattice-linear systems allow nodes to execute asynchronously. We introduce eventually lattice-linear algorithms, where lattices are induced only among the states in a subset of the state space. The algorithm guarantees that the system transitions to a state in one of the lattices. Then, the algorithm behaves lattice linearly while traversing to an optimal state through that lattice. We present a lattice-linear self-stabilizing algorithm for the service demand based minimal dominating set (SDMDS) problem. Using this as an example, we elaborate the working of, and define, eventually lattice-linear algorithms. Then, we present eventually lattice-linear self-stabilizing algorithms for minimal vertex cover (MVC), maximal independent set (MIS), graph colouring (GC) and 2-dominating set problems (2DS). Algorithms for SDMDS, MVC and MIS converge in 1 round plus $n$ moves (within $2n$ moves), GC in $n+4m$ moves, and 2DS in 1 round plus $2n$ moves (within $3n$ moves). These results are an improvement over the existing literature. We also present experimental results to show performance gain demonstrating the benefit of lattice-linearity.

Submitted 13 January, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: arXiv admin note: text overlap with arXiv:2109.13216

arXiv:2310.08864 [pdf, other]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Project website: https://robotics-transformer-x.github.io

arXiv:2309.14003 [pdf, other]

Hierarchical Imitation Learning for Stochastic Environments

Authors: Maximilian Igl, Punit Shah, Paul Mougin, Sirish Srinivasan, Tarun Gupta, Brandyn White, Kyriacos Shiarlis, Shimon Whiteson

Abstract: Many applications of imitation learning require the agent to generate the full distribution of behaviour observed in the training data. For example, to evaluate the safety of autonomous vehicles in simulation, accurate and diverse behaviour models of other road users are paramount. Existing methods that improve this distributional realism typically rely on hierarchical policies. These condition the policy on types such as goals or personas that give rise to multi-modal behaviour. However, such methods are often inappropriate for stochastic environments where the agent must also react to external factors: because agent types are inferred from the observed future trajectory during training, these environments require that the contributions of internal and external factors to the agent behaviour are disentangled and only internal factors, i.e., those under the agent's control, are encoded in the type. Encoding future information about external factors leads to inappropriate agent reactions during testing, when the future is unknown and types must be drawn independently from the actual future. We formalize this challenge as distribution shift in the conditional distribution of agent types under environmental stochasticity. We propose Robust Type Conditioning (RTC), which eliminates this shift with adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.

Submitted 25 September, 2023; originally announced September 2023.

Comments: Published at IROS'23

arXiv:2309.01618 [pdf, other]

Critical Behavioral Traits Foster Peer Engagement in Online Mental Health Communities

Authors: Aseem Srivastava, Tanya Gupta, Alison Cerezo, Sarah Peregrine Lord, Md Shad Akhtar, Tanmoy Chakraborty

Abstract: Online Mental Health Communities (OMHCs), such as Reddit, have witnessed a surge in popularity as go-to platforms for seeking information and support in managing mental health needs. Platforms like Reddit offer immediate interactions with peers, granting users a vital space for seeking mental health assistance. However, the largely unregulated nature of these platforms introduces intricate challenges for both users and society at large. This study explores the factors that drive peer engagement within counseling threads, aiming to enhance our understanding of this critical phenomenon. We introduce BeCOPE, a novel behavior encoded Peer counseling dataset comprising over 10,118 posts and 58,279 comments sourced from 21 mental health-specific subreddits. The dataset is annotated using three major fine-grained behavior labels: (a) intent, (b) criticism, and (c) readability, along with the emotion labels. Our analysis indicates the prominence of ``self-criticism'' as the most prevalent form of criticism expressed by help-seekers, accounting for a significant 43% of interactions. Intriguingly, we observe that individuals who explicitly express their need for help are 18.01% more likely to receive assistance compared to those who present ``surveys'' or engage in ``rants.'' Furthermore, we highlight the pivotal role of well-articulated problem descriptions, showing that superior readability effectively doubles the likelihood of receiving the sought-after support. Our study emphasizes the essential role of OMHCs in offering personalized guidance and unveils behavior-driven engagement patterns.

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2307.13080 [pdf, ps, other]

Tolerance to Asynchrony of an Algorithm for Gathering Myopic Robots on an Infinite Triangular Grid

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: In this paper, we study the problem of gathering distance-1 myopic robots on an infinite triangular grid. We show that the algorithm developed by Goswami et al. (SSS, 2022) is lattice-linear (cf. Gupta and Kulkarni, SRDS 2023). This implies that a distributed scheduler, assumed therein, is not required for this algorithm: it runs correctly in asynchrony. It also implies that the algorithm works correctly even if the robots are equipped with a unidirectional \textit{camera} to see the neighbouring robots (rather than an omnidirectional one, which would be required under a distributed scheduler). Due to lattice-linearity, we can predetermine the point of gathering. We also show that this algorithm converges in $2n$ rounds, which is lower than the complexity ($2.5(n+1)$ rounds) that was shown in Goswami et al.

Submitted 10 January, 2024; v1 submitted 24 July, 2023; originally announced July 2023.

arXiv:2307.11073 [pdf, other]

OBJECT 3DIT: Language-guided 3D-aware Image Editing

Authors: Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, Tanmay Gupta

Abstract: Existing image editing tools, while powerful, typically disregard the underlying 3D geometry from which the image is projected. As a result, edits made using these tools may become detached from the geometry and lighting conditions that are at the foundation of the image formation process. In this work, we formulate the new task of language-guided 3D-aware editing, where objects in an image should be edited according to a language instruction in the context of the underlying 3D scene. To promote progress towards this goal, we release OBJECT: a dataset consisting of 400K editing examples created from procedurally generated 3D scenes. Each example consists of an input image, an editing instruction in language, and the edited image. We also introduce 3DIT: single and multi-task models for four editing tasks. Our models show impressive abilities to understand the 3D composition of entire scenes, factoring in surrounding objects, surfaces, lighting conditions, shadows, and physically-plausible object configurations. Surprisingly, despite training only on synthetic scenes from OBJECT, the editing capabilities of 3DIT generalize to real-world images.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2304.02075 [pdf, other]

GUTS: Generalized Uncertainty-Aware Thompson Sampling for Multi-Agent Active Search

Authors: Nikhil Angad Bakshi, Tejus Gupta, Ramina Ghods, Jeff Schneider

Abstract: Robotic solutions for quick disaster response are essential to ensure minimal loss of life, especially when the search area is too dangerous or too vast for human rescuers. We model this problem as an asynchronous multi-agent active-search task where each robot aims to efficiently seek objects of interest (OOIs) in an unknown environment. This formulation addresses the requirement that search missions should focus on quick recovery of OOIs rather than full coverage of the search region. Previous approaches fail to accurately model sensing uncertainty, account for occlusions due to foliage or terrain, or consider the requirement for heterogeneous search teams and robustness to hardware and communication failures. We present the Generalized Uncertainty-aware Thompson Sampling (GUTS) algorithm, which addresses these issues and is suitable for deployment on heterogeneous multi-robot systems for active search in large unstructured environments. We show through simulation experiments that GUTS consistently outperforms existing methods such as parallelized Thompson Sampling and exhaustive search, recovering all OOIs in 80% of all runs. In contrast, existing approaches recover all OOIs in less than 40% of all runs. We conduct field tests using our multi-robot system in an unstructured environment with a search area of approximately 75,000 sq. m. Our system demonstrates robustness to various failure modes, achieving full recovery of OOIs (where feasible) in every field run, and significantly outperforming our baseline.

Submitted 4 April, 2023; originally announced April 2023.

Comments: 7 pages, 5 figures, 1 table, for associated video see: https://youtu.be/K0jkzdQ_j2E , to appear in International Conference on Robotics and Automation (ICRA) 2023

arXiv:2302.14834 [pdf, ps, other]

DAG-Inducing Problems and Algorithms

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: Consider the execution of a sequential algorithm that requires the program to converge to an optimal state, and then terminate/stutter. To design such an algorithm, we need to ensure that the state space that it traverses forms a directed acyclic graph (DAG) and its sink nodes are optimal states. However, if we run the same algorithm on multiple computing nodes running in parallel, and without synchronization, it may not reach an optimal state. In most parallel processing algorithms designed in the literature, a synchronization primitive is assumed. Synchronization ensures that the nodes read fresh values, and the execution proceeds systematically, such that the subject algorithm traverses a DAG induced among the global states. With this observation, we investigate the conditions that guarantee that the execution of an algorithm is correct even if it is executed in parallel and without synchronization. To this end, we introduce DAG-inducing problems and DAG-inducing algorithms. We show that induction of a $\prec$-DAG (induced among the global states -- that forms as a result of a partial order induced among the local states visited by individual nodes) is a necessary and sufficient condition to allow an algorithm to run in asynchrony. In the paper, we first give a comprehensive description of DAG-inducing problems and DAG-inducing algorithms, along with some simple examples. Then we show some properties of an algorithm that is tolerant to asynchrony, which include the above-mentioned condition.

Submitted 10 April, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

arXiv:2302.14139 [pdf, other]

Scalable End-to-End ML Platforms: from AutoML to Self-serve

Authors: Igor L. Markov, Pavlos A. Apostolopoulos, Mia R. Garrard, Tanya Qie, Yin Huang, Tanvi Gupta, Anika Li, Cesar Cardoso, George Han, Ryan Maghsoudian, Norm Zhou

Abstract: ML platforms help enable intelligent data-driven applications and maintain them with limited engineering effort. Upon sufficiently broad adoption, such platforms reach economies of scale that bring greater component reuse while improving efficiency of system development and maintenance. For an end-to-end ML platform with broad adoption, scaling relies on pervasive ML automation and system integration to reach the quality we term self-serve that we define with ten requirements and six optional capabilities. With this in mind, we identify long-term goals for platform development, discuss related tradeoffs and future work. Our reasoning is illustrated on two commercially-deployed end-to-end ML platforms that host hundreds of real-time use cases -- one general-purpose and one specialized.

Submitted 3 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: 10 pages, 1 figure, 2 tables

arXiv:2302.08229 [pdf, other]

Improving Spoken Language Identification with Map-Mix

Authors: Shangeth Rajaa, Kriti Anandan, Swaraj Dalmia, Tarun Gupta, Eng Siong Chng

Abstract: The pre-trained multi-lingual XLSR model generalizes well for language identification after fine-tuning on unseen languages. However, the performance significantly degrades when the languages are not very distinct from each other, for example, in the case of dialects. Low resource dialect classification remains a challenging problem to solve. We present a new data augmentation method that leverages model training dynamics of individual data points to improve sampling for latent mixup. The method works well in low-resource settings where generalization is paramount. Our datamaps-based mixup technique, which we call Map-Mix, improves weighted F1 scores by 2% compared to the random mixup baseline and results in a significantly well-calibrated model. The code for our method is open sourced on https://github.com/skit-ai/Map-Mix.

Submitted 16 February, 2023; originally announced February 2023.

Comments: Accepted at ICASSP 2023

arXiv:2302.07207 [pdf, ps, other]

Lattice Linearity of Multiplication and Modulo

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: In this paper, we study the lattice linearity of multiplication and modulo operations. We demonstrate that these operations are lattice linear and the parallel processing algorithms that we study for both these operations are able to exploit the lattice linearity of their respective problems. This implies that these algorithms can be implemented in asynchronous environments, where the nodes are allowed to read old information from each other. These algorithms also exhibit snap-stabilizing properties, i.e., starting from an arbitrary state, the sequence of state transitions made by the system strictly follows its specification.

Submitted 24 July, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2211.13793 [pdf, other]

Tensor Decomposition of Large-scale Clinical EEGs Reveals Interpretable Patterns of Brain Physiology

Authors: Teja Gupta, Neeraj Wagh, Samarth Rawal, Brent Berry, Gregory Worrell, Yogatheesan Varatharajah

Abstract: Identifying abnormal patterns in electroencephalography (EEG) remains the cornerstone of diagnosing several neurological diseases. The current clinical EEG review process relies heavily on expert visual review, which is unscalable and error-prone. In an effort to augment the expert review process, there is a significant interest in mining population-level EEG patterns using unsupervised approaches. Current approaches rely either on two-dimensional decompositions (e.g., principal and independent component analyses) or deep representation learning (e.g., auto-encoders, self-supervision). However, most approaches do not leverage the natural multi-dimensional structure of EEGs and lack interpretability. In this study, we propose a tensor decomposition approach using the canonical polyadic decomposition to discover a parsimonious set of population-level EEG patterns, retaining the natural multi-dimensional structure of EEGs (time x space x frequency). We then validate their clinical value using a cohort of patients including varying stages of cognitive impairment. Our results show that the discovered patterns reflect physiologically meaningful features and accurately classify the stages of cognitive impairment (healthy vs mild cognitive impairment vs Alzheimer's dementia) with substantially fewer features compared to classical and deep learning-based baselines. We conclude that the decomposition of population-level EEG tensors recovers expert-interpretable EEG patterns that can aid in the study of smaller specialized clinical cohorts.

Submitted 4 February, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: 4 pages, 3 Figures, 2 Tables; Accepted at IEEE NER 2023

arXiv:2211.13769 [pdf, other]

On Designing Light-Weight Object Trackers through Network Pruning: Use CNNs or Transformers?

Authors: Saksham Aggarwal, Taneesh Gupta, Pawan Kumar Sahu, Arnav Chavan, Rishabh Tiwari, Dilip K. Prasad, Deepak K. Gupta

Abstract: Object trackers deployed on low-power devices need to be light-weight; however, most of the current state-of-the-art (SOTA) methods rely on compute-heavy backbones built using CNNs or transformers. The large size of such models prevents their deployment in low-power conditions, so designing compressed variants of large tracking models is of great importance. This paper demonstrates how highly compressed light-weight object trackers can be designed using neural architectural pruning of large CNN and transformer based trackers. Further, a comparative study on architectural choices best suited to design light-weight trackers is provided. A comparison between SOTA trackers using CNNs, transformers, and the combination of the two is presented to study their stability at various compression ratios. Finally, results for extreme pruning scenarios going as low as 1% in some cases are shown to study the limits of network pruning in object tracking. This work provides deeper insights into designing highly efficient trackers from existing SOTA methods.

Submitted 26 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted at IEEE ICASSP 2023

arXiv:2211.11559 [pdf, other]

Visual Programming: Compositional visual reasoning without training

Authors: Tanmay Gupta, Aniruddha Kembhavi

Abstract: We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on 4 diverse tasks - compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue to easily and effectively expand the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.

Submitted 18 November, 2022; originally announced November 2022.

arXiv:2211.04878 [pdf, other]

Foundation Models for Semantic Novelty in Reinforcement Learning

Authors: Tarun Gupta, Peter Karkus, Tong Che, Danfei Xu, Marco Pavone

Abstract: Effectively exploring the environment is a key challenge in reinforcement learning (RL). We address this challenge by defining a novel intrinsic reward based on a foundation model, such as contrastive language image pretraining (CLIP), which can encode a wealth of domain-independent semantic visual-language knowledge about the world. Specifically, our intrinsic reward is defined based on pre-trained CLIP embeddings without any fine-tuning or learning on the target RL task. We demonstrate that CLIP-based intrinsic rewards can drive exploration towards semantically meaningful states and outperform state-of-the-art methods in challenging sparse-reward procedurally-generated environments.

Submitted 9 November, 2022; originally announced November 2022.

Comments: Foundation Models for Decision Making Workshop at Neural Information Processing Systems, 2022

arXiv:2210.03055 [pdf, other]

Inducing Lattices in Non-Lattice-Linear Problems

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: Lattice-linearity was introduced as modelling problems using predicates that induce a lattice among the global states (Garg, SPAA 2020). Such modelling enables permitting asynchronous execution in multiprocessor systems. A key property of \textit{the predicate} representing such problems is that it induces \textit{one} lattice in the state space. Such representation guarantees the execution to be correct even if nodes execute asynchronously. However, many interesting problems do not exhibit lattice-linearity. This issue was alleviated with the introduction of eventually lattice-linear algorithms (Gupta and Kulkarni, SSS 2021). They induce \textit{single or multiple} lattices in a subset of the state space even when the problem cannot be defined by a predicate under which the states form a lattice. In this paper, we focus on analyzing and differentiating between lattice-linear problems and algorithms. We introduce a new class of algorithms called \textit{fully lattice-linear algorithms}. These algorithms partition the \textit{entire} reachable state space into \textit{one or more lattices}. For illustration, we present lattice-linear self-stabilizing algorithms for minimal dominating set (MDS) and graph colouring (GC) problems, and a parallel processing lattice-linear 2-approximation algorithm for vertex cover (VC). The algorithms for MDS and GC converge in {\boldmath $n$} moves and {\boldmath $n+2m$} moves respectively. These algorithms preserve this time complexity while allowing the nodes to execute asynchronously, where these nodes may execute based on old or inconsistent information about their neighbours. The algorithm for VC is the first lattice-linear approximation algorithm for an NP-Hard problem; it converges in {\boldmath $n$} moves.

Submitted 29 July, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: arXiv admin note: text overlap with arXiv:2209.14703

arXiv:2209.14703 [pdf, ps, other]

Fully Lattice Linear Algorithms

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: This paper focuses on analyzing and differentiating between lattice linear problems and algorithms. It introduces a new class of algorithms called \textit{(fully) lattice linear algorithms}. A property of these algorithms is that they induce a partial order among all states and form \textit{multiple lattices}. An initial state locks in one of these lattices. We present a lattice linear self-stabilizing algorithm for minimal dominating set.

Submitted 10 November, 2022; v1 submitted 29 September, 2022; originally announced September 2022.

arXiv:2207.13666 [pdf, other]

SAC-AP: Soft Actor Critic based Deep Reinforcement Learning for Alert Prioritization

Authors: Lalitha Chavali, Tanay Gupta, Paresh Saxena

Abstract: Intrusion detection systems (IDS) generate a large number of false alerts, which makes it difficult to inspect true positives. Hence, alert prioritization plays a crucial role in deciding which alerts to investigate from an enormous number of alerts that are generated by IDS. Recently, the deep reinforcement learning (DRL) based deep deterministic policy gradient (DDPG) off-policy method has been shown to achieve better results for alert prioritization as compared to other state-of-the-art methods. However, DDPG is prone to the problem of overfitting. Additionally, it also has a poor exploration capability and hence it is not suitable for problems with a stochastic environment. To address these limitations, we present a soft actor-critic based DRL algorithm for alert prioritization (SAC-AP), an off-policy method, based on the maximum entropy reinforcement learning framework that aims to maximize the expected reward while also maximizing the entropy. Further, the interaction between an adversary and a defender is modeled as a zero-sum game and a double oracle framework is utilized to obtain the approximate mixed strategy Nash equilibrium (MSNE). SAC-AP finds robust alert investigation policies and computes pure strategy best response against opponent's mixed strategy. We present the overall design of SAC-AP and evaluate its performance as compared to other state-of-the-art alert prioritization methods. We consider defender's loss, i.e., the defender's inability to investigate the alerts that are triggered due to attacks, as the performance metric. Our results show that SAC-AP achieves up to 30% decrease in defender's loss as compared to the DDPG based alert prioritization method and hence provides better protection against intrusions. Moreover, the benefits are even higher when SAC-AP is compared to other traditional alert prioritization methods including Uniform, GAIN, RIO and Suricata.

Submitted 3 August, 2022; v1 submitted 27 July, 2022; originally announced July 2022.

Comments: 8 pages, 8 figures, IEEE WORLD CONGRESS ON COMPUTATIONAL INTELLIGENCE 2022

arXiv:2207.01079 [pdf, other]

DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles

Authors: Tanishq Gupta, Mohd Zaki, Devanshi Khatsuriya, Kausik Hira, N. M. Anoop Krishnan, Mausam

Abstract: A crucial component in the curation of KB for a scientific domain (e.g., materials science, foods & nutrition, fuels) is information extraction from tables in the domain's published research articles. To facilitate research in this direction, we define a novel NLP task of extracting compositions of materials (e.g., glasses) from tables in materials science papers. The task involves solving several challenges in concert, such as tables that mention compositions have highly varying structures; text in captions and full paper needs to be incorporated along with data in tables; and regular languages for numbers, chemical compounds and composition expressions must be integrated into the model. We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present a strong baseline DISCOMAT, that combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DISCOMAT outperforms recent table processing architectures by significant margins.

Submitted 28 January, 2024; v1 submitted 3 July, 2022; originally announced July 2022.

Comments: Accepted long paper at ACL 2023 (https://2023.aclweb.org/program/accepted_main_conference/)

arXiv:2206.14913 [pdf, other]

GPTs at Factify 2022: Prompt Aided Fact-Verification

Authors: Pawan Kumar Sahu, Saksham Aggarwal, Taneesh Gupta, Gyanendra Das

Abstract: One of the most pressing societal issues is the fight against false news. The false claims, as difficult as they are to expose, create a lot of damage. To tackle the problem, fact verification becomes crucial and thus has been a topic of interest among diverse research communities. Using only the textual form of data we propose our solution to the problem and achieve competitive results with other approaches. We present our solution based on two approaches - PLM (pre-trained language model) based method and Prompt based method. The PLM-based approach uses the traditional supervised learning, where the model is trained to take 'x' as input and output prediction 'y' as P(y|x). Whereas, Prompt-based learning reflects the idea to design input to fit the model such that the original objective may be re-framed as a problem of (masked) language modeling. We may further stimulate the rich knowledge provided by PLMs to better serve downstream tasks by employing extra prompts to fine-tune PLMs. Our experiments showed that the proposed method performs better than just fine-tuning PLMs. We achieved an F1 score of 0.6946 on the FACTIFY dataset and a 7th position on the competition leader-board.

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted in AAAI'22: First Workshop on Multimodal Fact-Checking and Hate Speech Detection, February 22 - March 1, 2022, Vancouver, BC, Canada

arXiv:2204.13653 [pdf, other]

GRIT: General Robust Image Task Benchmark

Authors: Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem

Abstract: Computer vision models excel at making predictions when the test distribution closely resembles the training distribution. Such models have yet to match the ability of biological vision to learn from multiple sources and generalize to new data sources and tasks. To facilitate the development and evaluation of more general vision systems, we introduce the General Robust Image Task (GRIT) benchmark. GRIT evaluates the performance, robustness, and calibration of a vision system across a variety of image prediction tasks, concepts, and data sources. The seven tasks in GRIT are selected to cover a range of visual skills: object categorization, object localization, referring expression grounding, visual question answering, segmentation, human keypoint detection, and surface normal estimation. GRIT is carefully designed to enable the evaluation of robustness under image perturbations, image source distribution shift, and concept distribution shift. By providing a unified platform for thorough assessment of skills and concepts learned by a vision model, we hope GRIT catalyzes the development of performant and robust general purpose vision systems.

Submitted 2 May, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

arXiv:2203.11774 [pdf, other]

Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model

Authors: Tarun Gupta, Duc-Tuan Truong, Tran The Anh, Chng Eng Siong

Abstract: The estimation of speaker characteristics such as age and height is a challenging task, having numerous applications in voice forensic analysis. In this work, we propose a bi-encoder transformer mixture model for speaker age and height estimation. Considering the wide differences in male and female voice characteristics such as differences in formant and fundamental frequencies, we propose the use of two separate transformer encoders for the extraction of specific voice features in the male and female gender, using wav2vec 2.0 as a common-level feature extractor. This architecture reduces the interference effects during backpropagation and improves the generalizability of the model. We perform our experiments on the TIMIT dataset and significantly outperform the current state-of-the-art results on age estimation. Specifically, we achieve root mean squared error (RMSE) of 5.54 years and 6.49 years for male and female age estimation, respectively. Further experiments evaluating the relative importance of different phonetic types for our task demonstrate that vowel sounds are the most distinguishing for age estimation.

Submitted 22 March, 2022; originally announced March 2022.

Comments: Submitted to Interspeech 2022

arXiv:2202.02317 [pdf, other]

Webly Supervised Concept Expansion for General Purpose Vision Models

Authors: Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi

Abstract: General Purpose Vision (GPV) systems are models that are designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and concepts from large fully supervised datasets. Scaling GPVs to tens of thousands of concepts by acquiring data to learn each concept for every skill quickly becomes prohibitive. This work presents an effective and inexpensive alternative: learn skills from supervised datasets, learn concepts from web image search, and leverage a key characteristic of GPVs: the ability to transfer visual knowledge across skills. We use a dataset of 1M+ images spanning 10k+ visual concepts to demonstrate webly-supervised concept expansion for two existing GPVs (GPV-1 and VL-T5) on 3 benchmarks: 5 COCO-based datasets (80 primary concepts), a newly curated series of 5 datasets based on the OpenImages and VisualGenome repositories (~500 concepts), and the Web-derived dataset (10k+ concepts). We also propose a new architecture, GPV-2 that supports a variety of tasks -- from vision tasks like classification and localization to vision+language tasks like QA and captioning, to more niche ones like human-object interaction detection. GPV-2 benefits hugely from web data and outperforms GPV-1 and VL-T5 across these benchmarks. Our data, code, and web demo are available at https://prior.allenai.org/projects/gpv2.

Submitted 20 July, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: ECCV 2022

arXiv:2202.00104 [pdf, other]

Generalization in Cooperative Multi-Agent Systems

Authors: Anuj Mahajan, Mikayel Samvelyan, Tarun Gupta, Benjamin Ellis, Mingfei Sun, Tim Rocktäschel, Shimon Whiteson

Abstract: Collective intelligence is a fundamental trait shared by several species of living organisms. It has allowed them to thrive in the diverse environmental conditions that exist on our planet. From simple organisations in an ant colony to complex systems in human groups, collective intelligence is vital for solving complex survival tasks. As is commonly observed, such natural systems are flexible to changes in their structure. Specifically, they exhibit a high degree of generalization when the abilities or the total number of agents changes within a system. We term this phenomenon as Combinatorial Generalization (CG). CG is a highly desirable trait for autonomous systems as it can increase their utility and deployability across a wide range of applications. While recent works addressing specific aspects of CG have shown impressive results on complex domains, they provide no performance guarantees when generalizing towards novel situations. In this work, we shed light on the theoretical underpinnings of CG for cooperative multi-agent systems (MAS). Specifically, we study generalization bounds under a linear dependence of the underlying dynamics on the agent capabilities, which can be seen as a generalization of Successor Features to MAS. We then extend the results first for Lipschitz and then arbitrary dependence of rewards on team capabilities. Finally, empirical analysis on various domains using the framework of multi-agent reinforcement learning highlights important desiderata for multi-agent algorithms towards ensuring CG.

Submitted 21 February, 2022; v1 submitted 31 January, 2022; originally announced February 2022.

arXiv:2110.08963 [pdf, other]

SS-MAIL: Self-Supervised Multi-Agent Imitation Learning

Authors: Akshay Dharmavaram, Tejus Gupta, Jiachen Li, Katia P. Sycara

Abstract: The current landscape of multi-agent expert imitation is broadly dominated by two families of algorithms - Behavioral Cloning (BC) and Adversarial Imitation Learning (AIL). BC approaches suffer from compounding errors, as they ignore the sequential decision-making nature of the trajectory generation problem. Furthermore, they cannot effectively model multi-modal behaviors. While AIL methods solve the issue of compounding errors and multi-modal policy training, they are plagued with instability in their training dynamics. In this work, we address this issue by introducing a novel self-supervised loss that encourages the discriminator to approximate a richer reward function. We employ our method to train a graph-based multi-agent actor-critic architecture that learns a centralized policy, conditioned on a learned latent interaction graph. We show that our method (SS-MAIL) outperforms prior state-of-the-art methods on real-world prediction tasks, as well as on custom-designed synthetic experiments. We prove that SS-MAIL is part of the family of AIL methods by providing a theoretical connection to cost-regularized apprenticeship learning. Moreover, we leverage the self-supervised formulation to introduce a novel teacher forcing-based curriculum (Trajectory Forcing) that improves sample efficiency by progressively increasing the length of the generated trajectory. The SS-MAIL framework improves multi-agent imitation capabilities by stabilizing the policy training, improving the reward shaping capabilities, as well as providing the ability for modeling multi-modal trajectories.

Submitted 17 October, 2021; originally announced October 2021.

Comments: Pre-Print

arXiv:2110.07554 [pdf, other]

Looper: An end-to-end ML platform for product decisions

Authors: Igor L. Markov, Hanson Wang, Nitya Kasturi, Shaun Singh, Sze Wai Yuen, Mia Garrard, Sarah Tran, Yin Huang, Zehui Wang, Igor Glotov, Tanvi Gupta, Boshuang Huang, Peng Chen, Xiaowen Xie, Michael Belkin, Sal Uryasev, Sam Howie, Eytan Bakshy, Norm Zhou

Abstract: Modern software systems and products increasingly rely on machine learning models to make data-driven decisions based on interactions with users, infrastructure and other systems. For broader adoption, this practice must (i) accommodate product engineers without ML backgrounds, (ii) support finegrain product-metric evaluation and (iii) optimize for product goals. To address shortcomings of prior platforms, we introduce general principles for and the architecture of an ML platform, Looper, with simple APIs for decision-making and feedback collection. Looper covers the end-to-end ML lifecycle from collecting training data and model training to deployment and inference, and extends support to personalization, causal evaluation with heterogenous treatment effects, and Bayesian tuning for product goals. During the 2021 production deployment Looper simultaneously hosted 440-1,000 ML models that made 4-6 million real-time decisions per second. We sum up experiences of platform adopters and describe their learning curve.

Submitted 21 June, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: 11 pages + references, 7 figures; to appear in KDD 2022

arXiv:2109.15290 [pdf]

MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction

Authors: Tanishq Gupta, Mohd Zaki, N. M. Anoop Krishnan, Mausam

Abstract: An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results as the models themselves may not be trained on notations and jargon that are specific to the domain. Here, we present a materials-aware language model, namely, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on science corpus, on all the tasks. Further, we discuss some of the applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.

Submitted 30 September, 2021; originally announced September 2021.

arXiv:2109.14579 [pdf]

A secure home automation prototype built on raspberry-pi

Authors: Arya Tanmay Gupta, Himani Gupta, Muskan Sharma, Priyanka Khanna

Abstract: With the development of sensors, wireless mobile communication, embedded system, the technologies of the Internet of Things have been widely used in SmartMeter, public security, intelligent building and so on. Because of its huge market prospects, the Internet of Things has been paid close attention by several governments all over the world. IoT facilitates the seamless integration of wireless sensor networks. In this paper, we present an IoT prototype that is built on Raspberry Pi and uses SMTP (simple mail transfer protocol) for communication. Through this device, we have proposed a communication system that is less complex and more secure. It integrates with any "thing" and makes it electronically communicable. We give an implementation of the prototyping system and system validation.

Submitted 8 October, 2021; v1 submitted 29 September, 2021; originally announced September 2021.

arXiv:2109.13216 [pdf, ps, other]

Extending Lattice linearity for Self-Stabilizing Algorithms

Authors: Arya Tanmay Gupta, Sandeep S Kulkarni

Abstract: In this article, we focus on extending the notion of lattice linearity to self-stabilizing programs. Lattice linearity allows a node to execute its actions with old information about the state of other nodes and still preserve correctness. It increases the concurrency of the program execution by eliminating the need for synchronization among its nodes. The extension -- denoted as eventually lattice linear algorithms -- is performed with an example of the service-demand based minimal dominating set (SDDS) problem, which is a generalization of the dominating set problem; it converges in $2n$ moves. Subsequently, we also show that the same approach could be used in various other problems including minimal vertex cover, maximal independent set and graph coloring.

Submitted 18 October, 2021; v1 submitted 27 September, 2021; originally announced September 2021.

arXiv:2108.09418 [pdf, other]

Technical Report: Using Static Analysis to Compute Benefit of Tolerating Consistency

Authors: Duong Nguyen, Arya Tanmay Gupta, Sandeep S. Kulkarni

Abstract: Synchronization is the Achilles heel of concurrent programs. Synchronization requirement is often used to ensure that the execution of the concurrent program can be serialized. Without synchronization requirement, a program suffers from consistency violations. Recently, it was shown that if programs are designed to tolerate such consistency violation faults (cvfs) then one can obtain substantial performance gain. Previous efforts to analyze the effect of cvf-tolerance are limited to run-time analysis of the program to determine if tolerating cvfs can improve the performance. Such run-time analysis is very expensive and provides limited insight. In this work, we consider the question, `Can static analysis of the program predict the benefit of cvf-tolerance?' We find that the answer to this question is affirmative. Specifically, we use static analysis to evaluate the cost of a cvf and demonstrate that it can be used to predict the benefit of cvf-tolerance. We also find that when faced with a large state space, partial analysis of the state space (via sampling) also provides the required information to predict the benefit of cvf-tolerance. Furthermore, we observe that the cvf-cost distribution is exponential in nature, i.e., the probability that a cvf has a cost of $c$ is $A \cdot B^{-c}$, where $A$ and $B$ are constants, i.e., most cvfs cause no/low perturbation whereas a small number of cvfs cause a large perturbation. This opens up new avenues to evaluate the benefit of cvf-tolerance.

Submitted 7 October, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

arXiv:2105.14331 [pdf, other]

doi 10.1109/CISS50987.2021.9400245

Foveal-pit inspired filtering of DVS spike response

Authors: Shriya T. P. Gupta, Pablo Linares-Serrano, Basabdatta Sen Bhattacharya, Teresa Serrano-Gotarredona

Abstract: In this paper, we present results of processing Dynamic Vision Sensor (DVS) recordings of visual patterns with a retinal model based on foveal-pit inspired Difference of Gaussian (DoG) filters. A DVS sensor was stimulated with varying number of vertical white and black bars of different spatial frequencies moving horizontally at a constant velocity. The output spikes generated by the DVS sensor were applied as input to a set of DoG filters inspired by the receptive field structure of the primate visual pathway. In particular, these filters mimic the receptive fields of the midget and parasol ganglion cells (spiking neurons of the retina) that sub-serve the photo-receptors of the foveal-pit. The features extracted with the foveal-pit model are used for further classification using a spiking convolutional neural network trained with a backpropagation variant adapted for spiking neural networks.

Submitted 29 May, 2021; originally announced May 2021.

Comments: 6 pages, 4 figures, 2 tables. 2021 55th Annual Conference on Information Sciences and Systems (CISS), 2021

ACM Class: I.2.10; I.4.5; I.4.10

arXiv:2105.14326 [pdf, other]

doi 10.1109/IJCNN48605.2020.9207612

Implementing a foveal-pit inspired filter in a Spiking Convolutional Neural Network: a preliminary study

Authors: Shriya T. P. Gupta, Basabdatta Sen Bhattacharya

Abstract: We have presented a Spiking Convolutional Neural Network (SCNN) that incorporates retinal foveal-pit inspired Difference of Gaussian filters and rank-order encoding. The model is trained using a variant of the backpropagation algorithm adapted to work with spiking neurons, as implemented in the Nengo library. We have evaluated the performance of our model on two publicly available datasets - one for digit recognition task, and the other for vehicle recognition task. The network has achieved up to 90% accuracy, where loss is calculated using the cross-entropy function. This is an improvement over around 57% accuracy obtained with the alternate approach of performing the classification without any kind of neural filtering. Overall, our proof-of-concept study indicates that introducing biologically plausible filtering in existing SCNN architecture will work well with noisy input images such as those in our vehicle recognition task. Based on our results, we plan to enhance our SCNN by integrating lateral inhibition-based redundancy reduction prior to rank-ordering, which will further improve the classification accuracy by the network.

Submitted 29 May, 2021; originally announced May 2021.

Comments: 8 pages, 8 figures, 4 tables. 2020 International Joint Conference on Neural Networks (IJCNN)

ACM Class: I.2.10; I.4.5; I.4.10

arXiv:2104.13446 [pdf, other]

Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients

Authors: Bozhidar Vasilev, Tarun Gupta, Bei Peng, Shimon Whiteson

Abstract: Policy gradient methods are an attractive approach to multi-agent reinforcement learning problems due to their convergence properties and robustness in partially observable scenarios. However, there is a significant performance gap between state-of-the-art policy gradient and value-based methods on the popular StarCraft Multi-Agent Challenge (SMAC) benchmark. In this paper, we introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods. We enhance two state-of-the-art policy gradient algorithms with SOP training, demonstrating significant performance improvements. Furthermore, we show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.

Submitted 6 May, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: AAMAS Adaptive and Learning Agents Workshop. 20th International Conference on Autonomous Agents and Multiagent Systems

arXiv:2104.08793 [pdf, other]

SalKG: Learning From Knowledge Graph Explanations for Commonsense Reasoning

Authors: Aaron Chan, Jiashu Xu, Boyuan Long, Soumya Sanyal, Tanishq Gupta, Xiang Ren

Abstract: Augmenting pre-trained language models with knowledge graphs (KGs) has achieved success on various commonsense reasoning tasks. However, for a given task instance, the KG, or certain parts of the KG, may not be useful. Although KG-augmented models often use attention to focus on specific KG components, the KG is still always used, and the attention mechanism is never explicitly taught which KG components should be used. Meanwhile, saliency methods can measure how much a KG feature (e.g., graph, node, path) influences the model to make the correct prediction, thus explaining which KG features are useful. This paper explores how saliency explanations can be used to improve KG-augmented models' performance. First, we propose to create coarse (Is the KG useful?) and fine (Which nodes/paths in the KG are useful?) saliency explanations. Second, to motivate saliency-based supervision, we analyze oracle KG-augmented models which directly use saliency explanations as extra inputs for guiding their attention. Third, we propose SalKG, a framework for KG-augmented models to learn from coarse and/or fine saliency explanations. Given saliency explanations created from a task's training set, SalKG jointly trains the model to predict the explanations, then solve the task by attending to KG features highlighted by the predicted explanations. On three commonsense QA benchmarks (CSQA, OBQA, CODAH) and a range of KG-augmented models, we show that SalKG can yield considerable performance gains -- up to 2.76% absolute improvement on CSQA.

Submitted 20 March, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: NeurIPS 2021

arXiv:2104.00990 [pdf, other]

Visual Semantic Role Labeling for Video Understanding

Authors: Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

Abstract: We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with $29K$ $10$-second movie clips richly annotated with a verb and semantic-roles every $2$ seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (${\sim}3K$) and have been chosen to be both complex (${\sim}4.2$ unique verbs within a video) as well as diverse (${\sim}200$ verbs have more than $100$ annotations each). We provide a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, several illustrative baselines and evaluate a range of standard video recognition models. Our code and dataset are available at vidsitu.org.

Submitted 2 April, 2021; originally announced April 2021.

Comments: CVPR21 camera-ready including appendix. Project Page at https://vidsitu.org/

arXiv:2104.00743 [pdf, other]

Towards General Purpose Vision Systems

Authors: Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem

Abstract: Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process. In this paper, we propose GPV-1, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more. We also propose evaluations of generality of architecture, skill-concept transfer, and learning efficiency that may inform future work on general purpose vision. Our experiments indicate GPV-1 is effective at multiple tasks, reuses some concept knowledge across tasks, can perform the Referring Expressions task zero-shot, and further improves upon the zero-shot performance using a few training samples.

Submitted 19 April, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: CVPR 2022 Oral; Project page: https://prior.allenai.org/projects/gpv

arXiv:2101.10814

Spread and defend infection in graphs

Authors: Arya Tanmay Gupta

Abstract: The spread of an infection, a contagion, meme, emotion, message and various other spreadable objects has been discussed in several works. Burning and firefighting have been discussed in particular on static graphs. Graph burning simulates the notion of the spread of "fire" throughout a graph (plus, one unburned node burned at each time-step); graph firefighting simulates the defending of nodes by placing firefighters on the nodes which have not been already burned while the fire is being spread (started by only a single fire source). This article studies a combination of firefighting and burning on a graph class which is a variation (generalization) of temporal graphs. Nodes can be infected from "outside" a network. We present a notion of both upgrading (of unburned nodes, similar to firefighting) and repairing (of infected nodes). The nodes which are burned, firefighted, or repaired are chosen probabilistically, so a variable number of nodes can be infected, upgraded and repaired in each time step. In the model presented in this article, both burning and firefighting proceed concurrently; we introduce such a system to enable the community to study the notion of spread of an infection and the notion of upgrade/repair against each other. The graph class that we study (on which these processes are simulated) is a variation of temporal graph class in which at each time-step, probabilistically, a communication takes place (iff an edge exists in that time step). In addition, a node can be "worn out" and thus can be removed from the network, and a new healthy node can be added to the network as well. This class of graphs enables systems with high complexity to be simulated and studied.

Submitted 16 November, 2023; v1 submitted 5 January, 2021; originally announced January 2021.

Comments: incomplete work. major revision required

arXiv:2011.09533 [pdf, other]

Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

Authors: Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, Shimon Whiteson

Abstract: Most recently developed approaches to cooperative multi-agent reinforcement learning in the centralized training with decentralized execution setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on the popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.

Submitted 18 November, 2020; originally announced November 2020.

Showing 1–50 of 76 results for author: Gupta, T