Search | arXiv e-print repository

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

Abstract: Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systemat… ▽ More Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations across task complexity. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Website: https://github.com/allenai/discoverybench

arXiv:2407.00677 [pdf, ps, other]

Combinatorial Multi-Access Coded Caching with Private Caches

Authors: Dhruv Pratap Singh, Anjana A. Mahesh, B. Sundar Rajan

Abstract: We consider a variant of the coded caching problem where users connect to two types of caches, called private and access caches. The problem setting consists of a server with a library of files and a set of access caches. Each user, equipped with a private cache, connects to a distinct $r-$subset of the access caches. The server populates both types of caches with files in uncoded format. For this… ▽ More We consider a variant of the coded caching problem where users connect to two types of caches, called private and access caches. The problem setting consists of a server with a library of files and a set of access caches. Each user, equipped with a private cache, connects to a distinct $r-$subset of the access caches. The server populates both types of caches with files in uncoded format. For this setting, we provide an achievable scheme and derive a lower bound on the number of transmissions for this scheme. We also present a lower and upper bound for the optimal worst-case rate under uncoded placement for this setting using the rates of the Maddah-Ali--Niesen scheme for dedicated and combinatorial multi-access coded caching settings, respectively. Further, we derive a lower bound on the optimal worst-case rate for any general placement policy using cut-set arguments. We also provide numerical plots comparing the rate of the proposed achievability scheme with the above bounds, from which it can be observed that the proposed scheme approaches the lower bound when the amount of memory accessed by a user is large. Finally, we discuss the optimality w.r.t worst-case rate when the system has four access caches. △ Less

Submitted 30 June, 2024; originally announced July 2024.

Comments: 13 pages and 6 figures

arXiv:2406.17576 [pdf, other]

Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations

Authors: Cheng Wang, Christopher Redino, Ryan Clark, Abdul Rahman, Sal Aguinaga, Sathvik Murli, Dhruv Nandakumar, Roland Rao, Lanxiao Huang, Daniel Radke, Edward Bowen

Abstract: Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing… ▽ More Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks. By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly, significantly streamlining traditional, manual penetration testing processes. The attack pathways revealed by the RL agent can provide valuable insights to the defense team, hel** them identify network weak points and develop more resilient defensive measures. Experimental results on a 152-host example network confirm the effectiveness of the proposed approach, demonstrating the RL agent's capability to discover and orchestrate attacks on high-value targets while evading honeyfiles (decoy files strategically placed to detect unauthorized access). △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.12998 [pdf, other]

Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Authors: Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli

Abstract: Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encod… ▽ More Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-perserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.09366 [pdf, other]

Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations

Authors: Rylan Schaeffer, Victor Lecomte, Dhruv Bhandarkar Pai, Andres Carranza, Berivan Isik, Alyssa Unell, Mikail Khona, Thomas Yerxa, Yann LeCun, SueYeon Chung, Andrey Gromov, Ravid Shwartz-Ziv, Sanmi Koyejo

Abstract: Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to impro… ▽ More Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradients steps, batch size, embedding dimension and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.01799 [pdf, other]

Online Control in Population Dynamics

Authors: Noah Golowich, Elad Hazan, Zhou Lu, Dhruv Rohatgi, Y. Jennifer Sun

Abstract: The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for control in population dynamics are often restricted to specific, noise-free dynamics,… ▽ More The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for control in population dynamics are often restricted to specific, noise-free dynamics, while real-world population changes can be complex and adversarial. To address this gap, we propose a new framework based on the paradigm of online control. We first characterize a set of linear dynamical systems that can naturally model evolving populations. We then give an efficient gradient-based controller for these systems, with near-optimal regret bounds with respect to a broad class of linear policies. Our empirical evaluations demonstrate the effectiveness of the proposed algorithm for control in population dynamics even for non-linear models such as SIR and replicator dynamics. △ Less

Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.20254 [pdf, other]

Conversational Agents to Facilitate Deliberation on Harmful Content in WhatsApp Groups

Authors: Dhruv Agarwal, Farhana Shahid, Aditya Vashistha

Abstract: WhatsApp groups have become a hotbed for the propagation of harmful content including misinformation, hate speech, polarizing content, and rumors, especially in Global South countries. Given the platform's end-to-end encryption, moderation responsibilities lie on group admins and members, who rarely contest such content. Another approach is fact-checking, which is unscalable, and can only contest… ▽ More WhatsApp groups have become a hotbed for the propagation of harmful content including misinformation, hate speech, polarizing content, and rumors, especially in Global South countries. Given the platform's end-to-end encryption, moderation responsibilities lie on group admins and members, who rarely contest such content. Another approach is fact-checking, which is unscalable, and can only contest factual content (e.g., misinformation) but not subjective content (e.g., hate speech). Drawing on recent literature, we explore deliberation -- open and inclusive discussion -- as an alternative. We investigate the role of a conversational agent in facilitating deliberation on harmful content in WhatsApp groups. We conducted semi-structured interviews with 21 Indian WhatsApp users, employing a design probe to showcase an example agent. Participants expressed the need for anonymity and recommended AI assistance to reduce the effort required in deliberation. They appreciated the agent's neutrality but pointed out the futility of deliberation in echo chamber groups. Our findings highlight design tensions for such an agent, including privacy versus group dynamics and freedom of speech in private spaces. We discuss the efficacy of deliberation using deliberative theory as a lens, compare deliberation with moderation and fact-checking, and provide design recommendations for future such systems. Ultimately, this work advances CSCW by offering insights into designing deliberative systems for combating harmful content in private group chats on social media. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: To appear at CSCW 2024

arXiv:2405.16406 [pdf, other]

SpinQuant: LLM quantization with learned rotations

Authors: Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

Abstract: Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Recent findings suggest that rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a… ▽ More Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Recent findings suggest that rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures, and find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant that optimizes (or learns) the rotation matrices with Cayley optimization on a small validation set. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-2 7B/LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by 30.2%/34.1% relative to QuaRot. △ Less

Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.11487 [pdf, other]

"Previously on ..." From Recaps to Story Summarization

Authors: Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi

Abstract: We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in t… ▽ More We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization, our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization, including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: CVPR 2024; Project page: https://katha-ai.github.io/projects/recap-story-summ/

arXiv:2405.10391 [pdf, other]

Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Authors: Anish Bhattacharya, Nishanth Rao, Dhruv Parikh, Pratik Kunapuli, Nikolai Matni, Vijay Kumar

Abstract: We demonstrate the capabilities of an attention-based end-to-end approach for high-speed quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional vision-based navigation via independent map**, p… ▽ More We demonstrate the capabilities of an attention-based end-to-end approach for high-speed quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional vision-based navigation via independent map**, planning, and control modules breaks down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end planning and control networks have shown to be effective for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer models for depth-based end-to-end control, in a photorealistic, high-physics-fidelity simulator as well as in hardware, and observe that the attention-based models are more effective as quadrotor speeds increase, while recurrent models with many layers provide smoother commands at lower speeds. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: 8 pages, 10 figures, 3 tables

arXiv:2405.05852 [pdf, other]

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

Authors: Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner

Abstract: Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used… ▽ More Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.05033 [pdf, other]

Multi-fidelity Hamiltonian Monte Carlo

Authors: Dhruv V. Patel, Jonghyun Lee, Matthew W. Farthing, Peter K. Kitanidis, Eric F. Darve

Abstract: Numerous applications in biology, statistics, science, and engineering require generating samples from high-dimensional probability distributions. In recent years, the Hamiltonian Monte Carlo (HMC) method has emerged as a state-of-the-art Markov chain Monte Carlo technique, exploiting the shape of such high-dimensional target distributions to efficiently generate samples. Despite its impressive em… ▽ More Numerous applications in biology, statistics, science, and engineering require generating samples from high-dimensional probability distributions. In recent years, the Hamiltonian Monte Carlo (HMC) method has emerged as a state-of-the-art Markov chain Monte Carlo technique, exploiting the shape of such high-dimensional target distributions to efficiently generate samples. Despite its impressive empirical success and increasing popularity, its wide-scale adoption remains limited due to the high computational cost of gradient calculation. Moreover, applying this method is impossible when the gradient of the posterior cannot be computed (for example, with black-box simulators). To overcome these challenges, we propose a novel two-stage Hamiltonian Monte Carlo algorithm with a surrogate model. In this multi-fidelity algorithm, the acceptance probability is computed in the first stage via a standard HMC proposal using an inexpensive differentiable surrogate model, and if the proposal is accepted, the posterior is evaluated in the second stage using the high-fidelity (HF) numerical solver. Splitting the standard HMC algorithm into these two stages allows for approximating the gradient of the posterior efficiently, while producing accurate posterior samples by using HF numerical solvers in the second stage. We demonstrate the effectiveness of this algorithm for a range of problems, including linear and nonlinear Bayesian inverse problems with in-silico data and experimental data. The proposed algorithm is shown to seamlessly integrate with various low-fidelity and HF models, priors, and datasets. Remarkably, our proposed method outperforms the traditional HMC algorithm in both computational and statistical efficiency by several orders of magnitude, all while retaining or improving the accuracy in computed posterior statistics. △ Less

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.01394 [pdf, other]

Analysis of a Modular Autonomous Driving Architecture: The Top Submission to CARLA Leaderboard 2.0 Challenge

Authors: Weize Zhang, Mohammed Elmahgiubi, Kasra Rezaee, Behzad Khamidehi, Hamidreza Mirkhani, Fazel Arasteh, Chunlin Li, Muhammad Ahsan Kaleem, Eduardo R. Corral-Soto, Dhruv Sharma, Tongtong Cao

Abstract: In this paper we present the architecture of the Kyber-E2E submission to the map track of CARLA Leaderboard 2.0 Autonomous Driving (AD) challenge 2023, which achieved first place. We employed a modular architecture for our solution consists of five main components: sensing, localization, perception, tracking/prediction, and planning/control. Our solution leverages state-of-the-art language-assiste… ▽ More In this paper we present the architecture of the Kyber-E2E submission to the map track of CARLA Leaderboard 2.0 Autonomous Driving (AD) challenge 2023, which achieved first place. We employed a modular architecture for our solution consists of five main components: sensing, localization, perception, tracking/prediction, and planning/control. Our solution leverages state-of-the-art language-assisted perception models to help our planner perform more reliably in highly challenging traffic scenarios. We use open-source driving datasets in conjunction with Inverse Reinforcement Learning (IRL) to enhance the performance of our motion planner. We provide insight into our design choices and trade-offs made to achieve this solution. We also explore the impact of each component in the overall performance of our solution, with the intent of providing a guideline where allocation of resources can have the greatest impact. △ Less

Submitted 21 March, 2024; originally announced May 2024.

arXiv:2404.11819 [pdf, other]

Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement

Authors: Pushkar Shukla, Dhruv Srikanth, Lee Cohen, Matthew Turk

Abstract: We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adv… ▽ More We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that post-training, the decisions made by the model are less dependent on the sensitive attribute and our model better disentangles the relationship between sensitive attributes and classification variables. △ Less

Submitted 27 June, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.11325 [pdf, ps, other]

On Learning Parities with Dependent Noise

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: In this expository note we show that the learning parities with noise (LPN) assumption is robust to weak dependencies in the noise distribution of small batches of samples. This provides a partial converse to the linearization technique of [AG11]. The material in this note is drawn from a recent work by the authors [GMR24], where the robustness guarantee was a key component in a cryptographic sepa… ▽ More In this expository note we show that the learning parities with noise (LPN) assumption is robust to weak dependencies in the noise distribution of small batches of samples. This provides a partial converse to the linearization technique of [AG11]. The material in this note is drawn from a recent work by the authors [GMR24], where the robustness guarantee was a key component in a cryptographic separation between reinforcement learning and supervised learning. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: This note draws heavily from arXiv:2404.03774

arXiv:2404.08513 [pdf, other]

Adversarial Imitation Learning via Boosting

Authors: Jonathan D. Chang, Dhruv Sreenivas, Yingbing Huang, Kianté Brantley, Wen Sun

Abstract: Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al.,, 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective… ▽ More Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al.,, 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. Instead in this work, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies are properly discounted with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, DAC outperforms ValueDICE and IQ-Learn (Gary et al., 2021), achieving competitive performance with as little as one expert trajectory. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: 19 pages, 7 figures, 4 tables, 3 algorithms, ICLR 2024

arXiv:2404.06723 [pdf, other]

Global Contrastive Training for Multimodal Electronic Health Records with Language Supervision

Authors: Yingbo Ma, Suraj Kolla, Zhenhong Hu, Dhruv Kaliraman, Victoria Nolan, Ziyuan Guan, Yuanfang Ren, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Jeremy A. Balch, Tyler J. Loftus, Parisa Rashidi, Azra Bihorac, Benjamin Shickel

Abstract: Modern electronic health records (EHRs) hold immense promise in tracking personalized patient health trajectories through sequential deep learning, owing to their extensive breadth, scale, and temporal granularity. Nonetheless, how to effectively leverage multiple modalities from EHRs poses significant challenges, given its complex characteristics such as high dimensionality, multimodality, sparsi… ▽ More Modern electronic health records (EHRs) hold immense promise in tracking personalized patient health trajectories through sequential deep learning, owing to their extensive breadth, scale, and temporal granularity. Nonetheless, how to effectively leverage multiple modalities from EHRs poses significant challenges, given its complex characteristics such as high dimensionality, multimodality, sparsity, varied recording frequencies, and temporal irregularities. To this end, this paper introduces a novel multimodal contrastive learning framework, specifically focusing on medical time series and clinical notes. To tackle the challenge of sparsity and irregular time intervals in medical time series, the framework integrates temporal cross-attention transformers with a dynamic embedding and tokenization scheme for learning multimodal feature representations. To harness the interconnected relationships between medical time series and clinical notes, the framework equips a global contrastive loss, aligning a patient's multimodal feature representations with the corresponding discharge summaries. Since discharge summaries uniquely pertain to individual patients and represent a holistic view of the patient's hospital stay, machine learning models are led to learn discriminative multimodal features via global contrasting. Extensive experiments with a real-world EHR dataset demonstrated that our framework outperformed state-of-the-art approaches on the exemplar task of predicting the occurrence of nine postoperative complications for more than 120,000 major inpatient surgeries using multimodal data from UF health system split among three hospitals (UF Health Gainesville, UF Health Jacksonville, and UF Health Jacksonville-North). △ Less

Submitted 10 April, 2024; originally announced April 2024.

Comments: 12 pages, 3 figures. arXiv admin note: text overlap with arXiv:2403.04012

arXiv:2404.06609 [pdf, other]

GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Authors: Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi

Abstract: The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more… ▽ More The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.04603 [pdf, ps, other]

Analyzing LLM Usage in an Advanced Computing Class in India

Authors: Chaitanya Arora, Utkarsh Venaik, Pavit Singh, Sahil Goyal, Jatin Tyagi, Shyama Goel, Ujjwal Singhal, Dhruv Kumar

Abstract: This paper investigates the usage patterns of undergraduate and graduate students when engaging with large language models (LLMs) to tackle programming assignments in the context of advanced computing courses. Existing work predominantly focuses on the influence of LLMs in introductory programming contexts. Additionally, there is a scarcity of studies analyzing actual conversations between student… ▽ More This paper investigates the usage patterns of undergraduate and graduate students when engaging with large language models (LLMs) to tackle programming assignments in the context of advanced computing courses. Existing work predominantly focuses on the influence of LLMs in introductory programming contexts. Additionally, there is a scarcity of studies analyzing actual conversations between students and LLMs. Our study provides a comprehensive quantitative and qualitative analysis of raw interactions between students and LLMs within an advanced computing course (Distributed Systems) at an Indian University. We further complement this by conducting student interviews to gain deeper insights into their usage patterns. Our study shows that students make use of large language models (LLMs) in various ways: generating code or debugging code by identifying and fixing errors. They also copy and paste assignment descriptions into LLM interfaces for specific solutions, ask conceptual questions about complex programming ideas or theoretical concepts, and generate test cases to check code functionality and robustness. Our analysis includes over 4,000 prompts from 411 students and conducting interviews with 10 students. Our analysis shows that LLMs excel at generating boilerplate code and assisting in debugging, while students handle the integration of components and system troubleshooting. This aligns with the learning objectives of advanced computing courses, which are oriented towards teaching students how to build systems and troubleshoot, with less emphasis on generating code from scratch. Therefore, LLM tools can be leveraged to increase student productivity, as shown by the data we collected. This study contributes to the ongoing discussion on LLM use in education, advocating for their usefulness in advanced computing courses to complement higher-level learning and productivity. △ Less

Submitted 6 April, 2024; originally announced April 2024.

Comments: Under review: 12 pages

arXiv:2404.04527 [pdf, other]

VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA

Authors: Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, Carl Busart

Abstract: Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive trai… ▽ More Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive training data to generalize well due to their low locality; the standard SAR datasets, however, have a limited number of labeled training data which reduces the learning capability of ViTs; (2) ViTs have a high parameter count and are computation intensive which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without any pre-training by utilizing the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules. We directly train this model on SAR datasets which have limited training samples to evaluate its effectiveness for SAR ATR applications. We evaluate our proposed model, that we call VTR (ViT for SAR ATR), on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Further, we propose a novel FPGA accelerator for VTR, in order to enable deployment for real-time SAR ATR applications. △ Less

Submitted 6 April, 2024; originally announced April 2024.

Comments: SPIE DCS 2024

arXiv:2404.03774 [pdf, other]

Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free explorati… ▽ More Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck. △ Less

Submitted 4 April, 2024; originally announced April 2024.

Comments: 112 pages, 3 figures

arXiv:2404.01413 [pdf, other]

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Authors: Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration… ▽ More The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs. △ Less

Submitted 29 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.14047 [pdf, other]

Accelerating ViT Inference on FPGA through Static and Dynamic Pruning

Authors: Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna

Abstract: Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamic… ▽ More Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamically reduces the computation based on the input. Combining these two techniques should significantly reduce computation complexity and model size; however, naively integrating them results in irregular computation patterns, leading to significant accuracy drops and difficulties in hardware acceleration. Addressing the above challenges, we propose a comprehensive algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous pruning -combining static weight pruning and dynamic token pruning. For algorithm design, we systematically combine a hardware-aware structured block-pruning method for pruning model parameters and a dynamic token pruning method for removing unimportant token vectors. Moreover, we design a novel training algorithm to recover the model's accuracy. For hardware design, we develop a novel hardware accelerator for executing the pruned model. The proposed hardware design employs multi-level parallelism with load balancing strategy to efficiently deal with the irregular computation pattern led by the two pruning approaches. Moreover, we develop an efficient hardware mechanism for efficiently executing the on-the-fly token pruning. △ Less

Submitted 12 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: FCCM 2024

arXiv:2403.12173 [pdf, other]

TnT-LLM: Text Mining at Scale with Large Language Models

Authors: Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yu** Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, Nagu Rangan

Abstract: Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. Thi… ▽ More Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: 9 pages main content, 8 pages references and appendix

arXiv:2403.08956 [pdf]

doi 10.1109/INCET54531.2022.9825164

AI coach for badminton

Authors: Dhruv Toshniwal, Arpit Patil, Nancy Vachhani

Abstract: In the competitive realm of sports, optimal performance necessitates rigorous management of nutrition and physical conditioning. Specifically, in badminton, the agility and precision required make it an ideal candidate for motion analysis through video analytics. This study leverages advanced neural network methodologies to dissect video footage of badminton matches, aiming to extract detailed ins… ▽ More In the competitive realm of sports, optimal performance necessitates rigorous management of nutrition and physical conditioning. Specifically, in badminton, the agility and precision required make it an ideal candidate for motion analysis through video analytics. This study leverages advanced neural network methodologies to dissect video footage of badminton matches, aiming to extract detailed insights into player kinetics and biomechanics. Through the analysis of stroke mechanics, including hand-hip coordination, leg positioning, and the execution angles of strokes, the research aims to derive predictive models that can suggest improvements in stance, technique, and muscle orientation. These recommendations are designed to mitigate erroneous techniques, reduce the risk of joint fatigue, and enhance overall performance. Utilizing a vast array of data available online, this research correlates players' physical attributes with their in-game movements to identify muscle activation patterns during play. The goal is to offer personalized training and nutrition strategies that align with the specific biomechanical demands of badminton, thereby facilitating targeted performance enhancements. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: 7 pages, 11 figures. https://ieeexplore.ieee.org/document/9825164

Journal ref: 2022 3rd International Conference for Emerging Technology (INCET), Belgaum, India, 2022, pp. 1-7

arXiv:2403.08283 [pdf, other]

Optimized Detection and Classification on GTRSB: Advancing Traffic Sign Recognition with Convolutional Neural Networks

Authors: Dhruv Toshniwal, Saurabh Loya, Anuj Khot, Yash Marda

Abstract: In the rapidly evolving landscape of transportation, the proliferation of automobiles has made road traffic more complex, necessitating advanced vision-assisted technologies for enhanced safety and navigation. These technologies are imperative for providing critical traffic sign information, influencing driver behavior, and supporting vehicle control, especially for drivers with disabilities and i… ▽ More In the rapidly evolving landscape of transportation, the proliferation of automobiles has made road traffic more complex, necessitating advanced vision-assisted technologies for enhanced safety and navigation. These technologies are imperative for providing critical traffic sign information, influencing driver behavior, and supporting vehicle control, especially for drivers with disabilities and in the burgeoning field of autonomous vehicles. Traffic sign detection and recognition have emerged as key areas of research due to their essential roles in ensuring road safety and compliance with traffic regulations. Traditional computer vision methods have faced challenges in achieving optimal accuracy and speed due to real-world variabilities. However, the advent of deep learning and Convolutional Neural Networks (CNNs) has revolutionized this domain, offering solutions that significantly surpass previous capabilities in terms of speed and reliability. This paper presents an innovative approach leveraging CNNs that achieves an accuracy of nearly 96\%, highlighting the potential for even greater precision through advanced localization techniques. Our findings not only contribute to the ongoing advancement of traffic sign recognition technology but also underscore the critical impact of these developments on road safety and the future of autonomous driving. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: 8 pages, 8 figures, 1 table

arXiv:2403.08195 [pdf, ps, other]

Efficiently verifiable quantum advantage on near-term analog quantum simulators

Authors: Zhenning Liu, Dhruv Devulapalli, Dominik Hangleiter, Yi-Kai Liu, Alicia J. Kollár, Alexey V. Gorshkov, Andrew M. Childs

Abstract: Existing schemes for demonstrating quantum computational advantage are subject to various practical restrictions, including the hardness of verification and challenges in experimental implementation. Meanwhile, analog quantum simulators have been realized in many experiments to study novel physics. In this work, we propose a quantum advantage protocol based on single-step Feynman-Kitaev verificati… ▽ More Existing schemes for demonstrating quantum computational advantage are subject to various practical restrictions, including the hardness of verification and challenges in experimental implementation. Meanwhile, analog quantum simulators have been realized in many experiments to study novel physics. In this work, we propose a quantum advantage protocol based on single-step Feynman-Kitaev verification of an analog quantum simulation, in which the verifier need only run an $O(λ^2)$-time classical computation, and the prover need only prepare $O(1)$ samples of a history state and perform $O(λ^2)$ single-qubit measurements, for a security parameter $λ$. We also propose a near-term feasible strategy for honest provers and discuss potential experimental realizations. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: 20 pages, 6 figures

arXiv:2403.04012 [pdf, other]

Temporal Cross-Attention for Dynamic Embedding and Tokenization of Multimodal Electronic Health Records

Authors: Yingbo Ma, Suraj Kolla, Dhruv Kaliraman, Victoria Nolan, Zhenhong Hu, Ziyuan Guan, Yuanfang Ren, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Tyler J. Loftus, Parisa Rashidi, Azra Bihorac, Benjamin Shickel

Abstract: The breadth, scale, and temporal granularity of modern electronic health records (EHR) systems offers great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, a… ▽ More The breadth, scale, and temporal granularity of modern electronic health records (EHR) systems offers great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, and timestamp duplication when multiple measurements are recorded simultaneously. Although recent efforts to fuse structured EHR and unstructured clinical notes suggest the potential for more accurate prediction of clinical outcomes, less focus has been placed on EHR embedding approaches that directly address temporal EHR challenges by learning time-aware representations from multimodal patient time series. In this paper, we introduce a dynamic embedding and tokenization framework for precise representation of multimodal clinical time series that combines novel methods for encoding time and sequential position with temporal cross-attention. Our embedding and tokenization framework, when integrated into a multitask transformer classifier with sliding window attention, outperformed baseline approaches on the exemplar task of predicting the occurrence of nine postoperative complications of more than 120,000 major inpatient surgeries using multimodal data from three hospitals and two academic health centers in the United States. △ Less

Submitted 1 April, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

Comments: ICLR 2024 Workshop on Learning From Time Series for Health. 10 pages, 3 figures

arXiv:2403.00991 [pdf, other]

SELFI: Autonomous Self-Improvement with Reinforcement Learning for Social Navigation

Authors: Noriaki Hirose, Dhruv Shah, Kyle Stachowicz, Ajay Sridhar, Sergey Levine

Abstract: Autonomous self-improving robots that interact and improve with experience are key to the real-world deployment of robotic systems. In this paper, we propose an online learning method, SELFI, that leverages online robot experience to rapidly fine-tune pre-trained control policies efficiently. SELFI applies online model-free reinforcement learning on top of offline model-based learning to bring out… ▽ More Autonomous self-improving robots that interact and improve with experience are key to the real-world deployment of robotic systems. In this paper, we propose an online learning method, SELFI, that leverages online robot experience to rapidly fine-tune pre-trained control policies efficiently. SELFI applies online model-free reinforcement learning on top of offline model-based learning to bring out the best parts of both learning paradigms. Specifically, SELFI stabilizes the online learning process by incorporating the same model-based learning objective from offline pre-training into the Q-values learned with online model-free reinforcement learning. We evaluate SELFI in multiple real-world environments and report improvements in terms of collision avoidance, as well as more socially compliant behavior, measured by a human user study. SELFI enables us to quickly learn useful robotic behaviors with less human interventions such as pre-emptive behavior for the pedestrians, collision avoidance for small and transparent objects, and avoiding travel on uneven floor surfaces. We provide supplementary videos to demonstrate the performance of our fine-tuned policy on our project page. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: 11pages, 13 figures, 2 tables

arXiv:2402.19432 [pdf, other]

Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation

Authors: Jonathan Yang, Catherine Glossop, Arjun Bhorkar, Dhruv Shah, Quan Vuong, Chelsea Finn, Dorsa Sadigh, Sergey Levine

Abstract: Recent years in robotics and imitation learning have shown remarkable progress in training large-scale foundation models by leveraging data across a multitude of embodiments. The success of such policies might lead us to wonder: just how diverse can the robots in the training set be while still facilitating positive transfer? In this work, we study this question in the context of heterogeneous emb… ▽ More Recent years in robotics and imitation learning have shown remarkable progress in training large-scale foundation models by leveraging data across a multitude of embodiments. The success of such policies might lead us to wonder: just how diverse can the robots in the training set be while still facilitating positive transfer? In this work, we study this question in the context of heterogeneous embodiments, examining how even seemingly very different domains, such as robotic navigation and manipulation, can provide benefits when included in the training data for the same model. We train a single goal-conditioned policy that is capable of controlling robotic arms, quadcopters, quadrupeds, and mobile bases. We then investigate the extent to which transfer can occur across navigation and manipulation on these embodiments by framing them as a single goal-reaching task. We find that co-training with navigation data can enhance robustness and performance in goal-conditioned manipulation with a wrist-mounted camera. We then deploy our policy trained only from navigation-only and static manipulation-only data on a mobile manipulator, showing that it can control a novel embodiment in a zero-shot manner. These results provide evidence that large-scale robotic policies can benefit from data collected across various embodiments. Further information and robot videos can be found on our project website http://extreme-cross-embodiment.github.io. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: 16 pages, 9 figures

MSC Class: 68T40 ACM Class: I.2.9

arXiv:2402.16472 [pdf, other]

mEdIT: Multilingual Text Editing via Instruction Tuning

Authors: Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

Abstract: We introduce mEdIT, a multi-lingual extension to CoEdIT -- the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such… ▽ More We introduce mEdIT, a multi-lingual extension to CoEdIT -- the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such as Grammatik korrigieren (German) or Parafrasee la oración (Spanish). We build mEdIT by curating data from multiple publicly available human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance on many multi-lingual text editing benchmarks against other multilingual LLMs. We also find that mEdIT generalizes effectively to new languages over multilingual baselines. We publicly release our data, code, and trained models at https://github.com/vipulraheja/medit. △ Less

Submitted 17 April, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: Accepted to NAACL 2024 (Main). 23 pages, 8 tables, 11 figures

ACM Class: I.2.7

arXiv:2402.15409 [pdf, other]

Lasso with Latents: Efficient Estimation, Covariate Rescaling, and Computational-Statistical Gaps

Authors: Jonathan Kelner, Frederic Koehler, Raghu Meka, Dhruv Rohatgi

Abstract: It is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impo… ▽ More It is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impossible to close this gap in general. In this work, we propose a natural sparse linear regression setting where strong correlations between covariates arise from unobserved latent variables. In this setting, we analyze the problem caused by strong correlations and design a surprisingly simple fix. While Lasso with standard normalization of covariates fails, there exists a heterogeneous scaling of the covariates with which Lasso will suddenly obtain strong provable guarantees for estimation. Moreover, we design a simple, efficient procedure for computing such a "smart scaling." The sample complexity of the resulting "rescaled Lasso" algorithm incurs (in the worst case) quadratic dependence on the sparsity of the underlying signal. While this dependence is not information-theoretically necessary, we give evidence that it is optimal among the class of polynomial-time algorithms, via the method of low-degree polynomials. This argument reveals a new connection between sparse linear regression and a special version of sparse PCA with a near-critical negative spike. The latter problem can be thought of as a real-valued analogue of learning a sparse parity. Using it, we also establish the first computational-statistical gap for the closely related problem of learning a Gaussian Graphical Model. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.13610 [pdf, other]

Data-driven Discovery with Large Generative Models

Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark

Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a se… ▽ More With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility. △ Less

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.10202 [pdf, other]

Bridging Associative Memory and Probabilistic Modeling

Authors: Rylan Schaeffer, Nika Zahedi, Mikail Khona, Dhruv Pai, Sang Truong, Yilun Du, Mitchell Ostrow, Sarthak Chandra, Andres Carranza, Ila Rani Fiete, Andrey Gromov, Sanmi Koyejo

Abstract: Associative memory and probabilistic modeling are two fundamental topics in artificial intelligence. The first studies recurrent neural networks designed to denoise, complete and retrieve data, whereas the second studies learning and sampling from probability distributions. Based on the observation that associative memory's energy functions can be seen as probabilistic modeling's negative log like… ▽ More Associative memory and probabilistic modeling are two fundamental topics in artificial intelligence. The first studies recurrent neural networks designed to denoise, complete and retrieve data, whereas the second studies learning and sampling from probability distributions. Based on the observation that associative memory's energy functions can be seen as probabilistic modeling's negative log likelihoods, we build a bridge between the two that enables useful flow of ideas in both directions. We showcase four examples: First, we propose new energy-based models that flexibly adapt their energy functions to new in-context datasets, an approach we term \textit{in-context learning of energy functions}. Second, we propose two new associative memory models: one that dynamically creates new memories as necessitated by the training data using Bayesian nonparametrics, and another that explicitly computes proportional memory assignments using the evidence lower bound. Third, using tools from associative memory, we analytically and numerically characterize the memory capacity of Gaussian kernel density estimators, a widespread tool in probababilistic modeling. Fourth, we study a widespread implementation choice in transformers -- normalization followed by self attention -- to show it performs clustering on the hypersphere. Altogether, this work urges further exchange of useful ideas between these two continents of artificial intelligence. △ Less

Submitted 13 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.10051 [pdf, other]

SwissNYF: Tool Grounded LLM Agents for Black Box Setting

Authors: Somnath Sendhil Kumar, Dhruv Jain, Eshaan Agarwal, Raunak Pandey

Abstract: While Large Language Models (LLMs) have demonstrated enhanced capabilities in function-calling, these advancements primarily rely on accessing the functions' responses. This methodology is practical for simpler APIs but faces scalability issues with irreversible APIs that significantly impact the system, such as a database deletion API. Similarly, processes requiring extensive time for each API ca… ▽ More While Large Language Models (LLMs) have demonstrated enhanced capabilities in function-calling, these advancements primarily rely on accessing the functions' responses. This methodology is practical for simpler APIs but faces scalability issues with irreversible APIs that significantly impact the system, such as a database deletion API. Similarly, processes requiring extensive time for each API call and those necessitating forward planning, like automated action pipelines, present complex challenges. Furthermore, scenarios often arise where a generalized approach is needed because algorithms lack direct access to the specific implementations of these functions or secrets to use them. Traditional tool planning methods are inadequate in these cases, compelling the need to operate within black-box environments. Unlike their performance in tool manipulation, LLMs excel in black-box tasks, such as program synthesis. Therefore, we harness the program synthesis capabilities of LLMs to strategize tool usage in black-box settings, ensuring solutions are verified prior to implementation. We introduce TOPGUN, an ingeniously crafted approach leveraging program synthesis for black box tool planning. Accompanied by SwissNYF, a comprehensive suite that integrates black-box algorithms for planning and verification tasks, addressing the aforementioned challenges and enhancing the versatility and effectiveness of LLMs in complex API interactions. The public code for SwissNYF is available at https://github.com/iclr-dummy-user/SwissNYF. △ Less

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.09811 [pdf, other]

TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming

Authors: Dhruv Kudale, Badri Vishal Kasuba, Venkatapathy Subramanian, Parag Chaudhuri, Ganesh Ramakrishnan

Abstract: Several recent deep learning (DL) based techniques perform considerably well on image-based multilingual text detection. However, their performance relies heavily on the availability and quality of training data. There are numerous types of page-level document images consisting of information in several modalities, languages, fonts, and layouts. This makes text detection a challenging problem in t… ▽ More Several recent deep learning (DL) based techniques perform considerably well on image-based multilingual text detection. However, their performance relies heavily on the availability and quality of training data. There are numerous types of page-level document images consisting of information in several modalities, languages, fonts, and layouts. This makes text detection a challenging problem in the field of computer vision (CV), especially for low-resource or handwritten languages. Furthermore, there is a scarcity of word-level labeled data for text detection, especially for multilingual settings and Indian scripts that incorporate both printed and handwritten text. Conventionally, Indian script text detection requires training a DL model on plenty of labeled data, but to the best of our knowledge, no relevant datasets are available. Manual annotation of such data requires a lot of time, effort, and expertise. In order to solve this problem, we propose TEXTRON, a Data Programming-based approach, where users can plug various text detection methods into a weak supervision-based learning framework. One can view this approach to multilingual text detection as an ensemble of different CV-based techniques and DL approaches. TEXTRON can leverage the predictions of DL models pre-trained on a significant amount of language data in conjunction with CV-based methods to improve text detection in other languages. We demonstrate that TEXTRON can improve the detection performance for documents written in Indian languages, despite the absence of corresponding labeled data. Further, through extensive experimentation, we show improvement brought about by our approach over the current State-of-the-art (SOTA) models, especially for handwritten Devanagari text. Code and dataset has been made available at https://github.com/IITB-LEAP-OCR/TEXTRON △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: Accepted at the WACV 2024 Conference

arXiv:2402.09200 [pdf, other]

doi 10.1109/SoutheastCon52093.2024.10500045

Discovering Command and Control (C2) Channels on Tor and Public Networks Using Reinforcement Learning

Authors: Cheng Wang, Christopher Redino, Abdul Rahman, Ryan Clark, Daniel Radke, Tyler Cody, Dhruv Nandakumar, Edward Bowen

Abstract: Command and control (C2) channels are an essential component of many types of cyber attacks, as they enable attackers to remotely control their malware-infected machines and execute harmful actions, such as propagating malicious code across networks, exfiltrating confidential data, or initiating distributed denial of service (DDoS) attacks. Identifying these C2 channels is therefore crucial in hel… ▽ More Command and control (C2) channels are an essential component of many types of cyber attacks, as they enable attackers to remotely control their malware-infected machines and execute harmful actions, such as propagating malicious code across networks, exfiltrating confidential data, or initiating distributed denial of service (DDoS) attacks. Identifying these C2 channels is therefore crucial in hel** to mitigate and prevent cyber attacks. However, identifying C2 channels typically involves a manual process, requiring deep knowledge and expertise in cyber operations. In this paper, we propose a reinforcement learning (RL) based approach to automatically emulate C2 attack campaigns using both the normal (public) and the Tor networks. In addition, payload size and network firewalls are configured to simulate real-world attack scenarios. Results on a typical network configuration show that the RL agent can automatically discover resilient C2 attack paths utilizing both Tor-based and conventional communication channels, while also bypassing network firewalls. △ Less

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.07118 [pdf, other]

Next-Generation Teleophthalmology: AI-enabled Quality Assessment Aiding Remote Smartphone-based Consultation

Authors: Dhruv Srikanth, Jayang Gurung, N Satya Deepika, Vineet Joshi, Pravin Vaddavalli, Soumya Jana

Abstract: Blindness and other eye diseases are a global health concern, particularly in low- and middle-income countries like India. In this regard, during the COVID-19 pandemic, teleophthalmology became a lifeline, and the Grabi attachment for smartphone-based eye imaging gained in use. However, quality of user-captured image often remained inadequate, requiring clinician vetting and delays. In this backdr… ▽ More Blindness and other eye diseases are a global health concern, particularly in low- and middle-income countries like India. In this regard, during the COVID-19 pandemic, teleophthalmology became a lifeline, and the Grabi attachment for smartphone-based eye imaging gained in use. However, quality of user-captured image often remained inadequate, requiring clinician vetting and delays. In this backdrop, we propose an AI-based quality assessment system with instant feedback mimicking clinicians' judgments and tested on patient-captured images. Dividing the complex problem hierarchically, here we tackle a nontrivial part, and demonstrate a proof of the concept. △ Less

Submitted 11 February, 2024; originally announced February 2024.

Comments: 4 pages, Submitted to IEEE EMBC 2024

arXiv:2402.04914 [pdf, other]

Personalized Text Generation with Fine-Grained Linguistic Control

Authors: Bashar Alhafni, Vivek Kulkarni, Dhruv Kumar, Vipul Raheja

Abstract: As the text generation capabilities of large language models become increasingly prominent, recent studies have focused on controlling particular aspects of the generated text to make it more personalized. However, most research on controllable text generation focuses on controlling the content or modeling specific high-level/coarse-grained attributes that reflect authors' writing styles, such as… ▽ More As the text generation capabilities of large language models become increasingly prominent, recent studies have focused on controlling particular aspects of the generated text to make it more personalized. However, most research on controllable text generation focuses on controlling the content or modeling specific high-level/coarse-grained attributes that reflect authors' writing styles, such as formality, domain, or sentiment. In this paper, we focus on controlling fine-grained attributes spanning multiple linguistic dimensions, such as lexical and syntactic attributes. We introduce a novel benchmark to train generative models and evaluate their ability to generate personalized text based on multiple fine-grained linguistic attributes. We systematically investigate the performance of various large language models on our benchmark and draw insights from the factors that impact their performance. We make our code, data, and pretrained models publicly available. △ Less

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.01687 [pdf, ps, other]

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Authors: Vibhor Agarwal, Madhav Krishan Garg, Sahiti Dharmavaram, Dhruv Kumar

Abstract: This study evaluates the effectiveness of various large language models (LLMs) in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most ef… ▽ More This study evaluates the effectiveness of various large language models (LLMs) in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most effective for different tasks. Our research systematically assesses some of the publicly available LLMs such as Google Bard, ChatGPT(3.5), GitHub Copilot Chat, and Microsoft Copilot across diverse tasks commonly encountered by undergraduate computer science students in India. These tasks include code explanation and documentation, solving class assignments, technical interview preparation, learning new concepts and frameworks, and email writing. Evaluation for these tasks was carried out by pre-final year and final year undergraduate computer science students and provides insights into the models' strengths and limitations. This study aims to guide students as well as instructors in selecting suitable LLMs for any specific task and offers valuable insights on how LLMs can be used constructively by students and instructors. △ Less

Submitted 3 April, 2024; v1 submitted 22 January, 2024; originally announced February 2024.

Comments: Under review

arXiv:2401.15605 [pdf, other]

AI as a Medical Ally: Evaluating ChatGPT's Usage and Impact in Indian Healthcare

Authors: Aryaman Raina, Prateek Mishra, Harshit goyal, Dhruv Kumar

Abstract: This study investigates the integration and impact of Large Language Models (LLMs), like ChatGPT, in India's healthcare sector. Our research employs a dual approach, engaging both general users and medical professionals through surveys and interviews respectively. Our findings reveal that healthcare professionals value ChatGPT in medical education and preliminary clinical settings, but exercise ca… ▽ More This study investigates the integration and impact of Large Language Models (LLMs), like ChatGPT, in India's healthcare sector. Our research employs a dual approach, engaging both general users and medical professionals through surveys and interviews respectively. Our findings reveal that healthcare professionals value ChatGPT in medical education and preliminary clinical settings, but exercise caution due to concerns about reliability, privacy, and the need for cross-verification with medical references. General users show a preference for AI interactions in healthcare, but concerns regarding accuracy and trust persist. The study underscores the need for these technologies to complement, not replace, human medical expertise, highlighting the importance of develo** LLMs in collaboration with healthcare providers. This paper enhances the understanding of LLMs in healthcare, detailing current usage, user trust, and improvement areas. Our insights inform future research and development, underscoring the need for ethically compliant, user-focused LLM advancements that address healthcare-specific challenges. △ Less

Submitted 28 January, 2024; originally announced January 2024.

Comments: Under review

arXiv:2401.15595 [pdf, other]

Comuniqa : Exploring Large Language Models for improving speaking skills

Authors: Manas Mhasakar, Shikhar Sharma, Apurv Mehra, Utkarsh Venaik, Ujjwal Singhal, Dhruv Kumar, Kashish Mittal

Abstract: In this paper, we investigate the potential of Large Language Models (LLMs) to improve English speaking skills. This is particularly relevant in countries like India, where English is crucial for academic, professional, and personal communication but remains a non-native language for many. Traditional methods for enhancing speaking skills often rely on human experts, which can be limited in terms… ▽ More In this paper, we investigate the potential of Large Language Models (LLMs) to improve English speaking skills. This is particularly relevant in countries like India, where English is crucial for academic, professional, and personal communication but remains a non-native language for many. Traditional methods for enhancing speaking skills often rely on human experts, which can be limited in terms of scalability, accessibility, and affordability. Recent advancements in Artificial Intelligence (AI) offer promising solutions to overcome these limitations. We propose Comuniqa, a novel LLM-based system designed to enhance English speaking skills. We adopt a human-centric evaluation approach, comparing Comuniqa with the feedback and instructions provided by human experts. In our evaluation, we divide the participants in three groups: those who use LLM-based system for improving speaking skills, those guided by human experts for the same task and those who utilize both the LLM-based system as well as the human experts. Using surveys, interviews, and actual study sessions, we provide a detailed perspective on the effectiveness of different learning modalities. Our preliminary findings suggest that while LLM-based systems have commendable accuracy, they lack human-level cognitive capabilities, both in terms of accuracy and empathy. Nevertheless, Comuniqa represents a significant step towards achieving Sustainable Development Goal 4: Quality Education by providing a valuable learning tool for individuals who may not have access to human experts for improving their speaking skills. △ Less

Submitted 14 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

Comments: Accepted at 7th ACM SIGCAS/SIGCHI Conference of Computing and Sustainable Societies : ACM COMPASS 2024

arXiv:2401.15589 [pdf, other]

OpineBot: Class Feedback Reimagined Using a Conversational LLM

Authors: Henansh Tanwar, Kunal Shrivastva, Rahul Singh, Dhruv Kumar

Abstract: Conventional class feedback systems often fall short, relying on static, unengaging surveys offering little incentive for student participation. To address this, we present OpineBot, a novel system employing large language models (LLMs) to conduct personalized, conversational class feedback via chatbot interface. We assessed OpineBot's effectiveness in a user study with 20 students from an Indian… ▽ More Conventional class feedback systems often fall short, relying on static, unengaging surveys offering little incentive for student participation. To address this, we present OpineBot, a novel system employing large language models (LLMs) to conduct personalized, conversational class feedback via chatbot interface. We assessed OpineBot's effectiveness in a user study with 20 students from an Indian university's Operating-Systems class, utilizing surveys and interviews to analyze their experiences. Findings revealed a resounding preference for OpineBot compared to conventional methods, highlighting its ability to engage students, produce deeper feedback, offering a dynamic survey experience. This research represents a work in progress, providing early results, marking a significant step towards revolutionizing class feedback through LLM-based technology, promoting student engagement, and leading to richer data for instructors. This ongoing research presents preliminary findings and marks a notable advancement in transforming classroom feedback using LLM-based technology to enhance student engagement and generate comprehensive data for educators. △ Less

Submitted 28 January, 2024; originally announced January 2024.

Comments: Under review

arXiv:2401.11095 [pdf, other]

doi 10.1145/3643834.3661556

SoundShift: Exploring Sound Manipulations for Accessible Mixed-Reality Awareness

Authors: Ruei-Che Chang, Chia-Sheng Hung, Bing-Yu Chen, Dhruv Jain, Anhong Guo

Abstract: Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum… ▽ More Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum posts within the blind community, identifying prevailing challenges, needs, and desired solutions. We synthesized the results and propose SoundShift for increasing MR sound awareness, which includes six sound manipulations: Transparency Shift, Envelope Shift, Position Shift, Style Shift, Time Shift, and Sound Append. To evaluate the effectiveness of SoundShift, we conducted a user study with 18 blind participants across three simulated MR scenarios, where participants identified specific sounds within intricate soundscapes. We found that SoundShift increased MR sound awareness and minimized cognitive load. Finally, we developed three real-world example applications to demonstrate the practicality of SoundShift. △ Less

Submitted 26 May, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: DIS 2024

arXiv:2401.09937 [pdf, other]

From Cash to Cashless: UPI's Impact on Spending Behavior among Indian Users

Authors: Harshal Dev, Raj Gupta, Dhruv Kumar

Abstract: The emergence of digital payment systems has transformed how individuals conduct financial transactions, offering convenience, security, and efficiency. One groundbreaking innovation making waves in the Indian financial landscape is the Unified Payments Interface (UPI). Existing work has explored how digital payments benefit a country's economy and GDP. However, our study explores how the introduc… ▽ More The emergence of digital payment systems has transformed how individuals conduct financial transactions, offering convenience, security, and efficiency. One groundbreaking innovation making waves in the Indian financial landscape is the Unified Payments Interface (UPI). Existing work has explored how digital payments benefit a country's economy and GDP. However, our study explores how the introduction of UPI has influenced spending behavior among Indian users on an "individual" level. We gathered 235 valid survey responses encompassing diverse demographics and interviewed 20 survey respondents. Approximately 75\% of the survey respondents reported increased spending due to UPI, with only 7\% indicating reduced spending. Significantly, 91.5\% of the respondents reported satisfaction with their UPI usage. Also, 95.2\% of the survey respondents found making payments via UPI convenient. Our research also provides suggestions for UPI applications and various stakeholders to enhance digital payment systems, enabling users to make informed decisions and fostering responsible financial management. △ Less

Submitted 7 May, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted to ACM CHI 2024 - Late Breaking Work Track

arXiv:2401.08091 [pdf, other]

'One Style Does Not Regulate All': Moderation Practices in Public and Private WhatsApp Groups

Authors: Farhana Shahid, Dhruv Agarwal, Aditya Vashistha

Abstract: WhatsApp is the largest social media platform in the Global South and is a virulent force in global misinformation and political propaganda. Due to end-to-end encryption WhatsApp can barely review any content and this often pushes the responsibility of moderation towards group admins. Yet, little is known about how WhatsApp group admins manage their groups, what factors and values influence modera… ▽ More WhatsApp is the largest social media platform in the Global South and is a virulent force in global misinformation and political propaganda. Due to end-to-end encryption WhatsApp can barely review any content and this often pushes the responsibility of moderation towards group admins. Yet, little is known about how WhatsApp group admins manage their groups, what factors and values influence moderation decisions, and what challenges they face in moderating their groups. To fill this gap, we interviewed admins of 32 diverse groups and reviewed content from 30 public groups in India and Bangladesh. We observed notable differences in the formation, members' behavior, and moderation of public versus private groups, as well as in how WhatsApp admins operate compared to those on other platforms. We used Baumrind's typology of 'parenting styles' as a lens to explore moderation practices in WhatsApp groups and identified four moderation styles based on how responsive and controlling the admins were and discuss design recommendations to help them better manage problematic content in WhatsApp groups. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.07770 [pdf, other]

Seeing the Unseen: Visual Common Sense for Semantic Placement

Authors: Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs

Abstract: Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding what is not present. Specifically, given an image (e.g. of a living room) and name of an object ("cushion"), a vision system is asked to predict semantically-meaningful regions (masks or boundi… ▽ More Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding what is not present. Specifically, given an image (e.g. of a living room) and name of an object ("cushion"), a vision system is asked to predict semantically-meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely be placed by humans (e.g. on the sofa). We call this task: Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assitive robots (tidying a house), and AR devices (automatically rendering an object in the user's space). Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images and asking humans to annotate the contents of the image; neither of those two steps are straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context from web, and then remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a dataset comprising pairs of images with/without the object. Using this, we collect a novel dataset, with ${\sim}1.3$M images across $9$ object categories, and train a SP prediction model called CLIP-UNet. CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors on real-world and simulated images. In our user studies, we find that the SP masks predicted by CLIP-UNet are favored $43.7\%$ and $31.3\%$ times when comparing against the $4$ SP baselines on real and simulated images. In addition, we demonstrate leveraging SP mask predictions from CLIP-UNet enables downstream applications like building tidying robots in indoor environments. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.04120 [pdf, other]

Generation Z's Ability to Discriminate Between AI-generated and Human-Authored Text on Discord

Authors: Dhruv Ramu, Rishab Jain, Aditya Jain

Abstract: The growing popularity of generative artificial intelligence (AI) chatbots such as ChatGPT is having transformative effects on social media. As the prevalence of AI-generated content grows, concerns have been raised regarding privacy and misinformation online. Among social media platforms, Discord enables AI integrations -- making their primarily "Generation Z" userbase particularly exposed to AI-… ▽ More The growing popularity of generative artificial intelligence (AI) chatbots such as ChatGPT is having transformative effects on social media. As the prevalence of AI-generated content grows, concerns have been raised regarding privacy and misinformation online. Among social media platforms, Discord enables AI integrations -- making their primarily "Generation Z" userbase particularly exposed to AI-generated content. We surveyed Generation Z aged individuals (n = 335) to evaluate their proficiency in discriminating between AI-generated and human-authored text on Discord. The investigation employed one-shot prompting of ChatGPT, disguised as a text message received on the Discord.com platform. We explore the influence of demographic factors on ability, as well as participants' familiarity with Discord and artificial intelligence technologies. We find that Generation Z individuals are unable to discern between AI and human-authored text (p = 0.011), and that those with lower self-reported familiarity with Discord demonstrated an improved ability in identifying human-authored compared to those with self-reported experience with AI (p << 0.0001). Our results suggest that there is a nuanced relationship between AI technology and popular modes of communication for Generation Z, contributing valuable insights into human-computer interactions, digital communication, and artificial intelligence literacy. △ Less

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.17601 [pdf]

The Tyranny of Possibilities in the Design of Task-Oriented LLM Systems: A Sco** Survey

Authors: Dhruv Dhamani, Mary Lou Maher

Abstract: This sco** survey focuses on our current understanding of the design space for task-oriented LLM systems and elaborates on definitions and relationships among the available design parameters. The paper begins by defining a minimal task-oriented LLM system and exploring the design space of such systems through a thought experiment contemplating the performance of diverse LLM system configurations… ▽ More This sco** survey focuses on our current understanding of the design space for task-oriented LLM systems and elaborates on definitions and relationships among the available design parameters. The paper begins by defining a minimal task-oriented LLM system and exploring the design space of such systems through a thought experiment contemplating the performance of diverse LLM system configurations (involving single LLMs, single LLM-based agents, and multiple LLM-based agent systems) on a complex software development task and hypothesizes the results. We discuss a pattern in our results and formulate them into three conjectures. While these conjectures may be partly based on faulty assumptions, they provide a starting point for future research. The paper then surveys a select few design parameters: covering and organizing research in LLM augmentation, prompting techniques, and uncertainty estimation, and discussing their significance. The paper notes the lack of focus on computational and energy efficiency in evaluating research in these areas. Our survey findings provide a basis for develo** the concept of linear and non-linear contexts, which we define and use to enable an agent-centric projection of prompting techniques providing a lens through which prompting techniques can be viewed as multi-agent systems. The paper discusses the implications of this lens, for the cross-pollination of research between LLM prompting and LLM-based multi-agent systems; and also, for the generation of synthetic training data based on existing prompting techniques in research. In all, the sco** survey presents seven conjectures that can help guide future research efforts. △ Less

Submitted 29 December, 2023; originally announced December 2023.

Comments: 18 pages, 6 figures. Work-in-progress draft published to gather feedback. Please reach out with comments, if any

arXiv:2312.16733 [pdf, other]

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

Authors: Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov

Abstract: The increasing deployment of ML models on the critical path of production applications in both datacenter and the edge requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overal… ▽ More The increasing deployment of ML models on the critical path of production applications in both datacenter and the edge requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overall efficiency of utilization of scarce resources. State-of-the-art systems resolve this tension by either choosing a static point in the latency-accuracy tradeoff space to serve all requests or load specific models on the critical path of request serving. In this work, we instead resolve this tension by simultaneously serving the entire-range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized operators in weight-shared SuperNetworks. These operators enable SubNetAct to dynamically route requests through the network to meet a latency and accuracy target. SubNetAct requires upto 2.6x lower memory to serve a vastly-higher number of models than prior state-of-the-art. In addition, SubNetAct's near-instantaneous actuation of models unlocks the design space of fine-grained, reactive scheduling policies. We explore the design of one such extremely effective policy, SlackFit and instantiate both SubNetAct and SlackFit in a real system, SuperServe. SuperServe achieves 4.67% higher accuracy for the same SLO attainment and 2.85x higher SLO attainment for the same accuracy on a trace derived from the real-world Microsoft Azure Functions workload and yields the best trade-offs on a wide range of extremely-bursty synthetic traces automatically. △ Less

Submitted 27 December, 2023; originally announced December 2023.

Showing 1–50 of 495 results for author: Dhruv