-
Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation
Authors:
Atharvan Dogra,
Ameet Deshpande,
John Nay,
Tanmay Rajpurohit,
Ashwin Kalyan,
Balaraman Ravindran
Abstract:
Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern, which we refer to as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception through straight-out lying, making objective selfish decisions, or giving false information, as seen in previous AI safety research. We target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogizing them with the rabbit-out-of-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is completely distracted to see the magician bring out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework displays intrinsic deception capabilities of LLM agents in a goal-driven environment when directed to be deceptive in their natural language generations in a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Along the lines of a goal-driven environment, we show developing deceptive capacity through a reinforcement learning setup, building it around the theories of language philosophy and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans towards their programmed end-goal.
Submitted 7 May, 2024;
originally announced May 2024.
-
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Authors:
Shreyas Chaudhari,
Pranjal Aggarwal,
Vishvak Murahari,
Tanmay Rajpurohit,
Ashwin Kalyan,
Karthik Narasimhan,
Ameet Deshpande,
Bruno Castro da Silva
Abstract:
State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.
Submitted 15 April, 2024; v1 submitted 12 April, 2024;
originally announced April 2024.
-
High-contrast JWST-MIRI spectroscopy of planet-forming disks for the JDISC Survey
Authors:
Klaus M. Pontoppidan,
Colette Salyk,
Andrea Banzatti,
Ke Zhang,
Ilaria Pascucci,
Karin I. Oberg,
Feng Long,
Carlos Munoz-Romero,
John Carr,
Joan Najita,
Geoffrey A. Blake,
Nicole Arulanantham,
Sean Andrews,
Nicholas P. Ballering,
Edwin Bergin,
Jenny Calahan,
Douglas Cobb,
Maria Jose Colmenares,
Annie Dickson-Vandervelde,
Anna Dignan,
Joel Green,
Phoebe Heretz,
Greg Herczeg,
Anusha Kalyaan,
Sebastian Krijt
, et al. (4 additional authors not shown)
Abstract:
The JWST Disk Infrared Spectral Chemistry Survey (JDISCS) aims to understand the evolution of the chemistry of inner protoplanetary disks using the Mid-InfraRed Instrument (MIRI) on the James Webb Space Telescope (JWST). With a growing sample of >30 disks, the survey implements a custom method to calibrate the MIRI Medium Resolution Spectrometer (MRS) to contrasts of better than 1:300 across its 4.9-28 micron spectral range. This is achieved using observations of Themis-family asteroids as precise empirical reference sources. High spectral contrast enables precise retrievals of physical parameters, searches for rare molecular species and isotopologues, and constraints on the inventories of carbon- and nitrogen-bearing species. JDISCS also offers significant improvements to the MRS wavelength and resolving power calibration. We describe the JDISCS calibrated data and demonstrate its quality using observations of the disk around the solar-mass young star FZ Tau. The FZ Tau MIRI spectrum is dominated by strong emission from warm water vapor. We show that the water and CO line emission originates from the disk surface and traces a range of gas temperatures of ~500-1500 K. We retrieve parameters for the observed CO and H2O lines, and show that they are consistent with a radial distribution represented by two temperature components. A high water abundance of n(H2O)~10^-4 fills the disk surface at least out to the 350 K isotherm at 1.5 au. We search the FZ Tau environs for extended emission, detecting a large (radius of ~300 au) ring of emission from H2 gas surrounding FZ Tau, and discuss its origin.
Submitted 16 January, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
GEO: Generative Engine Optimization
Authors:
Pranjal Aggarwal,
Vishvak Murahari,
Tanmay Rajpurohit,
Ashwin Kalyan,
Karthik Narasimhan,
Ameet Deshpande
Abstract:
The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of generative engines (GEs), can generate accurate and personalized responses, rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them using LLMs. While this shift significantly improves $\textit{user}$ utility and $\textit{generative search engine}$ traffic, it poses a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of generative engines, content creators have little to no control over $\textit{when}$ and $\textit{how}$ their content is displayed. With generative engines here to stay, we must ensure the creator economy is not disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), the first novel paradigm to aid content creators in improving their content visibility in generative engine responses through a flexible black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation by introducing GEO-bench, a large-scale benchmark of diverse user queries across multiple domains, along with relevant web sources to answer these queries. Through rigorous evaluation, we demonstrate that GEO can boost visibility by up to $40\%$ in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific optimization methods. Our work opens a new frontier in information discovery systems, with profound implications for both developers of generative engines and content creators.
Submitted 28 June, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs
Authors:
Shashank Gupta,
Vaishnavi Shrivastava,
Ameet Deshpande,
Ashwin Kalyan,
Peter Clark,
Ashish Sabharwal,
Tushar Khot
Abstract:
Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep-rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and it can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard to discern and hard to avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.
Submitted 27 January, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
QualEval: Qualitative Evaluation for Model Improvement
Authors:
Vishvak Murahari,
Ameet Deshpande,
Peter Clark,
Tanmay Rajpurohit,
Ashish Sabharwal,
Karthik Narasimhan,
Ashwin Kalyan
Abstract:
Quantitative evaluation metrics have traditionally been pivotal in gauging the advancements of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar to quantify and compare is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a way to compare and benchmark models, and do not yield actionable diagnostics, thus making the model improvement process challenging. Model developers find themselves amid extensive manual efforts involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace of model development, thus in essence serving as a data-scientist-in-a-box. Given the focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new technique for both model evaluation and improvement.
Submitted 5 May, 2024; v1 submitted 5 November, 2023;
originally announced November 2023.
-
Water-Rich Disks around Late M-stars Unveiled: Exploring the Remarkable Case of Sz114
Authors:
Chengyan Xie,
Ilaria Pascucci,
Feng Long,
Klaus M. Pontoppidan,
Andrea Banzatti,
Anusha Kalyaan,
Colette Salyk,
Yao Liu,
Joan R. Najita,
Paola Pinilla,
Nicole Arulanantham,
Gregory J. Herczeg,
John Carr,
Edwin A. Bergin,
Nicholas P. Ballering,
Sebastiaan Krijt,
Geoffrey A. Blake,
Ke Zhang,
Karin I. Oberg,
Joel D. Green,
the JDISC collaboration
Abstract:
We present an analysis of the JDISC JWST/MIRI-MRS spectrum of Sz 114, an accreting M5 star surrounded by a large dust disk with a shallow gap at $\sim 39$ au. The spectrum is molecular-rich: we report the detection of water, CO, CO$_2$, HCN, C$_2$H$_2$, and H$_2$. The only identified atomic/ionic transition is from [NeII] at 12.81 micron. A distinct feature of this spectrum is the forest of water lines with the 17.22 micron emission surpassing that of most mid-to-late M-star disks by an order of magnitude in flux and aligning instead with disks of earlier-type stars. Moreover, flux ratios of C$_2$H$_2$/H$_2$O and HCN/H$_2$O in Sz 114 also resemble those of earlier-type disks, with a slightly elevated CO$_2$/H$_2$O ratio. While accretional heating can boost all infrared lines, the unusual properties of Sz 114 could be explained by the young age of the source, its formation under unusual initial conditions (a large massive disk), and the presence of dust substructures. The latter delay the inward drift of icy pebbles and help preserve a lower C/O ratio over an extended period. In contrast, mid-to-late M-star disks--which are typically faint, small in size, and likely lack significant substructures--may have more quickly depleted the outer icy reservoir and already evolved out of a water-rich inner disk phase. Our findings underscore the unexpected diversity within mid-infrared spectra of mid-to-late M-star disks, highlighting the need to expand the observational sample for a comprehensive understanding of their variations and to thoroughly test pebble drift and planet formation models.
Submitted 28 November, 2023; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Estimating Numbers without Regression
Authors:
Avijit Thawani,
Jay Pujara,
Ashwin Kalyan
Abstract:
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line; whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either the (1) notation in which numbers are written (e.g., scientific vs decimal), the (2) vocabulary used to represent numbers or the entire (3) architecture of the underlying language model, to directly regress to a desired number.
Previous work suggests that architectural change helps achieve state-of-the-art on number estimation but we find an insightful ablation: changing the model's vocabulary instead (e.g., introducing a new token for numbers in the range 10-100) is a far better trade-off. In the context of masked number prediction, a carefully designed tokenization scheme is both the simplest to implement and sufficient, i.e., with similar performance to the state-of-the-art approach that requires making significant architectural changes. Finally, we report similar trends on the downstream task of numerical fact estimation (for Fermi Problems) and discuss reasons behind our findings.
Submitted 9 October, 2023;
originally announced October 2023.
-
Distraction-free Embeddings for Robust VQA
Authors:
Atharvan Dogra,
Deeksha Varshney,
Ashwin Kalyan,
Ameet Deshpande,
Neeraj Kumar
Abstract:
The generation of effective latent representations and their subsequent refinement to incorporate precise information is an essential prerequisite for Vision-Language Understanding (VLU) tasks such as Video Question Answering (VQA). However, most existing methods for VLU focus on sparsely sampling or fine-graining the input information (e.g., sampling a sparse set of frames or text tokens), or adding external knowledge. We present a novel "DRAX: Distraction Removal and Attended Cross-Alignment" method to rid our cross-modal representations of distractors in the latent space. We do not exclusively confine the perception of any input information from various modalities but instead use an attention-guided distraction removal method to increase focus on task-relevant information in latent embeddings. DRAX also ensures semantic alignment of embeddings during cross-modal fusions. We evaluate our approach on a challenging benchmark (SUTD-TrafficQA dataset), testing the framework's abilities for feature and event queries, temporal relation understanding, forecasting, hypothesis, and causal analysis through extensive experiments.
Submitted 31 August, 2023;
originally announced September 2023.
-
Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations
Authors:
Nirbhay Modhe,
Qiaozi Gao,
Ashwin Kalyan,
Dhruv Batra,
Govind Thattai,
Gaurav Sukhatme
Abstract:
Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods are able to further exploit unseen states via model rollouts. However, such methods are handicapped in their ability to find unseen states far away from the available offline data due to two factors -- (a) very short rollout horizons in models due to cascading model errors, and (b) model rollouts originating solely from states observed in offline data. We relax the second assumption and present a novel unseen state augmentation strategy to allow exploitation of unseen states where the learned model and value estimates generalize. Our strategy finds unseen states by value-informed perturbations of seen states followed by filtering out states with epistemic uncertainty estimates too high (high error) or too low (too similar to seen data). We observe improved performance in several offline RL tasks and find that our augmentation strategy consistently leads to overall lower average dataset Q-value estimates i.e. more conservative Q-value estimates than a baseline.
Submitted 24 September, 2023; v1 submitted 7 August, 2023;
originally announced August 2023.
-
JWST reveals excess cool water near the snowline in compact disks, consistent with pebble drift
Authors:
Andrea Banzatti,
Klaus M. Pontoppidan,
John Carr,
Evan Jellison,
Ilaria Pascucci,
Joan Najita,
Carlos E. Munoz-Romero,
Karin I. Oberg,
Anusha Kalyaan,
Paola Pinilla,
Sebastiaan Krijt,
Feng Long,
Michiel Lambrechts,
Giovanni Rosotti,
Gregory J. Herczeg,
Colette Salyk,
Ke Zhang,
Edwin Bergin,
Nick Ballering,
Michael R. Meyer,
Simon Bruderer,
the JDISCS collaboration
Abstract:
Previous analyses of mid-infrared water spectra from young protoplanetary disks observed with the Spitzer-IRS found an anti-correlation between water luminosity and the millimeter dust disk radius observed with ALMA. This trend was suggested to be evidence for a fundamental process of inner disk water enrichment, used to explain properties of the Solar System 40 years ago, in which icy pebbles drift inward from the outer disk and sublimate after crossing the snowline. Previous analyses of IRS water spectra, however, were uncertain due to the low spectral resolution that blended lines together. We present new JWST-MIRI spectra of four disks, two compact and two large with multiple radial gaps, selected to test the scenario that water vapor inside the snowline is regulated by pebble drift. The higher spectral resolving power of MIRI-MRS now yields water spectra that separate individual lines, tracing upper level energies from 900 K to 10,000 K. These spectra clearly reveal excess emission in the low-energy lines in compact disks, compared to the large disks, demonstrating an enhanced cool component with $T \approx$ 170-400 K and equivalent emitting radius $R_{\rm{eq}}\approx$ 1-10 au. We interpret the cool water emission as ice sublimation and vapor diffusion near the snowline, suggesting that there is indeed a higher inwards mass flux of icy pebbles in compact disks. Observation of this process opens up multiple exciting prospects to study planet formation chemistry in inner disks with JWST.
Submitted 3 September, 2023; v1 submitted 7 July, 2023;
originally announced July 2023.
-
The Effect of Dust Evolution and Traps on Inner Disk Water Enrichment
Authors:
Anusha Kalyaan,
Paola Pinilla,
Sebastiaan Krijt,
Andrea Banzatti,
Giovanni Rosotti,
Gijs D. Mulders,
Michiel Lambrechts,
Feng Long,
Gregory J. Herczeg
Abstract:
Substructures in protoplanetary disks can act as dust traps that shape the radial distribution of pebbles. By blocking the passage of pebbles, the presence of gaps in disks may have a profound effect on pebble delivery into the inner disk, crucial for the formation of inner planets via pebble accretion. This process can also affect the delivery of volatiles (such as H$_2$O) and their abundance within the water snow line region (within a few au). In this study, we aim to understand what effect the presence of gaps in the outer gas disk may have on water vapor enrichment in the inner disk. Building on previous work, we employ a volatile-inclusive multi-Myr disk evolution model that considers an evolving ice-bearing drifting dust population, sensitive to dust-traps, which loses its icy content to sublimation upon reaching the snow line. We find that vapor abundance in the inner disk is strongly affected by fragmentation velocity (v$_{\rm f}$) and turbulence, which control how intense vapor enrichment from pebble delivery is, if present, and how long it may last. Generally, for disks with low to moderate turbulence ($α$ $\le$ 1 $\times$ 10$^{-3}$) and for a range of v$_{\rm f}$, the radial location and depth of gaps (especially those of the innermost gaps) can significantly alter enrichment. Shallow inner gaps may continuously leak material from beyond them, despite the presence of additional deep outer gaps. Finally, we find that for realistic v$_{\rm f}$ ($\le$ 10 m s$^{-1}$), the presence of gaps is more important than planetesimal formation beyond the snow line in regulating pebble and volatile delivery into the inner disk.
Submitted 4 July, 2023;
originally announced July 2023.
-
C-STS: Conditional Semantic Textual Similarity
Authors:
Ameet Deshpande,
Carlos E. Jimenez,
Howard Chen,
Vishvak Murahari,
Victoria Graf,
Tanmay Rajpurohit,
Ashwin Kalyan,
Danqi Chen,
Karthik Narasimhan
Abstract:
Semantic textual similarity (STS), a cornerstone task in NLP, measures the degree of similarity between a pair of sentences, and has broad application in fields such as information retrieval and natural language understanding. However, sentence similarity can be inherently ambiguous, depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called Conditional STS (C-STS) which measures sentences' similarity conditioned on a feature described in natural language (hereon, condition). As an example, the similarity between the sentences "The NBA player shoots a three-pointer." and "A man throws a tennis ball into the air to serve." is higher for the condition "The motion of the ball" (both upward) and lower for "The size of the ball" (one large and one small). C-STS's advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS and (2) it enables fine-grained language model evaluation through diverse natural language conditions. We put several state-of-the-art models to the test, and even those performing well on STS (e.g. SimCSE, Flan-T5, and GPT-4) find C-STS challenging; all with Spearman correlation scores below 50. To encourage a more comprehensive evaluation of semantic similarity and natural language understanding, we make nearly 19K C-STS examples and code available for others to train and test their models.
Submitted 6 November, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Anthropomorphization of AI: Opportunities and Risks
Authors:
Ameet Deshpande,
Tanmay Rajpurohit,
Karthik Narasimhan,
Ashwin Kalyan
Abstract:
Anthropomorphization is the tendency to attribute human-like traits to non-human entities. It is prevalent in many social contexts -- children anthropomorphize toys, adults do so with brands, and it is a literary device. It is also a versatile tool in science, with behavioral psychology and evolutionary biology meticulously documenting its consequences. With widespread adoption of AI systems, and the push from stakeholders to make them human-like through alignment techniques, human voice, and pictorial avatars, the tendency for users to anthropomorphize them increases significantly. We take a dyadic approach to understanding this phenomenon with large language models (LLMs) by studying (1) the objective legal implications, as analyzed through the lens of the recent blueprint of AI bill of rights and (2) the subtle psychological aspects of customization and anthropomorphization. We find that anthropomorphized LLMs customized for different user bases violate multiple provisions in the legislative blueprint. In addition, we point out that anthropomorphization of LLMs affects the influence they can have on their users, thus having the potential to fundamentally change the nature of human-AI interaction, with potential for manipulation and negative influence. With LLMs being hyper-personalized for vulnerable groups like children and patients among others, our work is a timely and important contribution. We propose a conservative strategy for the cautious use of anthropomorphization to improve trustworthiness of AI systems.
Submitted 24 May, 2023;
originally announced May 2023.
-
RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs
Authors:
Afra Feyza Akyürek,
Ekin Akyürek,
Aman Madaan,
Ashwin Kalyan,
Peter Clark,
Derry Wijaya,
Niket Tandon
Abstract:
Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
Submitted 11 July, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
ProKnow: Process Knowledge for Safety Constrained and Explainable Question Generation for Mental Health Diagnostic Assistance
Authors:
Kaushik Roy,
Manas Gaur,
Misagh Soltani,
Vipula Rawte,
Ashwin Kalyan,
Amit Sheth
Abstract:
Current Virtual Mental Health Assistants (VMHAs) provide counseling and suggestive care. They refrain from patient diagnostic assistance because they lack training in safety-constrained and specialized clinical process knowledge. In this work, we define ProKnow as an ordered set of information that maps to evidence-based guidelines or categories of conceptual understanding to experts in a domain. We also introduce a new dataset of diagnostic conversations guided by safety constraints and ProKnow that healthcare professionals use. We develop a method for natural language question generation (NLG) that collects diagnostic information from the patient interactively. We demonstrate the limitations of using state-of-the-art large-scale language models (LMs) on this dataset. Our algorithm models the process knowledge by explicitly modeling safety, knowledge capture, and explainability. LMs augmented with the ProKnow-guided method generated 89% safer questions in the depression and anxiety domain. The explainability of the generated questions is assessed by computing similarity with concepts in depression and anxiety knowledge bases. Overall, irrespective of the type of LM augmented with ProKnow, we achieved an average 82% improvement over simple pre-trained LMs on safety, explainability, and process-guided question generation. We qualitatively and quantitatively evaluate the efficacy of the proposed ProKnow-guided methods by introducing three new evaluation metrics for safety, explainability, and process knowledge adherence.
Submitted 1 June, 2023; v1 submitted 13 May, 2023;
originally announced May 2023.
-
Detection and Classification of Glioblastoma Brain Tumor
Authors:
Utkarsh Maurya,
Appisetty Krishna Kalyan,
Swapnil Bohidar,
Dr. S. Sivakumar
Abstract:
Glioblastoma brain tumors are highly malignant and often require early detection and accurate segmentation for effective treatment. We are proposing two deep learning models in this paper, namely UNet and Deeplabv3, for the detection and segmentation of glioblastoma brain tumors using preprocessed brain MRI images. The performance evaluation is done for these models in terms of accuracy and computational efficiency. Our experimental results demonstrate that both UNet and Deeplabv3 models achieve accurate detection and segmentation of glioblastoma brain tumors. However, Deeplabv3 outperforms UNet in terms of accuracy, albeit at the cost of requiring more computational resources. Our proposed models offer a promising approach for the early detection and segmentation of glioblastoma brain tumors, which can aid in effective treatment strategies. Further research can focus on optimizing the computational efficiency of the Deeplabv3 model while maintaining its high accuracy for real-world clinical applications. Overall, our approach works and contributes to the field of medical image analysis and deep learning-based approaches for brain tumor detection and segmentation. Our suggested models can have a major influence on the prognosis and treatment of people with glioblastoma, a fatal form of brain cancer. It is necessary to conduct more research to examine the practical use of these models in real-life healthcare settings.
Submitted 18 April, 2023;
originally announced April 2023.
-
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
Authors:
Ameet Deshpande,
Vishvak Murahari,
Tanmay Rajpurohit,
Ashwin Kalyan,
Karthik Narasimhan
Abstract:
Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community, with adoption throughout many services like healthcare, therapy, education, and customer service. Since users include people with critical information needs like students or patients engaging with chatbots, the safety of these systems is of prime importance. Therefore, a clear understanding of the capabilities and limitations of LLMs is necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of generations. Depending on the persona assigned to ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. This may be potentially defamatory to the persona and harmful to an unsuspecting user. Furthermore, we find concerning patterns where specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, that reflect inherent discriminatory biases in the model. We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy AI systems.
Submitted 11 April, 2023;
originally announced April 2023.
-
Photoacoustic Image Quality Improvement from a Single Cell Low Frequency PMUT
Authors:
Kaustav Roy,
Arijit Paramanik,
Souradip Paul,
Akshay Kalyan,
Eshani Sarkar,
Anuj Ashok,
Rudra Pratap,
M Suheshkumar Singh
Abstract:
We report photoacoustic image (PAI) quality improvement using a low-frequency piezoelectric micromachined ultrasound transducer (PMUT) with a fundamental resonant frequency of 1 MHz. Specifically, three methods are implemented to obtain unparalleled PAI image quality: frame averaging, mathematically improved algorithms, and a position-accurate hardware arrangement. A validation study has been conducted in both agar phantom and ex-vivo tissue samples. Measurable image quantifiers in the form of full width at half maximum (FWHM), signal-to-noise ratio (SNR), and contrast ratio (CR) are used to evaluate the improvement in image quality. It is found that the FWHM increases by 34%, SNR by 23% and CR by 25%, suggesting the efficacy of the methods in achieving better photoacoustic images with a PMUT-based detector. The study demonstrates that the suggested methods of improvement could play a key role in a promising cost-effective PMUT-PAI system in the future.
Submitted 30 November, 2022;
originally announced November 2022.
-
SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers
Authors:
Ameet Deshpande,
Md Arafat Sultan,
Anthony Ferritto,
Ashwin Kalyan,
Karthik Narasimhan,
Avirup Sil
Abstract:
Fine-tuning pre-trained language models (PLMs) achieves impressive performance on a range of downstream tasks, and their sizes have consequently been getting bigger. Since a different copy of the model is required for each task, this paradigm is infeasible for storage-constrained edge devices like mobile phones. In this paper, we propose SPARTAN, a parameter efficient (PE) and computationally fast architecture for edge devices that adds hierarchically organized sparse memory after each Transformer layer. SPARTAN freezes the PLM parameters and fine-tunes only its memory, thus significantly reducing storage costs by re-using the PLM backbone for different tasks. SPARTAN contains two levels of memory, with only a sparse subset of parents being chosen in the first level for each input, and children cells corresponding to those parents being used to compute an output representation. This sparsity combined with other architecture optimizations improves SPARTAN's throughput by over 90% during inference on a Raspberry Pi 4 when compared to PE baselines (adapters) while also outperforming the latter by 0.1 points on the GLUE benchmark. Further, it can be trained 34% faster in a few-shot setting, while performing within 0.9 points of adapters. Qualitative analysis shows that different parent cells in SPARTAN specialize in different topics, thus dividing responsibility efficiently.
Submitted 29 November, 2022;
originally announced November 2022.
-
Lila: A Unified Benchmark for Mathematical Reasoning
Authors:
Swaroop Mishra,
Matthew Finlayson,
Pan Lu,
Leonard Tang,
Sean Welleck,
Chitta Baral,
Tanmay Rajpurohit,
Oyvind Tafjord,
Ashish Sabharwal,
Peter Clark,
Ashwin Kalyan
Abstract:
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities e.g., arithmetic, calculus (ii) language format e.g., question-answering, fill-in-the-blanks (iii) language diversity e.g., no language, simple language (iv) external knowledge e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best performing model only obtains 60.40%, indicating the room for improvement in general mathematical reasoning and understanding.
Submitted 8 March, 2023; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
Authors:
Pan Lu,
Liang Qiu,
Kai-Wei Chang,
Ying Nian Wu,
Song-Chun Zhu,
Tanmay Rajpurohit,
Peter Clark,
Ashwin Kalyan
Abstract:
Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples.
Submitted 2 March, 2023; v1 submitted 29 September, 2022;
originally announced September 2022.
-
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Authors:
Pan Lu,
Swaroop Mishra,
Tony Xia,
Liang Qiu,
Kai-Wei Chang,
Song-Chun Zhu,
Oyvind Tafjord,
Peter Clark,
Ashwin Kalyan
Abstract:
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.
Submitted 17 October, 2022; v1 submitted 20 September, 2022;
originally announced September 2022.
-
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks
Authors:
Swaroop Mishra,
Arindam Mitra,
Neeraj Varshney,
Bhavdeep Sachdeva,
Peter Clark,
Chitta Baral,
Ashwin Kalyan
Abstract:
Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when problems appear in a slightly different scenario. Drawing inspiration from GLUE, which was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that, at their core, require simple arithmetic understanding. We show that this benchmark is far from being solved, with neural models including state-of-the-art large-scale language models performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data, as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.
Submitted 12 April, 2022;
originally announced April 2022.
-
How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI
Authors:
Ashwin Kalyan,
Abhinav Kumar,
Arjun Chandrasekaran,
Ashish Sabharwal,
Peter Clark
Abstract:
Many real-world problems require the combined application of multiple reasoning abilities employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, "How much would the sea level rise if all ice in the world melted?" FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping in supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.
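To make "off by two orders of magnitude" concrete, here is a minimal sketch of one way such an error could be measured, as the absolute base-10 log ratio between an estimate and a reference answer. The function name and the sample numbers are hypothetical, and this is not claimed to be the scoring used in the paper.

```python
import math


def orders_of_magnitude_off(estimate: float, reference: float) -> float:
    """Absolute log10 ratio between an estimate and a reference value.

    Illustrative error measure only (an assumption, not the paper's metric).
    Both values must be strictly positive.
    """
    if estimate <= 0 or reference <= 0:
        raise ValueError("estimate and reference must be positive")
    return abs(math.log10(estimate / reference))


# Made-up numbers: an estimate of 100,000 against a reference of 1,000
# differs by a factor of 100, i.e. two orders of magnitude.
error = orders_of_magnitude_off(1e5, 1e3)
```

Working in log space treats over- and under-estimates symmetrically, which suits questions whose answers span many scales.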
Submitted 20 December, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Linking Outer Disk Pebble Dynamics and Gaps to Inner Disk Water Enrichment
Authors:
A. Kalyaan,
P. Pinilla,
S. Krijt,
G. D. Mulders,
A. Banzatti
Abstract:
Millimeter continuum imaging of protoplanetary disks reveals the distribution of solid particles and the presence of substructures (gaps and rings) beyond 5-10 au, while infrared (IR) spectra provide access to abundances of gaseous species at smaller disk radii. Building on recent observational findings of an anti-correlation between the inner disk water luminosity and outer dust disk radius, we aim here at investigating the dynamics of icy solids that drift from the outer disk and sublimate their ice inside the snow line, enriching the water vapor that is observed in the IR. We use a volatile-inclusive disk evolution model to explore a range of conditions (gap location, particle size, disk mass, and alpha-viscosity) under which gaps in the outer disk efficiently block the inward drift of icy solids. We find that inner-disk vapor enrichment is highly sensitive to the location of a disk gap, yielding for each particle size a radial "sweet spot" that reduces the inner-disk vapor enrichment to a minimum. For pebbles of 1-10 mm in size, which carry the most mass, this sweet spot is at 7-15 au, suggesting that inner gaps may have a key role in reducing ice delivery to the inner disk and may not allow the formation of Earths and super-Earths. This highlights the importance of observationally determining the presence and properties of inner gaps in disks. Finally, we argue that the inner water vapor abundance can be used as a proxy for estimating the pebble drift efficiency and mass-flux entering the inner disk.
Submitted 6 September, 2021;
originally announced September 2021.
-
Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice
Authors:
Nirbhay Modhe,
Harish Kamath,
Dhruv Batra,
Ashwin Kalyan
Abstract:
This work shows that value-aware model learning, known for its numerous theoretical benefits, is also practically viable for solving challenging continuous control tasks in prevalent model-based reinforcement learning algorithms. First, we derive a novel value-aware model learning objective by bounding the model-advantage, i.e., the model performance difference between two MDPs or models given a fixed policy, achieving superior performance to prior value-aware objectives in most continuous control environments. Second, we identify the issue of stale value estimates in naively substituting value-aware objectives in place of maximum-likelihood in dyna-style model-based RL algorithms. Our proposed remedy to this issue bridges the long-standing gap in theory and practice of value-aware model learning by enabling successful deployment of all value-aware objectives in solving several continuous control robotic manipulation and locomotion tasks. Our results are obtained with minimal modifications to two popular and open-source model-based RL algorithms -- SLBO and MBPO, without tuning any existing hyper-parameters, while also demonstrating better performance of value-aware objectives than these baselines in some environments.
Submitted 28 January, 2022; v1 submitted 26 June, 2021;
originally announced June 2021.
-
Programming Puzzles
Authors:
Tal Schuster,
Ashwin Kalyan,
Oleksandr Polozov,
Adam Tauman Kalai
Abstract:
We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input which makes $f$ return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f$ is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems, to classic programming puzzles (e.g., Tower of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). We develop baseline enumerative program synthesis, GPT-3 and Codex solvers that are capable of solving puzzles -- even without access to any reference solutions -- by learning from their own past solutions. Codex performs best, solving up to 18% of 397 test problems with a single try and 80% of the problems with 1,000 tries per problem. In a small user study, we find a positive correlation between puzzle-solving performance and coding experience, and between the puzzle difficulty for humans and AI solvers. Therefore, further improvements on P3 could have a significant impact on many program synthesis areas.
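The puzzle format described above can be illustrated with a minimal sketch: a verifier `f` and a brute-force enumerative search for an input that makes it return True. The puzzle below is a hypothetical example written for illustration, not one taken from the P3 dataset, and the `solve` helper is an assumed name that is far simpler than the baseline solvers the abstract mentions.

```python
from itertools import product
from typing import Callable, Optional


def f(s: str) -> bool:
    """Hypothetical puzzle verifier (not from P3): find a 3-character
    string that, followed by its own reverse, spells "abccba"."""
    return len(s) == 3 and s + s[::-1] == "abccba"


def solve(verifier: Callable[[str], bool],
          alphabet: str = "abc",
          max_len: int = 4) -> Optional[str]:
    """Try every string over `alphabet`, shortest first, and return the
    first one the verifier accepts, or None if none is found."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if verifier(candidate):
                return candidate
    return None


answer = solve(f)
# No answer key is needed: running the verifier is the whole check.
assert answer is not None and f(answer)
```

Because correctness is defined entirely by `f`, any candidate from any solver can be checked the same way, which is what makes the evaluation objective.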
Submitted 6 November, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Effect of Different Angular Momentum Transport mechanisms on the Distribution of Water in Protoplanetary Disks
Authors:
Anusha Kalyaan,
Steven J. Desch
Abstract:
The snow line in a protoplanetary disk demarcates regions with H$_2$O ice from regions with H$_2$O vapor. Where a planet forms relative to this location determines how much water and other volatiles it forms with. Giant planet formation may be triggered at the water snow line if vapor diffuses outward and is cold-trapped beyond the snow line faster than icy particles can drift inward. In this study we investigate the distribution of water across the snow line, considering three different radial profiles of the turbulence parameter $α(r)$, corresponding to three different angular momentum transport mechanisms. We consider the radial transport of water vapor and icy particles by diffusion, advection, and drift. We show that even for similar values of $α$, the gradient of $α(r)$ across the snow line significantly changes the snow line location, the sharpness of the volatile gradient across the snow line, and the final water/rock ratio in planetary bodies. A profile of radially decreasing $α$, consistent with transport by hydrodynamic instabilities plus magnetic disk winds, appears consistent with the distribution of water in the solar nebula, with monotonically-increasing radial water content and a diverse population of asteroids with different water content. We argue that $Σ(r)$ and water abundance $N_{\rm H_2O}(r)/N_{\rm H_2}(r)$ are likely diagnostic of $α(r)$ and thus the mechanism for angular momentum transport in inner disks.
Submitted 15 March, 2019;
originally announced March 2019.
-
Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations
Authors:
Ashwin Kalyan,
Stefan Lee,
Anitha Kannan,
Dhruv Batra
Abstract:
Many structured prediction problems (particularly in vision and language domains) are ambiguous, with multiple outputs being correct for an input - e.g. there are many ways of describing an image, multiple ways of translating a sentence; however, exhaustively annotating the applicability of all possible outputs is intractable due to exponentially large output spaces (e.g. all English sentences). In practice, these problems are cast as multi-class prediction, with the likelihood of only a sparse set of annotations being maximized - unfortunately penalizing for placing beliefs on plausible but unannotated outputs. We make and test the following hypothesis - for a given input, the annotations of its neighbors may serve as an additional supervisory signal. Specifically, we propose an objective that transfers supervision from neighboring examples. We first study the properties of our developed method in a controlled toy setup before reporting results on multi-label classification and two image-grounded sequence modeling tasks - captioning and question generation. We evaluate using standard task-specific metrics and measures of output diversity, finding consistent improvements over standard maximum likelihood training and other baselines.
Submitted 7 June, 2018;
originally announced June 2018.
-
Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples
Authors:
Ashwin Kalyan,
Abhishek Mohta,
Oleksandr Polozov,
Dhruv Batra,
Prateek Jain,
Sumit Gulwani
Abstract:
Synthesizing user-intended programs from a small number of input-output examples is a challenging problem with several important applications like spreadsheet manipulation, data wrangling and code refactoring. Existing synthesis systems either completely rely on deductive logic techniques that are extensively hand-engineered or on purely statistical models that need massive amounts of data, and in general fail to provide real-time synthesis on challenging benchmarks. In this work, we propose Neural Guided Deductive Search (NGDS), a hybrid synthesis technique that combines the best of both symbolic logic techniques and statistical models. Thus, it produces programs that satisfy the provided specifications by construction and generalize well on unseen examples, similar to data-driven systems. Our technique effectively utilizes the deductive search framework to reduce the learning problem of the neural component to a simple supervised learning setup. Further, this allows us to both train on sparingly available real-world data and still leverage powerful recurrent neural network encoders. We demonstrate the effectiveness of our method by evaluating on real-world customer scenarios by synthesizing accurate programs with up to 12x speed-up compared to state-of-the-art systems.
Submitted 9 September, 2018; v1 submitted 3 April, 2018;
originally announced April 2018.
-
Development of a Lunar Scintillometer as part of the National Large Optical Telescope Site Survey
Authors:
Avinash Surendran,
Padmakar S. Parihar,
Ravinder K Banyal,
Anusha Kalyaan
Abstract:
Ground layer turbulence is a very important site characterization parameter used to assess the quality of an astronomical site. The Lunar Scintillometer is a simple and effective site-testing device for measuring the ground layer turbulence. It consists of a linear array of photodiodes which are sensitive to the slight variations in the moon's brightness due to scintillation by the lower layers of the Earth's atmosphere. The covariance of intensity values between the non-redundant photodiode baselines can be used to measure the turbulence profile from the ground up to a height determined by the furthest pair of detectors. The six-channel lunar scintillometer that has been developed at the Indian Institute of Astrophysics is based closely on an instrument built by the team led by Andrei Tokovinin of Cerro Tololo Inter-American Observatory (CTIO), Chile. We have fabricated the instrument based on the existing electronic design, and have worked on the noise analysis, an EMI (electromagnetic interference) resistant PCB design, and the software pipeline for analyzing the instrument's data. The results from the instrument's multi-year campaign at Mount Saraswati, Hanle, are also presented.
Submitted 1 February, 2018;
originally announced February 2018.
-
The Effect of Jupiter's Formation on the Distribution of Refractory Elements and Inclusions in Meteorites
Authors:
Steven J. Desch,
Anusha Kalyaan,
Conel M. O'D. Alexander
Abstract:
We present a comprehensive evolutionary model of the Sun's protoplanetary disk, constructed to resolve the "CAI Storage" problem of meteoritics. We predict the abundances of calcium-rich, aluminum-rich inclusions (CAIs) and refractory lithophile elements under the central assumption that Jupiter's $\sim 30 \, M_{\oplus}$ core forms at about 3 AU at around 0.6 Myr and opens a gap. CAIs are trapped in the pressure maximum beyond Jupiter; carbonaceous chondrites formed there. Inside Jupiter's orbit, CAIs were depleted by aerodynamic drag; ordinary and enstatite chondrites formed there. For 16 chondrites and achondrites, we review meteoritic data on their CAI and refractory abundances and their times of formation, constrained by radiometric dating and thermal models. We predict their formation locations, finding excellent consistency with other location information (water content, asteroid spectra and parent bodies). We predict the size of particles concentrated by turbulence for each chondrite, finding excellent matches to each chondrite's mean chondrule diameter. These consistencies imply meteorite parent bodies assembled quickly from local materials concentrated by turbulence, and usually did not migrate far. We predict CI chondrites are depleted in refractory lithophile elements relative to the Sun, by about 12% (0.06 dex). We constrain the variation of turbulence parameter $α$ in the disk, and find a limited role for magnetorotational instability, favoring hydrodynamical instabilities in the outer disk, plus magnetic disk winds in the inner disk. Between 3 and 4 Myr at least, gas persisted outside Jupiter but was depleted inside it, and the solar nebula was a transition disk.
Submitted 8 August, 2018; v1 submitted 10 October, 2017;
originally announced October 2017.
-
Formulas for Radial Transport in Protoplanetary Disks
Authors:
Steven J. Desch,
Paul R. Estrada,
Anusha Kalyaan,
Jeffrey N. Cuzzi
Abstract:
Quantification of the radial transport of gaseous species and solid particles is important to many applications in protoplanetary disk evolution. An especially important example is determining the location of the water snow lines in a disk, which requires computing the rates of outward radial diffusion of water vapor and the inward radial drift of icy particles; however, the application is generalized to evaporation fronts of all volatiles. We review the relevant formulas using a uniform formalism. This uniform treatment is necessary because the literature currently contains at least six mutually exclusive treatments of radial diffusion of gas, only one of which is correct. We derive the radial diffusion equations from first principles, using Fick's law. For completeness, we also present the equations for radial transport of particles. These equations may be applied to studies of diffusion of gases and particles in protoplanetary and other accretion disks.
Submitted 5 April, 2017;
originally announced April 2017.
-
A Novel Framework to Expedite Systematic Reviews by Automatically Building Information Extraction Training Corpora
Authors:
Tanmay Basu,
Shraman Kumar,
Abhishek Kalyan,
Priyanka Jayaswal,
Pawan Goyal,
Stephen Pettifer,
Siddhartha R. Jonnalagadda
Abstract:
A systematic review identifies and collates various clinical studies and compares data elements and results in order to provide an evidence-based answer for a particular clinical question. The process is manual and takes a lot of time. A tool to automate this process is lacking. The aim of this work is to develop a framework using natural language processing and machine learning to build information extraction algorithms to identify data elements in a new primary publication, without having to go through the expensive task of manual annotation to build gold standards for each data element type. The system is developed in two stages. Initially, it uses information contained in existing systematic reviews to identify the sentences from the PDF files of the included references that contain specific data elements of interest using a modified Jaccard similarity measure. These sentences are treated as labeled data. A Support Vector Machine (SVM) classifier is trained on this labeled data to extract data elements of interest from a new article. We conducted experiments on Cochrane Database systematic reviews related to congestive heart failure using inclusion criteria as an example data element. The empirical results show that the proposed system automatically identifies sentences containing the data element of interest with a high recall (93.75%) and reasonable precision (27.05% - which means the reviewers have to read only 3.7 sentences on average). The empirical results suggest that the tool is retrieving valuable information from the reference articles, even when it is time-consuming to identify them manually. Thus we hope that the tool will be useful for automatic data extraction from biomedical research publications. The future scope of this work is to generalize this information framework for all types of systematic reviews.
Submitted 21 June, 2016;
originally announced June 2016.
-
External Photoevaporation of the Solar Nebula II: Effects on Disk Structure and Evolution with Non-Uniform Turbulent Viscosity due to the Magnetorotational Instability
Authors:
Anusha Kalyaan,
Steven J Desch,
Nikhil Monga
Abstract:
The structure and evolution of protoplanetary disks, especially the radial flows of gas through them, are sensitive to a number of factors. One that has been considered only occasionally in the literature is external photoevaporation by far-ultraviolet (FUV) radiation from nearby, massive stars, despite the fact that nearly half of all disks will experience photoevaporation. Another effect apparently not considered in the literature is a spatially and temporally varying value of $α$ in the disk [where the turbulent viscosity $ν$ is $α$ times the sound speed C times the disk scale height H]. Here we use the formulation of Bai & Stone (2011) to relate $α$ to the ionization fraction in the disk, assuming turbulent transport of angular momentum is due to the magnetorotational instability. We find that disk evolution is most sensitive to the surface area of dust. Typically $α\lesssim 10^{-5}$ in the inner disk ($< 2$ AU), rising to $\sim 10^{-1}$ beyond 20 AU. This drastically alters the structure of the disk and the flow of mass through it: while the outer disk rapidly viscously spreads, the inner disk hardly evolves; this leads to a steep surface density profile (with a slope $\langle p \rangle \approx 2 - 5$ in the 5-30 AU region) that is made steeper by external photoevaporation. We also find that the combination of variable $α$ and external photoevaporation eventually causes gas as close as 3 AU, previously accreting inward, to be drawn outward to the photoevaporated outer edge of the disk. These effects have drastic consequences for planet formation and volatile transport in protoplanetary disks.
Submitted 17 November, 2015;
originally announced November 2015.
-
Lately Exposed Amorphous Water Ice on Comet 49P/Arend-Rigaux
Authors:
B. Sivaraman,
V. Venkataraman,
A. Kalyaan,
S. Arora,
S. Ganesh
Abstract:
Comet 49P/Arend-Rigaux, thought to be a low-activity comet since the 1980s, was found to be active in its recent apparitions. Recent analysis of the data obtained from Spitzer observation of the comet in 2006, compared with laboratory spectra, has revealed amorphous water ice on the surface. In addition, a jet was found to appear during its subsequent perihelion passage, as witnessed during an observation carried out on 26 March 2012 using the PRL telescope at Mt. Abu. This confirms recent activity of Comet 49P/Arend-Rigaux due to the volatile subsurface materials exposed after several passages close to the Sun. Our result confirms the presence of subsurface ices on cometary nuclei and calls for more observations for a better understanding.
Submitted 13 October, 2014; v1 submitted 17 September, 2014;
originally announced September 2014.