-
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Authors:
Bodhisattwa Prasad Majumder,
Harshit Surana,
Dhruv Agarwal,
Bhavana Dalvi Mishra,
Abhijeetsingh Meena,
Aryan Prakhar,
Tirth Vora,
Tushar Khot,
Ashish Sabharwal,
Peter Clark
Abstract:
Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations across task complexity. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
Submitted 1 July, 2024;
originally announced July 2024.
-
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
Authors:
Peter Jansen,
Marc-Alexandre Côté,
Tushar Khot,
Erin Bransom,
Bhavana Dalvi Mishra,
Bodhisattwa Prasad Majumder,
Oyvind Tafjord,
Peter Clark
Abstract:
Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: www.github.com/allenai/discoveryworld
Submitted 10 June, 2024;
originally announced June 2024.
-
NEATH III: a molecular line survey of a simulated star-forming cloud
Authors:
F. D. Priestley,
P. C. Clark,
S. C. O. Glover,
S. E. Ragan,
O. Fehér,
L. R. Prole,
R. S. Klessen
Abstract:
We present synthetic line observations of a simulated molecular cloud, utilising a self-consistent treatment of the dynamics and time-dependent chemical evolution. We investigate line emission from the three most common CO isotopologues ($^{12}$CO, $^{13}$CO, C$^{18}$O) and six supposed tracers of dense gas (NH$_3$, HCN, N$_2$H$^+$, HCO$^+$, CS, HNC). Our simulation produces a range of line intensities consistent with that observed in real molecular clouds. The HCN-to-CO intensity ratio is relatively invariant with column density, making HCN (and chemically-similar species such as CS) a poor tracer of high-density material in the cloud. The ratio of N$_2$H$^+$ to HCN or CO, on the other hand, is highly selective of regions with densities above $10^{22} \, {\rm cm^{-2}}$, and the N$_2$H$^+$ line is a very good tracer of the dynamics of high volume density ($>10^4 \, {\rm cm^{-3}}$) material. Focusing on cores formed within the simulated cloud, we find good agreement with the line intensities of an observational sample of prestellar cores, including reproducing observed CS line intensities with an undepleted elemental abundance of sulphur. However, agreement between cores formed in the simulation, and models of isolated cores which have otherwise-comparable properties, is poor. The formation from and interaction with the large-scale environment has a significant impact on the line emission properties of the cores, making isolated models unsuitable for interpreting observational data.
Submitted 10 June, 2024;
originally announced June 2024.
-
Can Language Models Serve as Text-Based World Simulators?
Authors:
Ruoyao Wang,
Graham Todd,
Ziang Xiao,
Xingdi Yuan,
Marc-Alexandre Côté,
Peter Clark,
Peter Jansen
Abstract:
Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses and a novel benchmark to track future progress as new models appear.
Submitted 10 June, 2024;
originally announced June 2024.
-
$\textit{Kilonova Seekers}$: the GOTO project for real-time citizen science in time-domain astrophysics
Authors:
T. L. Killestein,
L. Kelsey,
E. Wickens,
L. Nuttall,
J. Lyman,
C. Krawczyk,
K. Ackley,
M. J. Dyer,
F. Jiménez-Ibarra,
K. Ulaczyk,
D. O'Neill,
A. Kumar,
D. Steeghs,
D. K. Galloway,
V. S. Dhillon,
P. O'Brien,
G. Ramsay,
K. Noysena,
R. Kotak,
R. P. Breton,
E. Pallé,
D. Pollacco,
S. Awiphan,
S. Belkin,
P. Chote
, et al. (29 additional authors not shown)
Abstract:
Time-domain astrophysics continues to grow rapidly, with the inception of new surveys drastically increasing data volumes. Democratised, distributed approaches to training sets for machine learning classifiers are crucial to make the most of this torrent of discovery -- with citizen science approaches proving effective at meeting these requirements. In this paper, we describe the creation of and the initial results from the $\textit{Kilonova Seekers}$ citizen science project, built to find transient phenomena from the GOTO telescopes in near real-time. $\textit{Kilonova Seekers}$ launched in July 2023 and received over 600,000 classifications from approximately 2,000 volunteers over the course of the LIGO-Virgo-KAGRA O4a observing run. During this time, the project has yielded 20 discoveries, generated a `gold-standard' training set of 17,682 detections for augmenting deep-learned classifiers, and measured the performance and biases of Zooniverse volunteers on real-bogus classification. This project will continue throughout the lifetime of GOTO, pushing candidates at ever-greater cadence, and directly facilitate the next-generation classification algorithms currently in development.
Submitted 4 June, 2024;
originally announced June 2024.
-
PDDLEGO: Iterative Planning in Textual Environments
Authors:
Li Zhang,
Peter Jansen,
Tianyi Zhang,
Peter Clark,
Chris Callison-Burch,
Niket Tandon
Abstract:
Planning in textual environments has been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially insufficient information to plan for the end-goal. We propose PDDLEGO, which iteratively constructs a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).
Submitted 30 May, 2024;
originally announced May 2024.
-
Learning to Reason via Program Generation, Emulation, and Search
Authors:
Nathaniel Weir,
Muhammad Khalifa,
Linlu Qiu,
Orion Weller,
Peter Clark
Abstract:
Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate their own pseudo-programs; (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at https://github.com/nweir127/CoGEX.
Submitted 28 May, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
Self-consistent modelling of the Milky Way structure using live potentials
Authors:
Eva Durán-Camacho,
Ana Duarte-Cabral,
Alex R. Pettitt,
Robin G. Treß,
Paul C. Clark,
Ralf S. Klessen,
Kamran R. J. Bogue,
Rowan J. Smith,
Mattia C. Sormani
Abstract:
To advance our understanding of the evolution of the interstellar medium (ISM) of our Galaxy, numerical models of Milky Way (MW) type galaxies are widely used. However, most models only vaguely resemble the MW (e.g. in total mass), and often use imposed analytic potentials (which cannot evolve dynamically). This poses a problem in asserting their applicability for the interpretation of observations of our own Galaxy. The goal of this work is to identify a numerical model that is not only a MW-type galaxy, but one that can mimic some of the main observed structures of our Galaxy, using dynamically evolving potentials, so that it can be used as a base model to study the ISM cycle in a galaxy like our own. This paper introduces a suite of 15 MW-type galaxy models developed using the AREPO numerical code, which are compared to Galactic observations of $^{12}$CO and H I emission via longitude-velocity plots, from which we extract and compare the skeletons of major galactic features and the terminal gas velocities. We found that our best-fitting model to the overall structure also reproduces some of the more specific observed features of the MW, including a bar with a pattern speed of $30.0 \pm 0.2$ km\,s$^{-1}$\,kpc$^{-1}$ and a bar half-length of $3.2 \pm 0.8$\,kpc. Our model shows large streaming motions around spiral arms, and strong radial motions well beyond the inner bar. This model highlights the complex motions of a dynamic MW-type galaxy and has the potential to offer valuable insight into how our Galaxy regulates the ISM and star formation.
Submitted 11 June, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Assessing the accuracy of the star formation rate measurements by direct star count in molecular clouds
Authors:
Sami Dib,
Jian Wen Zhou,
Sébastien Comerón,
Luis E. Garduño,
Valery V. Kravtsov,
Paul C. Clark,
Guang-Xing Li,
Maritza A. Lara-López,
Tie Liu,
Mohsen Shadmehri,
James R. Doughty
Abstract:
Star formation estimates based on the counting of YSOs are commonly applied to nearby star-forming regions in the Galaxy. With this method, the SFRs are measured using the counts of YSOs in a particular protostellar Class, a typical protostellar mass, and the lifetime associated with this Class. However, the assumptions underlying the validity of the method, such as that of a constant star formation history (SFH) and whether the method is valid for all protostellar Classes, have never been fully tested. In this work, we use Monte Carlo models to test the validity of the method. We build synthetic clusters in which stars form at times that are randomly drawn from a specified SFH. The latter is either constant or time-dependent with a burst-like behavior. The masses of the protostars are randomly drawn from an IMF which can be either similar to that of the Milky Way field or variable. For each star in every cluster, the lifetimes associated with the different protostellar classes are also randomly drawn from Gaussian distribution functions centered around their most likely value as suggested by the observations. We find that only the SFR derived using the Class 0 population can reproduce the true SFR at all epochs, and this is true irrespective of the shape of the SFH. For a constant SFH, the SFRs derived using the more evolved populations of protostars (Classes I, F, II, and III) reproduce the real SFR only at later epochs which correspond to epochs at which their numbers have reached a steady state. For a time-dependent burst-like SFH, all SFR estimates based on the number counts of the evolved populations fail to reproduce the true SFR. We also show how the offsets between Class I and Class II based SFRs and the true SFR plotted as a function of the number ratios of Class I and Class II versus Class III YSOs can be used in order to constrain the SFH of observed molecular clouds.
Submitted 30 April, 2024;
originally announced May 2024.
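The counting estimate described in the abstract above (YSO count, times a typical mass, divided by the Class lifetime) can be illustrated with a small sketch. This is not the authors' code: the function names, the fixed Class lifetime (the paper draws lifetimes from Gaussian distributions), the single typical mass in place of an IMF, and all numerical values are assumptions chosen only for illustration.

```python
import random


def yso_count_sfr(n_yso, mean_mass_msun, lifetime_myr):
    """Counting estimate of the SFR in Msun/Myr: N * <M> / tau."""
    return n_yso * mean_mass_msun / lifetime_myr


def count_in_class(true_sfr, mean_mass_msun, lifetime_myr, duration_myr, seed=0):
    """Count stars still inside a Class at time `duration_myr`.

    Assumes a constant star formation history, so birth times are drawn
    uniformly, and a fixed (hypothetical) lifetime for the Class.
    """
    rng = random.Random(seed)
    # Number of stars formed over the whole duration at the true SFR.
    n_stars = int(true_sfr * duration_myr / mean_mass_msun)
    count = 0
    for _ in range(n_stars):
        birth = rng.uniform(0.0, duration_myr)
        age = duration_myr - birth
        if age < lifetime_myr:
            count += 1
    return count


if __name__ == "__main__":
    true_sfr = 1000.0   # Msun/Myr, hypothetical
    mean_mass = 0.5     # Msun, hypothetical typical mass
    lifetime = 0.1      # Myr, hypothetical short-lived Class
    n = count_in_class(true_sfr, mean_mass, lifetime, duration_myr=2.0)
    print(n, yso_count_sfr(n, mean_mass, lifetime))
```

Under these assumptions the estimate scatters around the true rate because only the stars born within the last `lifetime_myr` are counted; a time-dependent history would replace the uniform draw of birth times.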
-
I'm in AGNi: A new standard for AGN pluralisation
Authors:
Andrew D. Gow,
Peter Clark,
Dan Rycanowski
Abstract:
We present a new standard acronym for Active Galactic Nuclei, finally settling the argument of AGN vs. AGNs. Our new standard is not only etymologically superior (following the consensus set by SNe), but also boasts other linguistic opportunities, connecting strongly with relevant theology and streamlining descriptions of AGN properties.
Submitted 29 March, 2024;
originally announced March 2024.
-
N$_2$H$^+$(1-0) as a tracer of dense gas in and between spiral arms
Authors:
O. Feher,
S. E. Ragan,
F. D. Priestley,
P. C. Clark,
T. J. T. Moore
Abstract:
Recent advances in identifying giant molecular filaments in galactic surveys allow us to study the interstellar material and its dense, potentially star forming phase on scales comparable to resolved extragalactic clouds. Two large filaments detected in the CHIMPS $^{13}$CO(3-2) survey, one in the Sagittarius-arm and one in an inter-arm region, were mapped with dense gas tracers inside a 0.06 deg$^2$ area and with a spatial resolution of around 0.4 and 0.65 pc at the distance of the targets using the IRAM 30m telescope, to investigate the environmental dependence of the dense gas fraction. The N$_2$H$^+$(1-0) transition, an excellent tracer of the dense gas, was detected in parsec-scale, elliptical clumps and with a filling factor of around 8.5% in our maps. The N$_2$H$^+$-emitting areas appear to have higher dense gas fraction (e.g. the ratio of N$_2$H$^+$ and $^{13}$CO emission) in the inter-arm than in the arm which is opposite to the behaviour found by previous studies, using dust emission rather than N$_2$H$^+$ as a tracer of dense gas. However, the arm filament is brighter in $^{13}$CO and the infrared emission of dust, and the dense gas fraction determined as above is governed by the $^{13}$CO brightness. We caution that measurements regarding the distribution and fraction of dense gas on these scales may be influenced by many scale- and environment-dependent factors, as well as the chemistry and excitation of the particular tracers, then consider several scenarios that can reproduce the observed effect.
Submitted 28 March, 2024;
originally announced March 2024.
-
PROC2PDDL: Open-Domain Planning Representations from Texts
Authors:
Tianyi Zhang,
Li Zhang,
Zhaoyi Hou,
Ziyu Wang,
Yuling Gu,
Peter Clark,
Chris Callison-Burch,
Niket Tandon
Abstract:
Planning in a text-based environment continues to be a major challenge for AI systems. Recent approaches have used language models to predict a planning domain definition (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL, the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations. Using this dataset, we evaluate state-of-the-art models on defining the preconditions and effects of actions. We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%. Our analysis shows both syntactic and semantic errors, indicating LMs' deficiency in both generating domain-specific programs and reasoning about events. We hope this analysis and dataset help future progress towards integrating the best of LMs and formal planning.
Submitted 2 July, 2024; v1 submitted 29 February, 2024;
originally announced March 2024.
-
The rate of extreme coronal line emitting galaxies in the Sloan Digital Sky Survey and their relation to tidal disruption events
Authors:
Joseph Callow,
Or Graur,
Peter Clark,
Antonella Palmese,
Jessica Aguilar,
Steven Ahlen,
Segev BenZvi,
David Brooks,
Todd Claybaugh,
Axel de la Macorra,
Peter Doel,
Jaime E. Forero-Romero,
Enrique Gaztañaga,
Satya Gontcho A Gontcho,
Andrew Lambert,
Martin Landriau,
Marc Manera,
Aaron Meisner,
Ramon Miquel,
John Moustakas,
Jundan Nie,
Claire Poppett,
Francisco Prada,
Mehdi Rezaie,
Graziano Rossi
, et al. (5 additional authors not shown)
Abstract:
Strong high-ionization iron coronal lines (CLs) are a rare phenomenon observed in galaxy and quasi-stellar object spectra that are thought to be created as a result of tidal disruption event (TDE) flares. To test whether these CLs are the result of TDE activity, we search for extreme coronal line emitting galaxies (ECLEs) in the Sloan Digital Sky Survey (SDSS), measure their rate, and compare it to TDE rates from the literature. We detect sufficiently strong CLs in 14 objects, doubling the number previously found in SDSS. Using follow-up spectra from the Dark Energy Spectroscopic Instrument and Gemini Multi-Object Spectrograph, Wide-field Infrared Survey Explorer mid-infrared observations, and Liverpool Telescope optical photometry, we find that of the seven new objects, only one evolves in a manner consistent with that of the five previously discovered variable ECLEs. Using this new sample of six variable ECLEs, we calculate the galaxy-normalised rate of ECLEs in SDSS to be $R_\mathrm{G}=2.2~^{+1.3}_{-0.8}~(\mathrm{statistical})~^{+0.0}_{-1.3}~(\mathrm{systematic})\times10^{-5}~\mathrm{galaxy}^{-1}~\mathrm{year}^{-1}$. The mass-normalised rate is $R_\mathrm{M}=1.9~^{+1.1}_{-0.7}~(\mathrm{statistical})~^{+0.0}_{-1.1}~(\mathrm{systematic})\times10^{-16}~\mathrm{M_\odot^{-1}}~\mathrm{year}^{-1}$ and the volumetric rate is $R_\mathrm{V}=6.9~^{+5.6}_{-2.1}~(\mathrm{statistical})~^{+0.0}_{-3.9}~(\mathrm{systematic})\times10^{-8}~\mathrm{Mpc}^{-3}~\mathrm{year}^{-1}$. Our rates are comparable to TDE rates from the literature, supporting the suggestion that the CLs in variable ECLEs are the product of TDEs.
Submitted 26 February, 2024;
originally announced February 2024.
-
Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic
Authors:
Nathaniel Weir,
Kate Sanders,
Orion Weller,
Shreya Sharma,
Dongwei Jiang,
Zhengping Jiang,
Bhavana Dalvi Mishra,
Oyvind Tafjord,
Peter Jansen,
Peter Clark,
Benjamin Van Durme
Abstract:
Contemporary language models enable new opportunities for structured reasoning with text, such as the construction and evaluation of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment datasets, and evaluate its impact on LLM-based textual inference. We find that our resulting dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in a modern neuro-symbolic reasoning engine significantly improves results (both accuracy and proof quality) over other entailment classifier baselines, illustrating the practical benefit of this advance for textual inference.
Submitted 27 February, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Data-driven Discovery with Large Generative Models
Authors:
Bodhisattwa Prasad Majumder,
Harshit Surana,
Dhruv Agarwal,
Sanchaita Hazra,
Ashish Sabharwal,
Peter Clark
Abstract:
With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.
Submitted 21 February, 2024;
originally announced February 2024.
-
Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills
Authors:
Kolby Nottingham,
Bodhisattwa Prasad Majumder,
Bhavana Dalvi Mishra,
Sameer Singh,
Peter Clark,
Roy Fox
Abstract:
Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO's ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.
Submitted 22 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Authors:
Peter Hase,
Mohit Bansal,
Peter Clark,
Sarah Wiegreffe
Abstract:
How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear classifier heads, and QLoRA for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization
Submitted 5 June, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
Authors:
Peter Clark,
Bhavana Dalvi Mishra,
Oyvind Tafjord
Abstract:
While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and using a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
Submitted 23 March, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Heavy Black Hole Seed Formation in High-z Atomic Cooling Halos
Authors:
Lewis R. Prole,
John A. Regan,
Simon C. O. Glover,
Ralf S. Klessen,
Felix D. Priestley,
Paul C. Clark
Abstract:
Halos with masses in excess of the atomic limit are believed to be ideal environments in which to form heavy black hole seeds with masses above 10^3 Msun. In cases where the H_2 fraction is suppressed this is expected to lead to reduced fragmentation of the gas and the generation of a top heavy initial mass function. In extreme cases this can result in the formation of massive black hole seeds. Resolving the initial fragmentation scale and the resulting protostellar masses has, until now, not been robustly tested. Cosmological simulations were performed with the moving mesh code Arepo using a primordial chemistry network until z = 11. Three haloes with masses in excess of the atomic cooling mass were then selected for detailed examination via zoom-ins. The highest resolution simulations resolve densities up to 10^-6 g cm^-3 (10^18 cm^-3) and capture a further 100 yr of fragmentation behaviour at the center of the halo. Our simulations show intense fragmentation in the central region of the halos, leading to a large number of near-solar mass protostars. Despite the increased fragmentation the halos produce a protostellar mass spectrum that peaks at higher masses relative to standard Population III star forming halos. The most massive protostars have accretion rates of 10^-3-10^-1 Msun yr^-1 after the first 100 years of evolution, while the total mass of the central region grows at 1 Msun yr^-1. Lower resolution zoom-ins show that the total mass of the system continues to accrete at 1 Msun yr^-1 for at least 10^4 yr, although how this mass is distributed amongst the rapidly growing number of protostars is unclear. However, assuming that a fraction of stars can continue to accrete rapidly the formation of a sub-population of stars with masses in excess of 10^3 Msun is likely in these halos.
Submitted 11 December, 2023;
originally announced December 2023.
-
Light-Curve Structure and Halpha Line Formation in the Tidal Disruption Event AT 2019azh
Authors:
Sara Faris,
Iair Arcavi,
Lydia Makrygianni,
Daichi Hiramatsu,
Giacomo Terreran,
Joseph Farah,
D. Andrew Howell,
Curtis McCully,
Megan Newsome,
Estefania Padilla Gonzalez,
Craig Pellegrino,
K. Azalee Bostroem,
Wiam Abojanb,
Marco C. Lam,
Lina Tomasella,
Thomas G. Brink,
Alexei V. Filippenko,
K. Decker French,
Peter Clark,
Or Graur,
Giorgos Leloudas,
Mariusz Gromadzki,
Joseph P. Anderson,
Matt Nicholl,
Claudia P. Gutierrez
, et al. (11 additional authors not shown)
Abstract:
AT 2019azh is a H+He tidal disruption event (TDE) with one of the most extensive ultraviolet and optical datasets available to date. We present our photometric and spectroscopic observations of this event starting several weeks before and out to approximately two years after g-band peak brightness and combine them with public photometric data. This extensive dataset robustly reveals a change in the light-curve slope and a bump in the rising light curve of a TDE for the first time, which may indicate more than one dominant emission mechanism contributing to the pre-peak light curve. We further confirm the relation seen in previous TDEs whereby the redder emission peaks later than the bluer emission. The post-peak bolometric light curve of AT 2019azh is better described by an exponential decline than by the canonical t^{-5/3} (and in fact any) power-law decline. We find a possible mid-infrared excess around peak optical luminosity, but cannot determine its origin. In addition, we provide the earliest measurements of the Halpha emission-line evolution and find no significant time delay between the peak of the V-band light curve and that of the Halpha luminosity. These results can be used to constrain future models of TDE line formation and emission mechanisms in general. More pre-peak 1-2 day cadence observations of TDEs are required to determine whether the characteristics observed here are common among TDEs. More importantly, detailed emission models are needed to fully exploit such observations for understanding the emission physics of TDEs.
Submitted 6 December, 2023;
originally announced December 2023.
-
Overview of the distributed image processing infrastructure to produce the Legacy Survey of Space and Time
Authors:
Fabio Hernandez,
George Beckett,
Peter Clark,
Matt Doidge,
Tim Jenness,
Edward Karavakis,
Quentin Le Boulc'h,
Peter Love,
Gabriele Mainetti,
Timothy Noble,
Brandon White,
Wei Yang
Abstract:
The Vera C. Rubin Observatory is preparing to execute the most ambitious astronomical survey ever attempted, the Legacy Survey of Space and Time (LSST). Currently the final phase of construction is under way in the Chilean Andes, with the Observatory's ten-year science mission scheduled to begin in 2025. Rubin's 8.4-meter telescope will nightly scan the southern hemisphere collecting imagery in the wavelength range 320-1050 nm covering the entire observable sky every 4 nights using a 3.2 gigapixel camera, the largest imaging device ever built for astronomy. Automated detection and classification of celestial objects will be performed by sophisticated algorithms on high-resolution images to progressively produce an astronomical catalog eventually composed of 20 billion galaxies and 17 billion stars and their associated physical properties.
In this article we present an overview of the system currently being constructed to perform data distribution as well as the annual campaigns which reprocess the entire image dataset collected since the beginning of the survey. These processing campaigns will utilize computing and storage resources provided by three Rubin data facilities (one in the US and two in Europe). Each year a Data Release will be produced and disseminated to science collaborations for use in studies comprising four main science pillars: probing dark matter and dark energy, taking inventory of solar system objects, exploring the transient optical sky and mapping the Milky Way.
Also presented is the method by which we leverage some of the common tools and best practices used for management of large-scale distributed data processing projects in the high energy physics and astronomy communities. We also demonstrate how these tools and practices are utilized within the Rubin project in order to overcome the specific challenges faced by the Observatory.
Submitted 23 November, 2023;
originally announced November 2023.
-
Functional degrees and arithmetic applications III: Beyond Prime Exponent
Authors:
Pete L. Clark,
Uwe Schauz
Abstract:
Continuing our work on group-theoretic generalizations of the prime Ax-Katz Theorem, we give a lower bound on the $p$-adic divisibility of the cardinality of the set of simultaneous zeros $Z(f_1,f_2,\ldots,f_r)$ of $r$ maps $f_j:A\rightarrow B_j$ between arbitrary finite commutative groups $A$ and $B_j$ in terms of the invariant factors of $A, B_1,B_2,\dotsc,B_r$ and the \emph{functional degrees} of the maps $f_1,f_2,\dotsc,f_r$.
Submitted 4 July, 2024; v1 submitted 17 November, 2023;
originally announced November 2023.
-
Digital Socrates: Evaluating LLMs through Explanation Critiques
Authors:
Yuling Gu,
Oyvind Tafjord,
Peter Clark
Abstract:
While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing - identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critique model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.
Submitted 16 February, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Leveraging Code to Improve In-context Learning for Semantic Parsing
Authors:
Ben Bogin,
Shivanshu Gupta,
Peter Clark,
Ashish Sabharwal
Abstract:
In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we improve the effectiveness of ICL for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs, and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets. Combined, they lead to dramatic improvements (e.g. 7.9% to 66.5% on SMCalFlow compositional split), nearly closing the performance gap between easier i.i.d. and harder compositional splits when used with a strong model, and reducing the need for a large number of demonstrations. We find that the resemblance of the target parse language to general-purpose code is a more important factor than the language's popularity in pre-training corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.
Submitted 27 March, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization
Authors:
Yash Kumar Lal,
Li Zhang,
Faeze Brahman,
Bodhisattwa Prasad Majumder,
Peter Clark,
Niket Tandon
Abstract:
How-to procedures, such as how to plant a garden, are now used by millions of users, but sometimes need customizing to meet a user's specific needs, e.g., planting a garden without pesticides. Our goal is to measure and improve an LLM's ability to perform such customization. Our approach is to test several simple multi-LLM-agent architectures for customization, as well as an end-to-end LLM, using a new evaluation set, called CustomPlans, of over 200 WikiHow procedures each with a customization need. We find that a simple architecture with two LLM agents used sequentially performs best, one that edits a generic how-to procedure and one that verifies its executability, significantly outperforming (10.5% absolute) an end-to-end prompted LLM. This suggests that LLMs can be configured reasonably effectively for procedure customization. This also suggests that multi-agent editing architectures may be worth exploring further for other customization applications (e.g. coding, creative writing) in the future.
Submitted 30 May, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Data downloaded via parachute from a NASA super-pressure balloon
Authors:
Ellen L. Sirks,
Richard Massey,
Ajay S. Gill,
Jason Anderson,
Steven J. Benton,
Anthony M. Brown,
Paul Clark,
Joshua English,
Spencer W. Everett,
Aurelien A. Fraisse,
Hugo Franco,
John W. Hartley,
David Harvey,
Bradley Holder,
Andrew Hunter,
Eric M. Huff,
Andrew Hynous,
Mathilde Jauzac,
William C. Jones,
Nikky Joyce,
Duncan Kennedy,
David Lagattuta,
Jason S. -Y. Leung,
Lun Li,
Stephen Lishman
, et al. (18 additional authors not shown)
Abstract:
In April to May 2023, the superBIT telescope was lifted to the Earth's stratosphere by a helium-filled super-pressure balloon, to acquire astronomical imaging from above (99.5% of) the Earth's atmosphere. It was launched from New Zealand then, for 40 days, circumnavigated the globe five times at a latitude 40 to 50 degrees South. Attached to the telescope were four 'DRS' (Data Recovery System) capsules containing 5 TB solid state data storage, plus a GNSS receiver, Iridium transmitter, and parachute. Data from the telescope were copied to these, and two were dropped over Argentina. They drifted 61 km horizontally while they descended 32 km, but we predicted their descent vectors within 2.4 km: in this location, the discrepancy appears irreducible below 2 km because of high speed, gusty winds and local topography. The capsules then reported their own locations to within a few metres. We recovered the capsules and successfully retrieved all of superBIT's data - despite the telescope itself being later destroyed on landing.
Submitted 14 November, 2023;
originally announced November 2023.
-
ADaPT: As-Needed Decomposition and Planning with Language Models
Authors:
Archiki Prasad,
Alexander Koller,
Mareike Hartmann,
Peter Clark,
Ashish Sabharwal,
Mohit Bansal,
Tushar Khot
Abstract:
Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment. Recent works employ LLMs-as-agents in broadly two ways: iteratively determining the next action (iterative executors) or generating plans and executing sub-tasks using LLMs (plan-and-execute). However, these methods struggle with task complexity, as the inability to execute any sub-task may lead to task failure. To address these shortcomings, we introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT), an approach that explicitly plans and decomposes complex sub-tasks as-needed, i.e., when the LLM is unable to execute them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity and LLM capability. Our results demonstrate that ADaPT substantially outperforms established strong baselines, achieving success rates up to 28.3% higher in ALFWorld, 27% in WebShop, and 33% in TextCraft -- a novel compositional dataset that we introduce. Through extensive analysis, we illustrate the importance of multilevel decomposition and establish that ADaPT dynamically adjusts to the capabilities of the executor LLM as well as to task complexity.
Submitted 8 April, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs
Authors:
Shashank Gupta,
Vaishnavi Shrivastava,
Ameet Deshpande,
Ashwin Kalyan,
Peter Clark,
Ashish Sabharwal,
Tushar Khot
Abstract:
Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.
Submitted 27 January, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
QualEval: Qualitative Evaluation for Model Improvement
Authors:
Vishvak Murahari,
Ameet Deshpande,
Peter Clark,
Tanmay Rajpurohit,
Ashish Sabharwal,
Karthik Narasimhan,
Ashwin Kalyan
Abstract:
Quantitative evaluation metrics have traditionally been pivotal in gauging the advancements of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar to quantify and compare is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a way to compare and benchmark models, and do not yield actionable diagnostics, thus making the model improvement process challenging. Model developers find themselves amid extensive manual efforts involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace of model development, thus in essence serving as a data-scientist-in-a-box. Given the focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new technique for both model evaluation and improvement.
Submitted 5 May, 2024; v1 submitted 5 November, 2023;
originally announced November 2023.
-
Population III star formation: multiple gas phases prevent the use of an equation of state at high densities
Authors:
Lewis R. Prole,
Paul C. Clark,
Felix D. Priestley,
Simon C. O. Glover,
John A. Regan
Abstract:
Advanced primordial chemistry networks have been developed to model the collapse of metal-free baryonic gas within the gravitational well of dark matter (DM) halos and its subsequent collapse into Population III stars. At the low densities of 10^-26-10^-21 g cm^-3 (10^-3-10^2 cm^-3) the collapse is dependent on H2 production, which is a function of the compressional heating provided by the DM potential. Once the gas decouples from the DM, the temperature-density relationship follows a well established path dictated by various chemical reactions until the formation of the protostar at 10^-4 g cm^-3 (10^19 cm^-3). Here we explore the feasibility of replacing the chemical network (CN) with a barotropic equation of state (EoS) just before the formation of the first protostar, to reduce the computational load of simulating the further fragmentation, evolution and characteristics of the very high density gas. We find a significant reduction in fragmentation when using the EoS. The EoS method produces a protostellar mass distribution that peaks at higher masses when compared to CN runs. The change in fragmentation behaviour is due to a lack of cold gas falling in through the disc around the first protostar when using an EoS. Despite this, the total mass accreted across all sinks was invariant to the switch to an EoS, hence the star formation rate (Msun yr^-1) is accurately predicted using an EoS. The EoS routine is approximately 4000 times faster than the CN, however this numerical gain is offset by the lack of accuracy in modelling secondary protostar formation and hence its use must be considered carefully.
Submitted 19 January, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Authors:
Bodhisattwa Prasad Majumder,
Bhavana Dalvi Mishra,
Peter Jansen,
Oyvind Tafjord,
Niket Tandon,
Li Zhang,
Chris Callison-Burch,
Peter Clark
Abstract:
Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory centered on causal abstractions (rather than general "helpful hints") that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time.
Submitted 16 October, 2023;
originally announced October 2023.
-
NEATH II: N$_2$H$^+$ as a tracer of imminent star formation in quiescent high-density gas
Authors:
F. D. Priestley,
P. C. Clark,
S. C. O. Glover,
S. E. Ragan,
O. Fehér,
L. R. Prole,
R. S. Klessen
Abstract:
Star formation activity in molecular clouds is often found to be correlated with the amount of material above a column density threshold of $\sim 10^{22} \, {\rm cm^{-2}}$. Attempts to connect this column density threshold to a ${\it volume}$ density above which star formation can occur are limited by the fact that the volume density of gas is difficult to reliably measure from observations. We post-process hydrodynamical simulations of molecular clouds with a time-dependent chemical network, and investigate the connection between commonly-observed molecular species and star formation activity. We find that many molecules widely assumed to specifically trace the dense, star-forming component of molecular clouds (e.g. HCN, HCO$^+$, CS) actually also exist in substantial quantities in material only transiently enhanced in density, which will eventually return to a more diffuse state without forming any stars. By contrast, N$_2$H$^+$ only exists in detectable quantities above a volume density of $10^4 \, {\rm cm^{-3}}$, the point at which CO, which reacts destructively with N$_2$H$^+$, begins to deplete out of the gas phase onto grain surfaces. This density threshold for detectable quantities of N$_2$H$^+$ corresponds very closely to the volume density at which gas becomes irreversibly gravitationally bound in the simulations: the material traced by N$_2$H$^+$ never reverts to lower densities, and quiescent regions of molecular clouds with visible N$_2$H$^+$ emission are destined to eventually form stars. The N$_2$H$^+$ line intensity is likely to directly correlate with the star formation rate averaged over timescales of around a Myr.
Submitted 9 October, 2023;
originally announced October 2023.
-
GW190425: Pan-STARRS and ATLAS coverage of the skymap and limits on optical emission associated with FRB190425
Authors:
S. J. Smartt,
M. Nicholl,
S. Srivastav,
M. E. Huber,
K. C. Chambers,
K. W. Smith,
D. R. Young,
M. D. Fulton,
J. L. Tonry,
C. W. Stubbs,
L. Denneau,
A. J. Cooper,
A. Aamer,
J. P. Anderson,
A. Andersson,
J. Bulger,
T.-W. Chen,
P. Clark,
T. de Boer,
H. Gao,
J. H. Gillanders,
A. Lawrence,
C. C. Lin,
T. B. Lowe,
E. A. Magnier
, et al. (10 additional authors not shown)
Abstract:
GW190425 is the second of only two binary neutron star (BNS) merger events to be significantly detected by the LIGO-Virgo-KAGRA gravitational wave detectors. With a detection only in LIGO Livingston, the skymap containing the source was large and no plausible electromagnetic counterpart was found in real time searching in 2019. Here we summarise our ATLAS and Pan-STARRS wide-field optical coverage of the skymap beginning within 1 hour and 3 hours respectively of the GW190425 merger time. More recently, a potential coincidence between GW190425 and a fast radio burst FRB 190425 has been suggested, given their spatial and temporal coincidence. The smaller sky localisation area of FRB 190425 and its dispersion measure have led to the identification of a likely host galaxy, UGC 10667 at a distance of 141 +/- 10 Mpc. Our optical imaging covered the galaxy 6.0 hrs after GW190425 was detected and 3.5 hrs after the FRB 190425. No optical emission was detected and further imaging at +1.2 and +13.2 days also revealed no emission. If the FRB 190425 and GW190425 association were real, we highlight our limits on kilonova emission from a BNS merger in UGC 10667. The model for producing FRB 190425 from a BNS merger involves a supramassive magnetised neutron star spinning down by dipole emission on the timescale of hours. We show that magnetar enhanced kilonova emission is ruled out by optical upper limits. The lack of detected optical emission from a kilonova in UGC 10667 disfavours, but does not disprove, the FRB-GW link for this source.
Submitted 20 September, 2023;
originally announced September 2023.
-
Non-Equilibrium Abundances Treated Holistically (NEATH): the molecular composition of star-forming clouds
Authors:
F. D. Priestley,
P. C. Clark,
S. C. O. Glover,
S. E. Ragan,
O. Fehér,
L. R. Prole,
R. S. Klessen
Abstract:
Much of what we know about molecular clouds, and by extension star formation, comes from molecular line observations. Interpreting these correctly requires knowledge of the underlying molecular abundances. Simulations of molecular clouds typically only model species that are important for the gas thermodynamics, which tend to be poor tracers of the denser material where stars form. We construct a framework for post-processing these simulations with a full time-dependent chemical network, allowing us to model the behaviour of observationally-important species not present in the reduced network used for the thermodynamics. We use this to investigate the chemical evolution of molecular gas under realistic physical conditions. We find that molecules can be divided into those which reach peak abundances at moderate densities ($10^3 \, {\rm cm^{-3}}$) and decline sharply thereafter (such as CO and HCN), and those which peak at higher densities and then remain roughly constant (e.g. NH$_3$, N$_2$H$^+$). Evolving the chemistry with physical properties held constant at their final values results in a significant overestimation of gas-phase abundances for all molecules, and does not capture the drastic variations in abundance caused by different evolutionary histories. The dynamical evolution of molecular gas cannot be neglected when modelling its chemistry.
Submitted 24 July, 2023;
originally announced July 2023.
-
Lensing in the Blue II: Estimating the Sensitivity of Stratospheric Balloons to Weak Gravitational Lensing
Authors:
Jacqueline E. McCleary,
Spencer W. Everett,
Mohamed M. Shaaban,
Ajay S. Gill,
Georgios N. Vassilakis,
Eric M. Huff,
Richard J. Massey,
Steven J. Benton,
Anthony M. Brown,
Paul Clark,
Bradley Holder,
Aurelien A. Fraisse,
Mathilde Jauzac,
William C. Jones,
David Lagattuta,
Jason S.-Y. Leung,
Lun Li,
Thuy Vy T. Luu,
Johanna M. Nagy,
C. Barth Netterfield,
Emaad Paracha,
Susan F. Redmond,
Jason D. Rhodes,
Jürgen Schmoll,
Ellen Sirks
, et al. (1 additional author not shown)
Abstract:
The Superpressure Balloon-borne Imaging Telescope (SuperBIT) is a diffraction-limited, wide-field, 0.5 m, near-infrared to near-ultraviolet observatory designed to exploit the stratosphere's space-like conditions. SuperBIT's 2023 science flight will deliver deep, blue imaging of galaxy clusters for gravitational lensing analysis. In preparation, we have developed a weak lensing measurement pipeline with modern algorithms for PSF characterization, shape measurement, and shear calibration. We validate our pipeline and forecast SuperBIT survey properties with simulated galaxy cluster observations in SuperBIT's near-UV and blue bandpasses. We predict imaging depth, galaxy number (source) density, and redshift distribution for observations in SuperBIT's three bluest filters; the effect of lensing sample selections is also considered. We find that in three hours of on-sky integration, SuperBIT can attain a depth of b = 26 mag and a total source density exceeding 40 galaxies per square arcminute. Even with the application of lensing-analysis catalog selections, we find b-band source densities between 25 and 30 galaxies per square arcminute with a median redshift of z = 1.1. Our analysis confirms SuperBIT's capability for weak gravitational lensing measurements in the blue.
Submitted 6 July, 2023;
originally announced July 2023.
-
Long-term follow-up observations of extreme coronal line emitting galaxies
Authors:
Peter Clark,
Or Graur,
Joseph Callow,
Jessica Aguilar,
Steven Ahlen,
Joseph P. Anderson,
Edo Berger,
Thomas Brink,
David Brooks,
Ting-Wan Chen,
Todd Claybaugh,
Axel de la Macorra,
Peter Doel,
Alexei Filippenko,
Jamie Forero-Romero,
Sebastian Gomez,
Mariusz Gromadzki,
Klaus Honscheid,
Cosimo Inserra,
Theodore Kisner,
Martin Landriau,
Lydia Makrygianni,
Marc Manera,
Aaron Meisner,
Ramon Miquel
, et al. (18 additional authors not shown)
Abstract:
We present new spectroscopic and photometric follow-up observations of the known sample of extreme coronal line emitting galaxies (ECLEs) identified in the Sloan Digital Sky Survey (SDSS). With these new data, observations of the ECLE sample now span a period of two decades following their initial SDSS detections. We confirm the nonrecurrence of the iron coronal line signatures in five of the seven objects, further supporting their identification as the transient light echoes of tidal disruption events (TDEs). Photometric observations of these objects in optical bands show little overall evolution. In contrast, mid-infrared (MIR) observations show ongoing long-term declines. The remaining two objects had been classified as active galactic nuclei (AGN) with unusually strong coronal lines rather than being TDE related, given the persistence of the coronal lines in earlier follow-up spectra. We confirm this classification, with our spectra continuing to show the presence of strong, unchanged coronal-line features and AGN-like MIR colours and behaviour. We have constructed spectral templates of both subtypes of ECLE to aid in distinguishing the likely origin of newly discovered ECLEs. We highlight the need for higher cadence, and more rapid, follow-up observations of such objects to better constrain their properties and evolution. We also discuss the relationships between ECLEs, TDEs, and other identified transients having significant MIR variability.
Submitted 4 March, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
-
Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy
Authors:
Sarah Wiegreffe,
Matthew Finlayson,
Oyvind Tafjord,
Peter Clark,
Ashish Sabharwal
Abstract:
When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as "bath" and "bathtub") is thought to cause an underestimation of a model's true performance, referred to as the "surface form competition" (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC? Are there direct ways of reducing it, and does doing so improve task performance?
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time. We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example. We show this method eliminates the impact of SFC in the majority of instances. Our experiments on three diverse datasets and six LMs reveal several additional surprising findings. For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs. We conclude with practical insights for effectively prompting LMs for multiple-choice tasks.
Submitted 31 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Let GPT be a Math Tutor: Teaching Math Word Problem Solvers with Customized Exercise Generation
Authors:
Zhenwen Liang,
Wenhao Yu,
Tanmay Rajpurohit,
Peter Clark,
Xiangliang Zhang,
Ashwin Kalyan
Abstract:
In this paper, we present a novel approach for distilling math word problem solving capabilities from large language models (LLMs) into smaller, more efficient student models. Our approach is designed to consider the student model's weaknesses and foster a tailored learning experience by generating targeted exercises aligned with educational science principles, such as knowledge tracing and personalized learning. Concretely, we let GPT-3 be a math tutor and run two steps iteratively: 1) assessing the student model's current learning status on a GPT-generated exercise book, and 2) improving the student model by training it with tailored exercise samples generated by GPT-3. Experimental results reveal that our approach outperforms LLMs (e.g., GPT-3 and PaLM) in accuracy across three distinct benchmarks while employing significantly fewer parameters. Furthermore, we provide a comprehensive analysis of the various components within our methodology to substantiate their efficacy.
Submitted 22 May, 2023;
originally announced May 2023.
-
Language Models with Rationality
Authors:
Nora Kassner,
Oyvind Tafjord,
Ashish Sabharwal,
Kyle Richardson,
Hinrich Schuetze,
Peter Clark
Abstract:
While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent "beliefs". This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a rational, self-reflecting layer on top of the LLM. First, given a question, we construct a belief graph using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.
Submitted 29 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions
Authors:
Wenhao Yu,
Meng Jiang,
Peter Clark,
Ashish Sabharwal
Abstract:
Although counterfactual reasoning is a fundamental aspect of intelligence, the lack of large-scale counterfactual open-domain question-answering (QA) benchmarks makes it difficult to evaluate and improve models on this ability. To address this void, we introduce the first such dataset, named IfQA, where each question is based on a counterfactual presupposition via an "if" clause. For example, if Los Angeles was on the east coast of the U.S., what would be the time difference between Los Angeles and Paris? Such questions require models to go beyond retrieving direct factual knowledge from the Web: they must identify the right information to retrieve and reason about an imagined situation that may even go against the facts built into their parameters. The IfQA dataset contains over 3,800 questions that were annotated by crowdworkers on relevant Wikipedia passages. Empirical analysis reveals that the IfQA dataset is highly challenging for existing open-domain QA methods, including supervised retrieve-then-read pipeline methods (EM score 36.2), as well as recent few-shot approaches such as chain-of-thought prompting with GPT-3 (EM score 27.4). The unique challenges posed by the IfQA benchmark will push open-domain QA research on both retrieval and counterfactual reasoning fronts.
Submitted 23 May, 2023;
originally announced May 2023.
-
RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs
Authors:
Afra Feyza Akyürek,
Ekin Akyürek,
Aman Madaan,
Ashwin Kalyan,
Peter Clark,
Derry Wijaya,
Niket Tandon
Abstract:
Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
Submitted 11 July, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
Functional degrees and arithmetic applications II: The Group-Theoretic Prime Ax-Katz Theorem
Authors:
Pete L. Clark,
Uwe Schauz
Abstract:
We give a version of Ax-Katz's $p$-adic congruences and Moreno-Moreno's $p$-weight refinement that holds over any finite commutative ring of prime characteristic. We deduce this from a purely group-theoretic result that gives a lower bound on the $p$-adic divisibility of the number of simultaneous zeros of a system of maps $f_j: A\to B_j$ from a fixed ``source'' finite commutative group $A$ of exponent $p$ to varying ``target'' finite commutative $p$-groups $B_j$. Our proof combines Wilson's proof of Ax-Katz over $\mathbb{F}_p$ with the functional calculus of Aichinger-Moosbauer.
Submitted 2 May, 2023;
originally announced May 2023.
-
Densities of integer sets represented by quadratic forms
Authors:
Pete L. Clark,
Paul Pollack,
Jeremy Rouse,
Katherine Thompson
Abstract:
Let $f(t_1,\ldots,t_n)$ be a nondegenerate integral quadratic form. We analyze the asymptotic behavior of the function $D_f(X)$, the number of integers of absolute value up to $X$ represented by $f$. When $f$ is isotropic or $n$ is at least $3$, we show that there is a $δ(f) \in \mathbb{Q} \cap (0,1)$ such that $D_f(X) \sim δ(f) X$ and call $δ(f)$ the density of $f$. We consider the inverse problem of which densities arise. Our main technical tool is a Near Hasse Principle: a quadratic form may fail to represent infinitely many integers that it locally represents, but this set of exceptions has density $0$ within the set of locally represented integers.
Submitted 14 April, 2023;
originally announced April 2023.
-
Self-Refine: Iterative Refinement with Self-Feedback
Authors:
Aman Madaan,
Niket Tandon,
Prakhar Gupta,
Skyler Hallinan,
Luyu Gao,
Sarah Wiegreffe,
Uri Alon,
Nouha Dziri,
Shrimai Prabhumoye,
Yiming Yang,
Shashank Gupta,
Bodhisattwa Prasad Majumder,
Katherine Hermann,
Sean Welleck,
Amir Yazdanbakhsh,
Peter Clark
Abstract:
Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
Submitted 25 May, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Multiwavelength observations of the extraordinary accretion event AT2021lwx
Authors:
P. Wiseman,
Y. Wang,
S. Hönig,
N. Castro-Segura,
P. Clark,
C. Frohmaier,
M. D. Fulton,
G. Leloudas,
M. Middleton,
T. E. Müller-Bravo,
A. Mummery,
M. Pursiainen,
S. J. Smartt,
K. Smith,
M. Sullivan,
J. P. Anderson,
J. A. Acosta Pulido,
P. Charalampopoulos,
M. Banerji,
M. Dennefeld,
L. Galbany,
M. Gromadzki,
C. P. Gutiérrez,
N. Ihanec,
E. Kankare
, et al. (21 additional authors not shown)
Abstract:
We present observations from X-ray to mid-infrared wavelengths of the most energetic non-quasar transient ever observed, AT2021lwx. Our data show a single optical brightening by a factor $>100$ to a luminosity of $7\times10^{45}$ erg s$^{-1}$, and a total radiated energy of $1.5\times10^{53}$ erg, both greater than any known optical transient. The decline is smooth and exponential and the ultra-violet - optical spectral energy distribution resembles a black body with temperature $1.2\times10^4$ K. Tentative X-ray detections indicate a secondary mode of emission, while a delayed mid-infrared flare points to the presence of dust surrounding the transient. The spectra are similar to recently discovered optical flares in known active galactic nuclei but lack some characteristic features. The lack of emission for the previous seven years is inconsistent with the short-term, stochastic variability observed in quasars, while the extreme luminosity and long timescale of the transient disfavour the disruption of a single solar-mass star. The luminosity could be generated by the disruption of a much more massive star, but the likelihood of such an event occurring is small. A plausible scenario is the accretion of a giant molecular cloud by a dormant black hole of $10^8 - 10^9$ solar masses. AT2021lwx thus represents an extreme extension of the known scenarios of black hole accretion.
Submitted 31 March, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
Do simulated molecular clouds look like real ones?
Authors:
F. D. Priestley,
P. C. Clark,
A. P. Whitworth
Abstract:
Simulations of molecular clouds often begin from highly idealised initial conditions, such as a uniform-density sphere with an artificially imposed turbulent velocity field. While the resulting structures may appear qualitatively similar to those detected in continuum and line observations, it is unclear whether they are genuinely representative of real molecular clouds. Recent observational work has discovered a tight, often close-to-linear relationship between the integrated intensity of molecular lines and the total column density of the cloud material. We combine magnetohydrodynamical simulations, time-dependent chemistry, and radiative transfer to produce synthetic molecular line observations of model clouds. We find similarly tight correlations between line intensity and column density to those observed, although the linear behaviour is only seen in isolated (as opposed to colliding) model clouds. This linear relationship is not due to optically thin emission; all lines investigated have high optical depths, and the increase in integrated intensity with column density is due to higher velocity dispersion along the line of sight. Overall, the idealised models commonly used in the literature appear to be reasonably accurate representations of real molecular clouds.
Submitted 12 January, 2023;
originally announced January 2023.
-
From dark matter halos to pre-stellar cores: High resolution follow-up of cosmological Lyman-Werner simulations
Authors:
Lewis R. Prole,
Anna T. P. Schauer,
Paul C. Clark,
Simon C. O. Glover,
Felix D. Priestley,
Ralf S. Klessen
Abstract:
Molecular hydrogen allows cooling in primordial gas, facilitating its collapse into Population III stars within primordial halos. Lyman-Werner (LW) radiation from these stars can escape the halo and delay further star formation by destroying H$_2$ in other halos. As cosmological simulations show that increasing the background LW field strength increases the average halo mass required for star formation, we perform follow-up simulations of selected halos to investigate the knock-on effects this has on the Population III IMF. We follow 5 halos for each of the $J_{21}$ = 0, 0.01 and 0.1 LW field strengths, resolving the pre-stellar core density of $10^{-6}$ g cm$^{-3}$ (10$^{18}$ cm$^{-3}$) before inserting sink particles and following the fragmentation behaviour for hundreds of years further. We find that the mass accreted onto sinks by the end of the simulations is proportional to the mass within the $\sim 10^{-2}$ pc molecular core, which is not correlated to the initial mass of the halo. As such, the IMFs for masses above the brown dwarf limit show little dependence on the LW strength, although they do show variance in the number of low-mass clumps formed. As the range of background LW field strengths tested here covers the most likely values from literature, we conclude that the IMF for so-called Pop III.2 stars is not significantly different from the initial population of Pop III.1 stars. The primordial IMF therefore likely remains unchanged until the formation of the next generation of Population II stars.
Submitted 19 January, 2023; v1 submitted 2 January, 2023;
originally announced January 2023.
-
CM Elliptic Curves: Volcanoes, Reality and Applications, Part II
Authors:
Pete L. Clark,
Frederick Saia
Abstract:
Let $M \mid N$ be positive integers, and let $Δ$ be the discriminant of an order in an imaginary quadratic field $K$. When $Δ_K < -4$, the first author determined the fiber of the morphism $X_0(M,N) \rightarrow X(1)$ over the closed point $J_Δ$ corresponding to $Δ$ and showed that all fibers of the map $X_1(M,N) \rightarrow X_0(M,N)$ over $J_Δ$ were connected. Here we complement this prior work by addressing the most difficult cases $Δ_K \in \{-3,-4\}$. These works provide all the information needed to compute, for each positive integer $d$, all subgroups of $E(F)[\operatorname{tors}]$, where $F$ is a number field of degree $d$ and $E_{/F}$ is an elliptic curve with complex multiplication.
Submitted 26 December, 2022;
originally announced December 2022.
-
CM Elliptic Curves: Volcanoes, Reality and Applications
Authors:
Pete L. Clark
Abstract:
For positive integers $M \mid N$ and an order of discriminant $Δ$ in an imaginary quadratic field $K$ with discriminant $Δ_K < -4$, we determine the fiber of the morphism $X_0(M,N) \rightarrow X(1)$ over the closed point $J_Δ$ corresponding to $Δ$. We also show that the fiber of the natural map $X_1(M,N) \rightarrow X_0(M,N)$ over $J_Δ$ is connected. Putting this together, we deduce the number of points in the fiber of $X_1(M,N) \rightarrow X(1)$ over $J_Δ$ and their residual degrees. In the continuation of this work with F. Saia, these results will be extended to $Δ_K \in \{-3,-4\}$. These works provide all the information needed to compute, for each positive integer $d$, all subgroups of $E(F)[\operatorname{tors}]$, where $F$ is a number field of degree $d$ and $E_{/F}$ is an elliptic curve with complex multiplication (CM).
Submitted 26 December, 2022;
originally announced December 2022.
-
Do language models have coherent mental models of everyday things?
Authors:
Yuling Gu,
Bhavana Dalvi Mishra,
Peter Clark
Abstract:
When people think of everyday things like an egg, they typically have a mental image associated with it. This allows them to correctly judge, for example, that "the yolk surrounds the shell" is a false statement. Do language models similarly have a coherent picture of such everyday things? To investigate this, we propose a benchmark dataset consisting of 100 everyday things, their parts, and the relationships between these parts, expressed as 11,720 "X relation Y?" true/false questions. Using these questions as probes, we observe that state-of-the-art pre-trained language models (LMs) like GPT-3 and Macaw have fragments of knowledge about these everyday things, but do not have fully coherent "parts mental models" (54-59% accurate, 19-43% conditional constraint violation). We propose an extension where we add a constraint satisfaction layer on top of the LM's raw predictions to apply commonsense constraints. As well as removing inconsistencies, we find that this also significantly improves accuracy (by 16-20%), suggesting how the incoherence of the LM's pictures of everyday things can be significantly reduced.
Submitted 8 June, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.