
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

Abstract: Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations across task complexity. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Website: https://github.com/allenai/discoverybench

arXiv:2406.06769 [pdf, other]

DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

Authors: Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark

Abstract: Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: www.github.com/allenai/discoveryworld

Submitted 10 June, 2024; originally announced June 2024.

Comments: 9 pages, 4 figures. Preprint, under review

arXiv:2406.06702 [pdf, other]

NEATH III: a molecular line survey of a simulated star-forming cloud

Authors: F. D. Priestley, P. C. Clark, S. C. O. Glover, S. E. Ragan, O. Fehér, L. R. Prole, R. S. Klessen

Abstract: We present synthetic line observations of a simulated molecular cloud, utilising a self-consistent treatment of the dynamics and time-dependent chemical evolution. We investigate line emission from the three most common CO isotopologues ($^{12}$CO, $^{13}$CO, C$^{18}$O) and six supposed tracers of dense gas (NH$_3$, HCN, N$_2$H$^+$, HCO$^+$, CS, HNC). Our simulation produces a range of line intensities consistent with that observed in real molecular clouds. The HCN-to-CO intensity ratio is relatively invariant with column density, making HCN (and chemically-similar species such as CS) a poor tracer of high-density material in the cloud. The ratio of N$_2$H$^+$ to HCN or CO, on the other hand, is highly selective of regions with densities above $10^{22} \, {\rm cm^{-2}}$, and the N$_2$H$^+$ line is a very good tracer of the dynamics of high volume density ($>10^4 \, {\rm cm^{-3}}$) material. Focusing on cores formed within the simulated cloud, we find good agreement with the line intensities of an observational sample of prestellar cores, including reproducing observed CS line intensities with an undepleted elemental abundance of sulphur. However, agreement between cores formed in the simulation, and models of isolated cores which have otherwise-comparable properties, is poor. The formation from and interaction with the large-scale environment has a significant impact on the line emission properties of the cores, making isolated models unsuitable for interpreting observational data.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 14 pages, 14 figures. MNRAS accepted

arXiv:2406.06485 [pdf, other]

Can Language Models Serve as Text-Based World Simulators?

Authors: Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, Peter Jansen

Abstract: Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses and a novel benchmark to track future progress as new models appear.

Submitted 10 June, 2024; originally announced June 2024.

Comments: ACL 2024

arXiv:2406.02334 [pdf, other]

$\textit{Kilonova Seekers}$: the GOTO project for real-time citizen science in time-domain astrophysics

Authors: T. L. Killestein, L. Kelsey, E. Wickens, L. Nuttall, J. Lyman, C. Krawczyk, K. Ackley, M. J. Dyer, F. Jiménez-Ibarra, K. Ulaczyk, D. O'Neill, A. Kumar, D. Steeghs, D. K. Galloway, V. S. Dhillon, P. O'Brien, G. Ramsay, K. Noysena, R. Kotak, R. P. Breton, E. Pallé, D. Pollacco, S. Awiphan, S. Belkin, P. Chote, et al. (29 additional authors not shown)

Abstract: Time-domain astrophysics continues to grow rapidly, with the inception of new surveys drastically increasing data volumes. Democratised, distributed approaches to training sets for machine learning classifiers are crucial to make the most of this torrent of discovery -- with citizen science approaches proving effective at meeting these requirements. In this paper, we describe the creation of and the initial results from the $\textit{Kilonova Seekers}$ citizen science project, built to find transient phenomena from the GOTO telescopes in near real-time. $\textit{Kilonova Seekers}$ launched in July 2023 and received over 600,000 classifications from approximately 2,000 volunteers over the course of the LIGO-Virgo-KAGRA O4a observing run. During this time, the project has yielded 20 discoveries, generated a `gold-standard' training set of 17,682 detections for augmenting deep-learned classifiers, and measured the performance and biases of Zooniverse volunteers on real-bogus classification. This project will continue throughout the lifetime of GOTO, pushing candidates at ever-greater cadence, and directly facilitate the next-generation classification algorithms currently in development.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 20 pages, 15 figures. Submitted to MNRAS

arXiv:2405.19793 [pdf, other]

PDDLEGO: Iterative Planning in Textual Environments

Authors: Li Zhang, Peter Jansen, Tianyi Zhang, Peter Clark, Chris Callison-Burch, Niket Tandon

Abstract: Planning in textual environments has been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially insufficient information to plan for the end-goal. We propose PDDLEGO, which iteratively constructs a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).

Submitted 30 May, 2024; originally announced May 2024.

Comments: In *SEM 2024

arXiv:2405.16337 [pdf, other]

Learning to Reason via Program Generation, Emulation, and Search

Authors: Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

Abstract: Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate their own pseudo-programs, (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at https://github.com/nweir127/CoGEX.

Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

Comments: 16 pages, 10 figures

arXiv:2405.09503 [pdf, other]

Self-consistent modelling of the Milky Way structure using live potentials

Authors: Eva Durán-Camacho, Ana Duarte-Cabral, Alex R. Pettitt, Robin G. Treß, Paul C. Clark, Ralf S. Klessen, Kamran R. J. Bogue, Rowan J. Smith, Mattia C. Sormani

Abstract: To advance our understanding of the evolution of the interstellar medium (ISM) of our Galaxy, numerical models of Milky Way (MW) type galaxies are widely used. However, most models only vaguely resemble the MW (e.g. in total mass), and often use imposed analytic potentials (which cannot evolve dynamically). This poses a problem in asserting their applicability for the interpretation of observations of our own Galaxy. The goal of this work is to identify a numerical model that is not only a MW-type galaxy, but one that can mimic some of the main observed structures of our Galaxy, using dynamically evolving potentials, so that it can be used as a base model to study the ISM cycle in a galaxy like our own. This paper introduces a suite of 15 MW-type galaxy models developed using the AREPO numerical code, which are compared to Galactic observations of $^{12}$CO and H I emission via longitude-velocity plots, from which we extract and compare the skeletons of major galactic features and the terminal gas velocities. We found that our best-fitting model to the overall structure also reproduces some of the more specific observed features of the MW, including a bar with a pattern speed of $30.0 \pm 0.2$ km s$^{-1}$ kpc$^{-1}$ and a bar half-length of $3.2 \pm 0.8$ kpc. Our model shows large streaming motions around spiral arms, and strong radial motions well beyond the inner bar. This model highlights the complex motions of a dynamic MW-type galaxy and has the potential to offer valuable insight into how our Galaxy regulates the ISM and star formation.

Submitted 11 June, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: Accepted for publication in MNRAS. 24 pages, 23 figures, 3 tables

arXiv:2405.00095 [pdf, ps, other]

Assessing the accuracy of the star formation rate measurements by direct star count in molecular clouds

Authors: Sami Dib, Jian Wen Zhou, Sébastien Comerón, Luis E. Garduño, Valery V. Kravtsov, Paul C. Clark, Guang-Xing Li, Maritza A. Lara-López, Tie Liu, Mohsen Shadmehri, James R. Doughty

Abstract: Star formation rate (SFR) estimation based on the counting of YSOs is commonly applied to nearby star-forming regions in the Galaxy. With this method, the SFRs are measured using the counts of YSOs in a particular protostellar Class, a typical protostellar mass, and the lifetime associated with this Class. However, the assumptions underlying the validity of the method, such as that of a constant star formation history (SFH) and whether the method is valid for all protostellar Classes, have never been fully tested. In this work, we use Monte Carlo models to test the validity of the method. We build synthetic clusters in which stars form at times that are randomly drawn from a specified SFH. The latter is either constant or time-dependent with a burst-like behavior. The masses of the protostars are randomly drawn from an IMF which can be either similar to that of the Milky Way field or be variable. For each star in every cluster, the lifetimes associated with the different protostellar classes are also randomly drawn from Gaussian distribution functions centered around their most likely value as suggested by the observations. We find that only the SFR derived using the Class 0 population can reproduce the true SFR at all epochs, and this is true irrespective of the shape of the SFH. For a constant SFH, the SFR derived using the more evolved populations of protostars (Classes I, F, II, and III) reproduces the real SFR only at later epochs, which correspond to epochs at which their numbers have reached a steady state. For a time-dependent burst-like SFH, all SFR estimates based on the number counts of the evolved populations fail to reproduce the true SFR. We also show how the offsets between Class I and Class II based SFRs and the true SFR, plotted as a function of the number ratios of Class I and Class II versus Class III YSOs, can be used to constrain the SFH of observed molecular clouds.

Submitted 30 April, 2024; originally announced May 2024.

Comments: Submitted. Comments are welcome

arXiv:2403.20302 [pdf, other]

I'm in AGNi: A new standard for AGN pluralisation

Authors: Andrew D. Gow, Peter Clark, Dan Rycanowski

Abstract: We present a new standard acronym for Active Galactic Nuclei, finally settling the argument of AGN vs. AGNs. Our new standard is not only etymologically superior (following the consensus set by SNe), but also boasts other linguistic opportunities, connecting strongly with relevant theology and streamlining descriptions of AGN properties.

Submitted 29 March, 2024; originally announced March 2024.

Comments: 4 pages, 3 figures, accepted for publication in Acta Prima Aprilia

arXiv:2403.19269 [pdf, other]

N$_2$H$^+$(1-0) as a tracer of dense gas in and between spiral arms

Authors: O. Feher, S. E. Ragan, F. D. Priestley, P. C. Clark, T. J. T. Moore

Abstract: Recent advances in identifying giant molecular filaments in galactic surveys allow us to study the interstellar material and its dense, potentially star forming phase on scales comparable to resolved extragalactic clouds. Two large filaments detected in the CHIMPS $^{13}$CO(3-2) survey, one in the Sagittarius-arm and one in an inter-arm region, were mapped with dense gas tracers inside a 0.06 deg$^2$ area and with a spatial resolution of around 0.4 and 0.65 pc at the distance of the targets using the IRAM 30m telescope, to investigate the environmental dependence of the dense gas fraction. The N$_2$H$^+$(1-0) transition, an excellent tracer of the dense gas, was detected in parsec-scale, elliptical clumps and with a filling factor of around 8.5% in our maps. The N$_2$H$^+$-emitting areas appear to have higher dense gas fraction (e.g. the ratio of N$_2$H$^+$ and $^{13}$CO emission) in the inter-arm than in the arm which is opposite to the behaviour found by previous studies, using dust emission rather than N$_2$H$^+$ as a tracer of dense gas. However, the arm filament is brighter in $^{13}$CO and the infrared emission of dust, and the dense gas fraction determined as above is governed by the $^{13}$CO brightness. We caution that measurements regarding the distribution and fraction of dense gas on these scales may be influenced by many scale- and environment-dependent factors, as well as the chemistry and excitation of the particular tracers, then consider several scenarios that can reproduce the observed effect.

Submitted 28 March, 2024; originally announced March 2024.

Comments: 18 pages, 9 figures, accepted by MNRAS

arXiv:2403.00092 [pdf, other]

PROC2PDDL: Open-Domain Planning Representations from Texts

Authors: Tianyi Zhang, Li Zhang, Zhaoyi Hou, Ziyu Wang, Yuling Gu, Peter Clark, Chris Callison-Burch, Niket Tandon

Abstract: Planning in a text-based environment continues to be a major challenge for AI systems. Recent approaches have used language models to predict a planning domain definition (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL, the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representations. Using this dataset, we evaluate state-of-the-art models on defining the preconditions and effects of actions. We show that Proc2PDDL is highly challenging, with GPT-3.5's success rate close to 0% and GPT-4's around 35%. Our analysis shows both syntactic and semantic errors, indicating LMs' deficiency in both generating domain-specific programs and reasoning about events. We hope this analysis and dataset help future progress towards integrating the best of LMs and formal planning.

Submitted 2 July, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: In NLRSE 2024, the 2nd Natural Language Reasoning and Structured Explanations Workshop

arXiv:2402.16951 [pdf, other]

The rate of extreme coronal line emitting galaxies in the Sloan Digital Sky Survey and their relation to tidal disruption events

Authors: Joseph Callow, Or Graur, Peter Clark, Antonella Palmese, Jessica Aguilar, Steven Ahlen, Segev BenZvi, David Brooks, Todd Claybaugh, Axel de la Macorra, Peter Doel, Jaime E. Forero-Romero, Enrique Gaztañaga, Satya Gontcho A Gontcho, Andrew Lambert, Martin Landriau, Marc Manera, Aaron Meisner, Ramon Miquel, John Moustakas, Jundan Nie, Claire Poppett, Francisco Prada, Mehdi Rezaie, Graziano Rossi, et al. (5 additional authors not shown)

Abstract: Strong high-ionization iron coronal lines (CLs) are a rare phenomenon observed in galaxy and quasi-stellar object spectra that are thought to be created as a result of tidal disruption event (TDE) flares. To test whether these CLs are the result of TDE activity, we search for extreme coronal line emitting galaxies (ECLEs) in the Sloan Digital Sky Survey (SDSS), measure their rate, and compare it to TDE rates from the literature. We detect sufficiently strong CLs in 14 objects, doubling the number previously found in SDSS. Using follow-up spectra from the Dark Energy Spectroscopic Instrument and Gemini Multi-Object Spectrograph, Wide-field Infrared Survey Explorer mid-infrared observations, and Liverpool Telescope optical photometry, we find that of the seven new objects, only one evolves in a manner consistent with that of the five previously discovered variable ECLEs. Using this new sample of six variable ECLEs, we calculate the galaxy-normalised rate of ECLEs in SDSS to be $R_\mathrm{G}=2.2~^{+1.3}_{-0.8}~(\mathrm{statistical})~^{+0.0}_{-1.3}~(\mathrm{systematic})\times10^{-5}~\mathrm{galaxy}^{-1}~\mathrm{year}^{-1}$. The mass-normalised rate is $R_\mathrm{M}=1.9~^{+1.1}_{-0.7}~(\mathrm{statistical})~^{+0.0}_{-1.1}~(\mathrm{systematic})\times10^{-16}~\mathrm{M_\odot^{-1}}~\mathrm{year}^{-1}$ and the volumetric rate is $R_\mathrm{V}=6.9~^{+5.6}_{-2.1}~(\mathrm{statistical})~^{+0.0}_{-3.9}~(\mathrm{systematic})\times10^{-8}~\mathrm{Mpc}^{-3}~\mathrm{year}^{-1}$. Our rates are comparable to TDE rates from the literature, supporting the suggestion that the CLs in variable ECLEs are the product of TDEs.

Submitted 26 February, 2024; originally announced February 2024.

Comments: Submitted to MNRAS. 19 pages, 12 figures

arXiv:2402.14798 [pdf, other]

Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic

Authors: Nathaniel Weir, Kate Sanders, Orion Weller, Shreya Sharma, Dongwei Jiang, Zhengping Jiang, Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Jansen, Peter Clark, Benjamin Van Durme

Abstract: Contemporary language models enable new opportunities for structured reasoning with text, such as the construction and evaluation of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment datasets, and evaluate its impact on LLM-based textual inference. We find that our resulting dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in a modern neuro-symbolic reasoning engine significantly improves results (both accuracy and proof quality) over other entailment classifier baselines, illustrating the practical benefit of this advance for textual inference.

Submitted 27 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.13610 [pdf, other]

Data-driven Discovery with Large Generative Models

Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark

Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.03244 [pdf, other]

Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills

Authors: Kolby Nottingham, Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Sameer Singh, Peter Clark, Roy Fox

Abstract: Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO's ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.

Submitted 22 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

arXiv:2401.06751 [pdf, other]

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Authors: Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe

Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear classifier heads, and QLoRA for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization

Submitted 5 June, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

Comments: ACL 2024. 23 pages, 20 figures

arXiv:2312.07527 [pdf, other]

BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability

Authors: Peter Clark, Bhavana Dalvi Mishra, Oyvind Tafjord

Abstract: While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and using a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.

Submitted 23 March, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: Added note about how dataset sampling was performed

arXiv:2312.06769 [pdf, other]

Heavy Black Hole Seed Formation in High-z Atomic Cooling Halos

Authors: Lewis R. Prole, John A. Regan, Simon C. O. Glover, Ralf S. Klessen, Felix D. Priestley, Paul C. Clark

Abstract: Halos with masses in excess of the atomic limit are believed to be ideal environments in which to form heavy black hole seeds with masses above 10^3 Msun. In cases where the H_2 fraction is suppressed this is expected to lead to reduced fragmentation of the gas and the generation of a top heavy initial mass function. In extreme cases this can result in the formation of massive black hole seeds. Resolving the initial fragmentation scale and the resulting protostellar masses has, until now, not been robustly tested. Cosmological simulations were performed with the moving mesh code Arepo using a primordial chemistry network until z = 11. Three haloes with masses in excess of the atomic cooling mass were then selected for detailed examination via zoom-ins. The highest resolution simulations resolve densities up to 10^-6 g cm^-3 (10^18 cm^-3) and capture a further 100 yr of fragmentation behaviour at the center of the halo. Our simulations show intense fragmentation in the central region of the halos, leading to a large number of near-solar mass protostars. Despite the increased fragmentation the halos produce a protostellar mass spectrum that peaks at higher masses relative to standard Population III star forming halos. The most massive protostars have accretion rates of 10^-3-10^-1 Msun yr^-1 after the first 100 years of evolution, while the total mass of the central region grows at 1 Msun yr^-1.
Lower resolution zoom-ins show that the total mass of the system continues to accrete at 1 Msun yr^-1 for at least 10^4 yr, although how this mass is distributed amongst the rapidly growing number of protostars is unclear. However, assuming that a fraction of stars can continue to accrete rapidly the formation of a sub-population of stars with masses in excess of 10^3 Msun is likely in these halos.

Submitted 11 December, 2023; originally announced December 2023.

Comments: Submitted to A&A, comments welcome

arXiv:2312.03842 [pdf, other]

Light-Curve Structure and Halpha Line Formation in the Tidal Disruption Event AT 2019azh

Authors: Sara Faris, Iair Arcavi, Lydia Makrygianni, Daichi Hiramatsu, Giacomo Terreran, Joseph Farah, D. Andrew Howell, Curtis McCully, Megan Newsome, Estefania Padilla Gonzalez, Craig Pellegrino, K. Azalee Bostroem, Wiam Abojanb, Marco C. Lam, Lina Tomasella, Thomas G. Brink, Alexei V. Filippenko, K. Decker French, Peter Clark, Or Graur, Giorgos Leloudas, Mariusz Gromadzki, Joseph P. Anderson, Matt Nicholl, Claudia P. Gutierrez, et al. (11 additional authors not shown)

Abstract: AT 2019azh is a H+He tidal disruption event (TDE) with one of the most extensive ultraviolet and optical datasets available to date. We present our photometric and spectroscopic observations of this event starting several weeks before and out to approximately two years after g-band peak brightness and combine them with public photometric data. This extensive dataset robustly reveals a change in the light-curve slope and a bump in the rising light curve of a TDE for the first time, which may indicate more than one dominant emission mechanism contributing to the pre-peak light curve. We further confirm the relation seen in previous TDEs whereby the redder emission peaks later than the bluer emission. The post-peak bolometric light curve of AT 2019azh is better described by an exponential decline than by the canonical t^{-5/3} (and in fact any) power-law decline. We find a possible mid-infrared excess around peak optical luminosity, but cannot determine its origin. In addition, we provide the earliest measurements of the Halpha emission-line evolution and find no significant time delay between the peak of the V-band light curve and that of the Halpha luminosity. These results can be used to constrain future models of TDE line formation and emission mechanisms in general. More pre-peak 1-2 day cadence observations of TDEs are required to determine whether the characteristics observed here are common among TDEs. More importantly, detailed emission models are needed to fully exploit such observations for understanding the emission physics of TDEs.

Submitted 6 December, 2023; originally announced December 2023.

Comments: Submitted to ApJ

arXiv:2311.13981 [pdf, other]

Overview of the distributed image processing infrastructure to produce the Legacy Survey of Space and Time

Authors: Fabio Hernandez, George Beckett, Peter Clark, Matt Doidge, Tim Jenness, Edward Karavakis, Quentin Le Boulc'h, Peter Love, Gabriele Mainetti, Timothy Noble, Brandon White, Wei Yang

Abstract: The Vera C. Rubin Observatory is preparing to execute the most ambitious astronomical survey ever attempted, the Legacy Survey of Space and Time (LSST). Currently the final phase of construction is under way in the Chilean Andes, with the Observatory's ten-year science mission scheduled to begin in 2025. Rubin's 8.4-meter telescope will nightly scan the southern hemisphere collecting imagery in the wavelength range 320-1050 nm covering the entire observable sky every 4 nights using a 3.2 gigapixel camera, the largest imaging device ever built for astronomy. Automated detection and classification of celestial objects will be performed by sophisticated algorithms on high-resolution images to progressively produce an astronomical catalog eventually composed of 20 billion galaxies and 17 billion stars and their associated physical properties. In this article we present an overview of the system currently being constructed to perform data distribution as well as the annual campaigns which reprocess the entire image dataset collected since the beginning of the survey. These processing campaigns will utilize computing and storage resources provided by three Rubin data facilities (one in the US and two in Europe). Each year a Data Release will be produced and disseminated to science collaborations for use in studies comprising four main science pillars: probing dark matter and dark energy, taking inventory of solar system objects, exploring the transient optical sky and mapping the Milky Way.
Also presented is the method by which we leverage some of the common tools and best practices used for management of large-scale distributed data processing projects in the high energy physics and astronomy communities. We also demonstrate how these tools and practices are utilized within the Rubin project in order to overcome the specific challenges faced by the Observatory.

Submitted 23 November, 2023; originally announced November 2023.

Comments: 8 pages, 2 figures, 26th International Conference on Computing in High Energy & Nuclear Physics

arXiv:2311.10527 [pdf, other]

Functional degrees and arithmetic applications III: Beyond Prime Exponent

Authors: Pete L. Clark, Uwe Schauz

Abstract: Continuing our work on group-theoretic generalizations of the prime Ax-Katz Theorem, we give a lower bound on the $p$-adic divisibility of the cardinality of the set of simultaneous zeros $Z(f_1,f_2,\ldots,f_r)$ of $r$ maps $f_j:A\rightarrow B_j$ between arbitrary finite commutative groups $A$ and $B_j$ in terms of the invariant factors of $A, B_1,B_2,\dotsc,B_r$ and the \emph{functional degrees} of the maps $f_1,f_2,\dotsc,f_r$.

Submitted 4 July, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: 27 pages

MSC Class: 20K01; 13F20; 20C05

arXiv:2311.09613 [pdf, other]

Digital Socrates: Evaluating LLMs through Explanation Critiques

Authors: Yuling Gu, Oyvind Tafjord, Peter Clark

Abstract: While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing - identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critique model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.

Submitted 16 February, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.09519 [pdf, other]

Leveraging Code to Improve In-context Learning for Semantic Parsing

Authors: Ben Bogin, Shivanshu Gupta, Peter Clark, Ashish Sabharwal

Abstract: In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we improve the effectiveness of ICL for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs, and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets. Combined, they lead to dramatic improvements (e.g. 7.9% to 66.5% on SMCalFlow compositional split), nearly closing the performance gap between easier i.i.d. and harder compositional splits when used with a strong model, and reducing the need for a large number of demonstrations. We find that the resemblance of the target parse language to general-purpose code is a more important factor than the language's popularity in pre-training corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.

Submitted 27 March, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: Accepted to NAACL 2024

arXiv:2311.09510 [pdf, other]

Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization

Authors: Yash Kumar Lal, Li Zhang, Faeze Brahman, Bodhisattwa Prasad Majumder, Peter Clark, Niket Tandon

Abstract: How-to procedures, such as how to plant a garden, are now used by millions of users, but sometimes need customizing to meet a user's specific needs, e.g., planting a garden without pesticides. Our goal is to measure and improve an LLM's ability to perform such customization. Our approach is to test several simple multi-LLM-agent architectures for customization, as well as an end-to-end LLM, using a new evaluation set, called CustomPlans, of over 200 WikiHow procedures each with a customization need. We find that a simple architecture with two LLM agents used sequentially performs best, one that edits a generic how-to procedure and one that verifies its executability, significantly outperforming (10.5% absolute) an end-to-end prompted LLM. This suggests that LLMs can be configured reasonably effectively for procedure customization. This also suggests that multi-agent editing architectures may be worth exploring further for other customization applications (e.g. coding, creative writing) in the future.

Submitted 30 May, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: Camera ready version accepted to Findings of ACL 2024

arXiv:2311.08602 [pdf, other]

doi 10.3390/aerospace10110960

Data downloaded via parachute from a NASA super-pressure balloon

Authors: Ellen L. Sirks, Richard Massey, Ajay S. Gill, Jason Anderson, Steven J. Benton, Anthony M. Brown, Paul Clark, Joshua English, Spencer W. Everett, Aurelien A. Fraisse, Hugo Franco, John W. Hartley, David Harvey, Bradley Holder, Andrew Hunter, Eric M. Huff, Andrew Hynous, Mathilde Jauzac, William C. Jones, Nikky Joyce, Duncan Kennedy, David Lagattuta, Jason S. -Y. Leung, Lun Li, Stephen Lishman, et al. (18 additional authors not shown)

Abstract: In April to May 2023, the superBIT telescope was lifted to the Earth's stratosphere by a helium-filled super-pressure balloon, to acquire astronomical imaging from above (99.5% of) the Earth's atmosphere. It was launched from New Zealand then, for 40 days, circumnavigated the globe five times at a latitude 40 to 50 degrees South. Attached to the telescope were four 'DRS' (Data Recovery System) capsules containing 5 TB solid state data storage, plus a GNSS receiver, Iridium transmitter, and parachute. Data from the telescope were copied to these, and two were dropped over Argentina. They drifted 61 km horizontally while they descended 32 km, but we predicted their descent vectors within 2.4 km: in this location, the discrepancy appears irreducible below 2 km because of high speed, gusty winds and local topography. The capsules then reported their own locations to within a few metres. We recovered the capsules and successfully retrieved all of superBIT's data - despite the telescope itself being later destroyed on landing.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 12 pages

Journal ref: Aerospace 2023, 10, 960

arXiv:2311.05772 [pdf, other]

ADaPT: As-Needed Decomposition and Planning with Language Models

Authors: Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, Tushar Khot

Abstract: Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment. Recent works employ LLMs-as-agents in broadly two ways: iteratively determining the next action (iterative executors) or generating plans and executing sub-tasks using LLMs (plan-and-execute). However, these methods struggle with task complexity, as the inability to execute any sub-task may lead to task failure. To address these shortcomings, we introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT), an approach that explicitly plans and decomposes complex sub-tasks as-needed, i.e., when the LLM is unable to execute them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity and LLM capability. Our results demonstrate that ADaPT substantially outperforms established strong baselines, achieving success rates up to 28.3% higher in ALFWorld, 27% in WebShop, and 33% in TextCraft -- a novel compositional dataset that we introduce. Through extensive analysis, we illustrate the importance of multilevel decomposition and establish that ADaPT dynamically adjusts to the capabilities of the executor LLM as well as to task complexity.

Submitted 8 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: NAACL 2024 (findings) camera-ready. Project Page: https://allenai.github.io/adaptllm

arXiv:2311.04892 [pdf, other]

Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

Authors: Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot

Abstract: Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets.
Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.

Submitted 27 January, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: Project page: https://allenai.github.io/persona-bias. Paper to appear at ICLR 2024. Added results for other LLMs in v2 (similar findings)

arXiv:2311.02807 [pdf, other]

QualEval: Qualitative Evaluation for Model Improvement

Authors: Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, Ashwin Kalyan

Abstract: Quantitative evaluation metrics have traditionally been pivotal in gauging the advancements of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar to quantify and compare is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a way to compare and benchmark models, and do not yield actionable diagnostics, thus making the model improvement process challenging. Model developers find themselves amid extensive manual efforts involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace of model development, thus in essence serving as a data-scientist-in-a-box.
Given the focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new technique for both model evaluation and improvement.

Submitted 5 May, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

Comments: NAACL 2024

arXiv:2310.10730 [pdf, other]

doi 10.21105/astro.2310.10730

Population III star formation: multiple gas phases prevent the use of an equation of state at high densities

Authors: Lewis R. Prole, Paul C. Clark, Felix D. Priestley, Simon C. O. Glover, John A. Regan

Abstract: Advanced primordial chemistry networks have been developed to model the collapse of metal-free baryonic gas within the gravitational well of dark matter (DM) halos and its subsequent collapse into Population III stars. At the low densities of 10^-26-10^-21 g cm^-3 (10^-3-10^2 cm^-3) the collapse is dependent on H2 production, which is a function of the compressional heating provided by the DM potential. Once the gas decouples from the DM, the temperature-density relationship follows a well established path dictated by various chemical reactions until the formation of the protostar at 10^-4 g cm^-3 (10^19 cm^-3). Here we explore the feasibility of replacing the chemical network (CN) with a barotropic equation of state (EoS) just before the formation of the first protostar, to reduce the computational load of simulating the further fragmentation, evolution and characteristics of the very high density gas. We find a significant reduction in fragmentation when using the EoS. The EoS method produces a protostellar mass distribution that peaks at higher masses when compared to CN runs. The change in fragmentation behaviour is due to a lack of cold gas falling in through the disc around the first protostar when using an EoS. Despite this, the total mass accreted across all sinks was invariant to the switch to an EoS, hence the star formation rate (Msun yr^-1) is accurately predicted using an EoS.
The EoS routine is approximately 4000 times faster than the CN, however this numerical gain is offset by the lack of accuracy in modelling secondary protostar formation and hence its use must be considered carefully.

Submitted 19 January, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: Accepted in the Open Journal of Astrophysics

arXiv:2310.10134 [pdf, other]

CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization

Authors: Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark

Abstract: Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory centered on causal abstractions (rather than general "helpful hints") that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time.

Submitted 16 October, 2023; originally announced October 2023.

Comments: Project page: https://allenai.github.io/clin/

arXiv:2310.06037 [pdf, other]

NEATH II: N$_2$H$^+$ as a tracer of imminent star formation in quiescent high-density gas

Authors: F. D. Priestley, P. C. Clark, S. C. O. Glover, S. E. Ragan, O. Fehér, L. R. Prole, R. S. Klessen

Abstract: Star formation activity in molecular clouds is often found to be correlated with the amount of material above a column density threshold of $\sim 10^{22} \, {\rm cm^{-2}}$. Attempts to connect this column density threshold to a ${\it volume}$ density above which star formation can occur are limited by the fact that the volume density of gas is difficult to reliably measure from observations. We post-process hydrodynamical simulations of molecular clouds with a time-dependent chemical network, and investigate the connection between commonly-observed molecular species and star formation activity. We find that many molecules widely assumed to specifically trace the dense, star-forming component of molecular clouds (e.g. HCN, HCO$^+$, CS) actually also exist in substantial quantities in material only transiently enhanced in density, which will eventually return to a more diffuse state without forming any stars. By contrast, N$_2$H$^+$ only exists in detectable quantities above a volume density of $10^4 \, {\rm cm^{-3}}$, the point at which CO, which reacts destructively with N$_2$H$^+$, begins to deplete out of the gas phase onto grain surfaces. This density threshold for detectable quantities of N$_2$H$^+$ corresponds very closely to the volume density at which gas becomes irreversibly gravitationally bound in the simulations: the material traced by N$_2$H$^+$ never reverts to lower densities, and quiescent regions of molecular clouds with visible N$_2$H$^+$ emission are destined to eventually form stars. The N$_2$H$^+$ line intensity is likely to directly correlate with the star formation rate averaged over timescales of around a Myr.

Submitted 9 October, 2023; originally announced October 2023.

Comments: 10 pages, 10 figures. MNRAS accepted

arXiv:2309.11340 [pdf, other]

GW190425: Pan-STARRS and ATLAS coverage of the skymap and limits on optical emission associated with FRB190425

Authors: S. J. Smartt, M. Nicholl, S. Srivastav, M. E. Huber, K. C. Chambers, K. W. Smith, D. R. Young, M. D. Fulton, J. L. Tonry, C. W. Stubbs, L. Denneau, A. J. Cooper, A. Aamer, J. P. Anderson, A. Andersson, J. Bulger, T.-W. Chen, P. Clark, T. de Boer, H. Gao, J. H. Gillanders, A. Lawrence, C. C. Lin, T. B. Lowe, E. A. Magnier, et al. (10 additional authors not shown)

Abstract: GW190425 is the second of only two binary neutron star (BNS) merger events to be significantly detected by the LIGO-Virgo-KAGRA gravitational wave detectors. With a detection only in LIGO Livingston, the skymap containing the source was large and no plausible electromagnetic counterpart was found in real time searching in 2019. Here we summarise our ATLAS and Pan-STARRS wide-field optical coverage of the skymap beginning within 1 hour and 3 hours respectively of the GW190425 merger time. More recently, a potential coincidence between GW190425 and a fast radio burst FRB 190425 has been suggested, given their spatial and temporal coincidence. The smaller sky localisation area of FRB 190425 and its dispersion measure have led to the identification of a likely host galaxy, UGC 10667 at a distance of 141 +/- 10 Mpc. Our optical imaging covered the galaxy 6.0 hrs after GW190425 was detected and 3.5 hrs after FRB 190425. No optical emission was detected and further imaging at +1.2 and +13.2 days also revealed no emission. If the FRB 190425 and GW190425 association were real, we highlight our limits on kilonova emission from a BNS merger in UGC 10667. The model for producing FRB 190425 from a BNS merger involves a supramassive magnetised neutron star spinning down by dipole emission on the timescale of hours. We show that magnetar enhanced kilonova emission is ruled out by optical upper limits. The lack of detected optical emission from a kilonova in UGC 10667 disfavours, but does not disprove, the FRB-GW link for this source.

Submitted 20 September, 2023; originally announced September 2023.

Comments: Submitted to MNRAS, 20th Sept 2023, 9 pages

arXiv:2307.13072 [pdf, other]

doi 10.1093/mnras/stad2278

Non-Equilibrium Abundances Treated Holistically (NEATH): the molecular composition of star-forming clouds

Authors: F. D. Priestley, P. C. Clark, S. C. O. Glover, S. E. Ragan, O. Fehér, L. R. Prole, R. S. Klessen

Abstract: Much of what we know about molecular clouds, and by extension star formation, comes from molecular line observations. Interpreting these correctly requires knowledge of the underlying molecular abundances. Simulations of molecular clouds typically only model species that are important for the gas thermodynamics, which tend to be poor tracers of the denser material where stars form. We construct a framework for post-processing these simulations with a full time-dependent chemical network, allowing us to model the behaviour of observationally-important species not present in the reduced network used for the thermodynamics. We use this to investigate the chemical evolution of molecular gas under realistic physical conditions. We find that molecules can be divided into those which reach peak abundances at moderate densities ($10^3 \, {\rm cm^{-3}}$) and decline sharply thereafter (such as CO and HCN), and those which peak at higher densities and then remain roughly constant (e.g. NH$_3$, N$_2$H$^+$). Evolving the chemistry with physical properties held constant at their final values results in a significant overestimation of gas-phase abundances for all molecules, and does not capture the drastic variations in abundance caused by different evolutionary histories. The dynamical evolution of molecular gas cannot be neglected when modelling its chemistry.

Submitted 24 July, 2023; originally announced July 2023.

Comments: 14 pages, 13 figures. MNRAS accepted

arXiv:2307.03295 [pdf, other]

Lensing in the Blue II: Estimating the Sensitivity of Stratospheric Balloons to Weak Gravitational Lensing

Authors: Jacqueline E. McCleary, Spencer W. Everett, Mohamed M. Shaaban, Ajay S. Gill, Georgios N. Vassilakis, Eric M. Huff, Richard J. Massey, Steven J. Benton, Anthony M. Brown, Paul Clark, Bradley Holder, Aurelien A. Fraisse, Mathilde Jauzac, William C. Jones, David Lagattuta, Jason S.-Y. Leung, Lun Li, Thuy Vy T. Luu, Johanna M. Nagy, C. Barth Netterfield, Emaad Paracha, Susan F. Redmond, Jason D. Rhodes, Jürgen Schmoll, Ellen Sirks, et al. (1 additional author not shown)

Abstract: The Superpressure Balloon-borne Imaging Telescope (SuperBIT) is a diffraction-limited, wide-field, 0.5 m, near-infrared to near-ultraviolet observatory designed to exploit the stratosphere's space-like conditions. SuperBIT's 2023 science flight will deliver deep, blue imaging of galaxy clusters for gravitational lensing analysis. In preparation, we have developed a weak lensing measurement pipeline with modern algorithms for PSF characterization, shape measurement, and shear calibration. We validate our pipeline and forecast SuperBIT survey properties with simulated galaxy cluster observations in SuperBIT's near-UV and blue bandpasses. We predict imaging depth, galaxy number (source) density, and redshift distribution for observations in SuperBIT's three bluest filters; the effect of lensing sample selections is also considered. We find that in three hours of on-sky integration, SuperBIT can attain a depth of b = 26 mag and a total source density exceeding 40 galaxies per square arcminute. Even with the application of lensing-analysis catalog selections, we find b-band source densities between 25 and 30 galaxies per square arcminute with a median redshift of z = 1.1. Our analysis confirms SuperBIT's capability for weak gravitational lensing measurements in the blue.

Submitted 6 July, 2023; originally announced July 2023.

Comments: Submitted to Astronomical Journal

arXiv:2307.03182 [pdf, other]

doi 10.1093/mnras/stae460

Long-term follow-up observations of extreme coronal line emitting galaxies

Authors: Peter Clark, Or Graur, Joseph Callow, Jessica Aguilar, Steven Ahlen, Joseph P. Anderson, Edo Berger, Thomas Brink, David Brooks, Ting-Wan Chen, Todd Claybaugh, Axel de la Macorra, Peter Doel, Alexei Filippenko, Jamie Forero-Romero, Sebastian Gomez, Mariusz Gromadzki, Klaus Honscheid, Cosimo Inserra, Theodore Kisner, Martin Landriau, Lydia Makrygianni, Marc Manera, Aaron Meisner, Ramon Miquel, et al. (18 additional authors not shown)

Abstract: We present new spectroscopic and photometric follow-up observations of the known sample of extreme coronal line emitting galaxies (ECLEs) identified in the Sloan Digital Sky Survey (SDSS). With these new data, observations of the ECLE sample now span a period of two decades following their initial SDSS detections. We confirm the nonrecurrence of the iron coronal line signatures in five of the seven objects, further supporting their identification as the transient light echoes of tidal disruption events (TDEs). Photometric observations of these objects in optical bands show little overall evolution. In contrast, mid-infrared (MIR) observations show ongoing long-term declines. The remaining two objects had been classified as active galactic nuclei (AGN) with unusually strong coronal lines rather than being TDE related, given the persistence of the coronal lines in earlier follow-up spectra. We confirm this classification, with our spectra continuing to show the presence of strong, unchanged coronal-line features and AGN-like MIR colours and behaviour. We have constructed spectral templates of both subtypes of ECLE to aid in distinguishing the likely origin of newly discovered ECLEs. We highlight the need for higher cadence, and more rapid, follow-up observations of such objects to better constrain their properties and evolution. We also discuss the relationships between ECLEs, TDEs, and other identified transients having significant MIR variability.

Submitted 4 March, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: This is a pre-copyedited, author-produced PDF of an article accepted for publication in Monthly Notices of the Royal Astronomical Society following peer review. Note the corrected caption of Figure 1 continued, which in this version correctly refers to 'SDSS J124' rather than the erroneous 'SDSS J1341' in the published version. 29 Pages, 14 Figures

Journal ref: Monthly Notices of the Royal Astronomical Society, Volume 528, Issue 4, March 2024, Pages 7076-7102

arXiv:2305.14596 [pdf, other]

Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy

Authors: Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal

Abstract: When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as "bath" and "bathtub") is thought to cause an underestimation of a model's true performance, referred to as the "surface form competition" (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC? Are there direct ways of reducing it, and does doing so improve task performance? We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time. We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example. We show this method eliminates the impact of SFC in the majority of instances. Our experiments on three diverse datasets and six LMs reveal several additional surprising findings. For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs. We conclude with practical insights for effectively prompting LMs for multiple-choice tasks.

Submitted 31 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: EMNLP 2023
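
The surface form competition idea in the abstract above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's formalism: the next-token distribution, the answer choices, and the helper names `answer_choice_mass` and `pick_choice` are all hypothetical, chosen to show how a synonym outside the answer set ("bath") can draw probability away from a listed choice ("bathtub").

```python
from typing import Dict, List


def answer_choice_mass(next_token_probs: Dict[str, float], choices: List[str]) -> float:
    """Total probability the model places on the listed answer choices."""
    return sum(next_token_probs.get(choice, 0.0) for choice in choices)


def pick_choice(next_token_probs: Dict[str, float], choices: List[str]) -> str:
    """Select the listed choice with the highest raw probability."""
    return max(choices, key=lambda choice: next_token_probs.get(choice, 0.0))


# Hypothetical distribution: "bath" is a synonym of "bathtub" but is not a listed choice.
probs = {"bathtub": 0.30, "bath": 0.25, "sink": 0.35, "the": 0.10}
choices = ["bathtub", "sink"]

mass = answer_choice_mass(probs, choices)  # share of probability on listed choices
best = pick_choice(probs, choices)         # choice selected from raw probabilities
print(f"mass on choices = {mass:.2f}, selected = {best}")
```

In this made-up case the selected choice is "sink", although "bath" and "bathtub" together carry more probability than "sink" does. That is the kind of underestimation the SFC hypothesis describes.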

arXiv:2305.14386 [pdf, other]

Let GPT be a Math Tutor: Teaching Math Word Problem Solvers with Customized Exercise Generation

Authors: Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, Ashwin Kalyan

Abstract: In this paper, we present a novel approach for distilling math word problem solving capabilities from large language models (LLMs) into smaller, more efficient student models. Our approach is designed to consider the student model's weaknesses and foster a tailored learning experience by generating targeted exercises aligned with educational science principles, such as knowledge tracing and personalized learning. Concretely, we let GPT-3 be a math tutor and run two steps iteratively: 1) assessing the student model's current learning status on a GPT-generated exercise book, and 2) improving the student model by training it with tailored exercise samples generated by GPT-3. Experimental results reveal that our approach outperforms LLMs (e.g., GPT-3 and PaLM) in accuracy across three distinct benchmarks while employing significantly fewer parameters. Furthermore, we provide a comprehensive analysis of the various components within our methodology to substantiate their efficacy.

Submitted 22 May, 2023; originally announced May 2023.

arXiv:2305.14250 [pdf, other]

Language Models with Rationality

Authors: Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schuetze, Peter Clark

Abstract: While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent "beliefs". This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a rational, self-reflecting layer on top of the LLM. First, given a question, we construct a belief graph using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.

Submitted 29 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.14010 [pdf, other]

IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions

Authors: Wenhao Yu, Meng Jiang, Peter Clark, Ashish Sabharwal

Abstract: Although counterfactual reasoning is a fundamental aspect of intelligence, the lack of large-scale counterfactual open-domain question-answering (QA) benchmarks makes it difficult to evaluate and improve models on this ability. To address this void, we introduce the first such dataset, named IfQA, where each question is based on a counterfactual presupposition via an "if" clause. For example, if Los Angeles was on the east coast of the U.S., what would be the time difference between Los Angeles and Paris? Such questions require models to go beyond retrieving direct factual knowledge from the Web: they must identify the right information to retrieve and reason about an imagined situation that may even go against the facts built into their parameters. The IfQA dataset contains over 3,800 questions that were annotated by crowdworkers on relevant Wikipedia passages. Empirical analysis reveals that the IfQA dataset is highly challenging for existing open-domain QA methods, including supervised retrieve-then-read pipeline methods (EM score 36.2), as well as recent few-shot approaches such as chain-of-thought prompting with GPT-3 (EM score 27.4). The unique challenges posed by the IfQA benchmark will push open-domain QA research on both retrieval and counterfactual reasoning fronts.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.08844 [pdf, other]

RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Authors: Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, Niket Tandon

Abstract: Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.

Submitted 11 July, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: ACL 2023

arXiv:2305.01304 [pdf, ps, other]

Functional degrees and arithmetic applications II: The Group-Theoretic Prime Ax-Katz Theorem

Authors: Pete L. Clark, Uwe Schauz

Abstract: We give a version of Ax-Katz's $p$-adic congruences and Moreno-Moreno's $p$-weight refinement that holds over any finite commutative ring of prime characteristic. We deduce this from a purely group-theoretic result that gives a lower bound on the $p$-adic divisibility of the number of simultaneous zeros of a system of maps $f_j: A\to B_j$ from a fixed ``source'' finite commutative group $A$ of exponent $p$ to varying ``target'' finite commutative $p$-groups $B_j$. Our proof combines Wilson's proof of Ax-Katz over $\mathbb{F}_p$ with the functional calculus of Aichinger-Moosbauer.

Submitted 2 May, 2023; originally announced May 2023.

Comments: 29 pages

MSC Class: 20K01; 13F20; 20C05

arXiv:2304.07399 [pdf, ps, other]

Densities of integer sets represented by quadratic forms

Authors: Pete L. Clark, Paul Pollack, Jeremy Rouse, Katherine Thompson

Abstract: Let $f(t_1,\ldots,t_n)$ be a nondegenerate integral quadratic form. We analyze the asymptotic behavior of the function $D_f(X)$, the number of integers of absolute value up to $X$ represented by $f$. When $f$ is isotropic or $n$ is at least $3$, we show that there is a $δ(f) \in \mathbb{Q} \cap (0,1)$ such that $D_f(X) \sim δ(f) X$ and call $δ(f)$ the density of $f$. We consider the inverse problem of which densities arise. Our main technical tool is a Near Hasse Principle: a quadratic form may fail to represent infinitely many integers that it locally represents, but this set of exceptions has density $0$ within the set of locally represented integers.

Submitted 14 April, 2023; originally announced April 2023.

Comments: 25 pages

MSC Class: Primary 11E12; Secondary 11E20

arXiv:2303.17651 [pdf, other]

Self-Refine: Iterative Refinement with Self-Feedback

Authors: Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark

Abstract: Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.

Submitted 25 May, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: Code, data, and demo at https://selfrefine.info/
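
The generate, feedback, refine loop described in the Self-Refine abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `self_refine`, its parameters, the `stop_marker` convention, and the toy stand-in callables are all hypothetical, and a real setup would route all three roles to the same LLM.

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    feedback: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_iters: int = 3,
    stop_marker: str = "LOOKS GOOD",  # hypothetical stopping convention
) -> str:
    """Iteratively improve an output using feedback from the same model."""
    output = generate(task)
    for _ in range(max_iters):
        critique = feedback(task, output)
        if stop_marker in critique:
            break  # the feedback step reports nothing left to fix
        output = refine(task, output, critique)
    return output


# Toy stand-ins for the three roles (assumptions for illustration only).
def toy_generate(task: str) -> str:
    return "draft"


def toy_feedback(task: str, output: str) -> str:
    return "LOOKS GOOD" if output.endswith("!!") else "add emphasis"


def toy_refine(task: str, output: str, critique: str) -> str:
    return output + "!"


result = self_refine("write a greeting", toy_generate, toy_feedback, toy_refine)
print(result)
```

Capping the loop with `max_iters` is a design choice made here so that the sketch terminates even if the feedback step never signals completion.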

arXiv:2303.04412 [pdf, other]

doi 10.1093/mnras/stad1000

Multiwavelength observations of the extraordinary accretion event AT2021lwx

Authors: P. Wiseman, Y. Wang, S. Hönig, N. Castro-Segura, P. Clark, C. Frohmaier, M. D. Fulton, G. Leloudas, M. Middleton, T. E. Müller-Bravo, A. Mummery, M. Pursiainen, S. J. Smartt, K. Smith, M. Sullivan, J. P. Anderson, J. A. Acosta Pulido, P. Charalampopoulos, M. Banerji, M. Dennefeld, L. Galbany, M. Gromadzki, C. P. Gutiérrez, N. Ihanec, E. Kankare, et al. (21 additional authors not shown)

Abstract: We present observations from X-ray to mid-infrared wavelengths of the most energetic non-quasar transient ever observed, AT2021lwx. Our data show a single optical brightening by a factor $>100$ to a luminosity of $7\times10^{45}$ erg s$^{-1}$, and a total radiated energy of $1.5\times10^{53}$ erg, both greater than any known optical transient. The decline is smooth and exponential and the ultra-violet - optical spectral energy distribution resembles a black body with temperature $1.2\times10^4$ K. Tentative X-ray detections indicate a secondary mode of emission, while a delayed mid-infrared flare points to the presence of dust surrounding the transient. The spectra are similar to recently discovered optical flares in known active galactic nuclei but lack some characteristic features. The lack of emission for the previous seven years is inconsistent with the short-term, stochastic variability observed in quasars, while the extreme luminosity and long timescale of the transient disfavour the disruption of a single solar-mass star. The luminosity could be generated by the disruption of a much more massive star, but the likelihood of such an event occurring is small. A plausible scenario is the accretion of a giant molecular cloud by a dormant black hole of $10^8 - 10^9$ solar masses. AT2021lwx thus represents an extreme extension of the known scenarios of black hole accretion.

Submitted 31 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: 11 pages, 5 figures, Accepted for publication in MNRAS

arXiv:2301.05245 [pdf, other]

doi 10.1093/mnras/stad150

Do simulated molecular clouds look like real ones?

Authors: F. D. Priestley, P. C. Clark, A. P. Whitworth

Abstract: Simulations of molecular clouds often begin from highly idealised initial conditions, such as a uniform-density sphere with an artificially imposed turbulent velocity field. While the resulting structures may appear qualitatively similar to those detected in continuum and line observations, it is unclear whether they are genuinely representative of real molecular clouds. Recent observational work has discovered a tight, often close-to-linear relationship between the integrated intensity of molecular lines and the total column density of the cloud material. We combine magnetohydrodynamical simulations, time-dependent chemistry, and radiative transfer to produce synthetic molecular line observations of model clouds. We find similarly tight correlations between line intensity and column density to those observed, although the linear behaviour is only seen in isolated (as opposed to colliding) model clouds. This linear relationship is not due to optically thin emission; all lines investigated have high optical depths, and the increase in integrated intensity with column density is due to higher velocity dispersion along the line of sight. Overall, the idealised models commonly used in the literature appear to be reasonably accurate representations of real molecular clouds.

Submitted 12 January, 2023; originally announced January 2023.

Comments: 10 pages, 10 figures. MNRAS accepted. Data publicly available at http://cloudzoo.astro.cf.ac.uk/Downloads/202211/

arXiv:2301.00828 [pdf, other]

doi 10.1093/mnras/stad188

From dark matter halos to pre-stellar cores: High resolution follow-up of cosmological Lyman-Werner simulations

Authors: Lewis R. Prole, Anna T. P. Schauer, Paul C. Clark, Simon C. O. Glover, Felix D. Priestley, Ralf S. Klessen

Abstract: Molecular hydrogen allows cooling in primordial gas, facilitating its collapse into Population III stars within primordial halos. Lyman-Werner (LW) radiation from these stars can escape the halo and delay further star formation by destroying H$_2$ in other halos. As cosmological simulations show that increasing the background LW field strength increases the average halo mass required for star formation, we perform follow-up simulations of selected halos to investigate the knock-on effects this has on the Population III IMF. We follow 5 halos for each of the $J_{21}$ = 0, 0.01 and 0.1 LW field strengths, resolving the pre-stellar core density of $10^{-6}$ g cm$^{-3}$ (10$^{18}$ cm$^{-3}$) before inserting sink particles and following the fragmentation behaviour for hundreds of years further. We find that the mass accreted onto sinks by the end of the simulations is proportional to the mass within the $\sim 10^{-2}$ pc molecular core, which is not correlated to the initial mass of the halo. As such, the IMFs for masses above the brown dwarf limit show little dependence on the LW strength, although they do show variance in the number of low-mass clumps formed. As the range of background LW field strengths tested here covers the most likely values from literature, we conclude that the IMF for so-called Pop III.2 stars is not significantly different from the initial population of Pop III.1 stars. The primordial IMF therefore likely remains unchanged until the formation of the next generation of Population II stars.

Submitted 19 January, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: MNRAS Accepted 2023 January 16 ref. MN-22-5075-MJ.R2

arXiv:2212.13327 [pdf, other]

CM Elliptic Curves: Volcanoes, Reality and Applications, Part II

Authors: Pete L. Clark, Frederick Saia

Abstract: Let $M \mid N$ be positive integers, and let $Δ$ be the discriminant of an order in an imaginary quadratic field $K$. When $Δ_K < -4$, the first author determined the fiber of the morphism $X_0(M,N) \rightarrow X(1)$ over the closed point $J_Δ$ corresponding to $Δ$ and showed that all fibers of the map $X_1(M,N) \rightarrow X_0(M,N)$ over $J_Δ$ were connected. Here we complement this prior work by addressing the most difficult cases $Δ_K \in \{-3,-4\}$. These works provide all the information needed to compute, for each positive integer $d$, all subgroups of $E(F)[\operatorname{tors}]$, where $F$ is a number field of degree $d$ and $E_{/F}$ is an elliptic curve with complex multiplication.

Submitted 26 December, 2022; originally announced December 2022.

Comments: 37 pages

arXiv:2212.13316 [pdf, other]

CM Elliptic Curves: Volcanoes, Reality and Applications

Authors: Pete L. Clark

Abstract: For positive integers $M \mid N$ and an order of discriminant $Δ$ in an imaginary quadratic field $K$ with discriminant $Δ_K < -4$, we determine the fiber of the morphism $X_0(M,N) \rightarrow X(1)$ over the closed point $J_Δ$ corresponding to $Δ$. We also show that the fiber of the natural map $X_1(M,N) \rightarrow X_0(M,N)$ over $J_Δ$ is connected. Putting this together we deduce the number of points in the fiber of $X_1(M,N) \rightarrow X(1)$ over $J_Δ$ and their residual degrees. In the continuation of this work with F. Saia, these results will be extended to $Δ_K \in \{-4,-3\}$. These works provide all the information needed to compute, for each positive integer $d$, all subgroups of $E(F)[\operatorname{tors}]$, where $F$ is a number field of degree $d$ and $E_{/F}$ is an elliptic curve with complex multiplication (CM).

Submitted 26 December, 2022; originally announced December 2022.

Comments: 105 pages

arXiv:2212.10029 [pdf, other]

Do language models have coherent mental models of everyday things?

Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark

Abstract: When people think of everyday things like an egg, they typically have a mental image associated with it. This allows them to correctly judge, for example, that "the yolk surrounds the shell" is a false statement. Do language models similarly have a coherent picture of such everyday things? To investigate this, we propose a benchmark dataset consisting of 100 everyday things, their parts, and the relationships between these parts, expressed as 11,720 "X relation Y?" true/false questions. Using these questions as probes, we observe that state-of-the-art pre-trained language models (LMs) like GPT-3 and Macaw have fragments of knowledge about these everyday things, but do not have fully coherent "parts mental models" (54-59% accurate, 19-43% conditional constraint violation). We propose an extension where we add a constraint satisfaction layer on top of the LM's raw predictions to apply commonsense constraints. As well as removing inconsistencies, we find that this also significantly improves accuracy (by 16-20%), suggesting how the incoherence of the LM's pictures of everyday things can be significantly reduced.

Submitted 8 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: ACL 2023

Showing 1–50 of 334 results for author: Clark, P