
Lila: A Unified Benchmark for Mathematical Reasoning

Authors: Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan

Abstract: Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets with task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best-performing model only obtains 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
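LILA pairs each task instance with an instruction and a Python program whose output is the answer, so the program doubles as an explanation. The record below is a minimal sketch of that idea only; the field names (`instruction`, `question`, `program`, `answer`), the `solution()` entry point, and the example problem are hypothetical and not taken from the released dataset.

```python
# Hypothetical LILA-style instance: field names and content are illustrative only.
instance = {
    "instruction": "Solve the word problem and output a number.",
    "question": "A shop sells pencils at 3 for $2. How much do 12 pencils cost?",
    "program": (
        "def solution():\n"
        "    price_per_group = 2\n"
        "    groups = 12 // 3\n"
        "    return price_per_group * groups\n"
    ),
    "answer": 8,
}


def run_program(program_source: str):
    """Execute the program text and return the value produced by solution()."""
    namespace: dict = {}
    exec(program_source, namespace)  # only safe for trusted, self-authored code
    return namespace["solution"]()


predicted = run_program(instance["program"])
print(predicted, predicted == instance["answer"])  # prints: 8 True
```

Checking a model's generated program this way gives both an answer and a step-by-step trace a reader can inspect.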

Submitted 8 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2210.16407

Just-DREAM-about-it: Figurative Language Understanding with DREAM-FLUTE

Authors: Yuling Gu, Yao Fu, Valentina Pyatkin, Ian Magnusson, Bhavana Dalvi Mishra, Peter Clark

Abstract: Figurative language (e.g., "he flew like the wind") is challenging to understand, as it is hard to tell what implicit information is being conveyed from the surface form alone. We hypothesize that to perform this task well, the reader needs to mentally elaborate the scene being described to identify a sensible meaning of the language. We present DREAM-FLUTE, a figurative language understanding system that does this, first forming a "mental model" of situations described in a premise and hypothesis before making an entailment/contradiction decision and generating an explanation. DREAM-FLUTE uses an existing scene elaboration model, DREAM, for constructing its "mental model." In the FigLang2022 Shared Task evaluation, DREAM-FLUTE achieved (joint) first place (Acc@60=63.3%), and can perform even better with ensemble techniques, demonstrating the effectiveness of this approach. More generally, this work suggests that adding a reflective component to pretrained language models can improve their performance beyond standard fine-tuning (3.3% improvement in Acc@60).

Submitted 28 October, 2022; originally announced October 2022.

Comments: Accepted at The Third Workshop on Figurative Language Processing @ EMNLP 2022

arXiv:2210.12217

Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning

Authors: Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark

Abstract: Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning. Such a capability would allow better understanding of why a model produced the answer it did. Our approach is to recursively combine a trained backward-chaining model, capable of generating a set of premises entailing an answer hypothesis, with a verifier that checks that the model itself believes those premises (and the entailment itself) through self-querying. To our knowledge, this is the first system to generate multistep chains that are both faithful (the answer follows from the reasoning) and truthful (the chain reflects the system's own internal beliefs). In evaluation using two different datasets, users judge that a majority (70%+) of generated chains clearly show how an answer follows from a set of facts, substantially better than a high-performance baseline, while preserving answer accuracy. By materializing model beliefs that systematically support an answer, new opportunities arise for understanding the model's system of belief, and diagnosing and correcting its misunderstandings when an answer is wrong.
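A minimal sketch of the recursive control loop described above: a backward-chaining generator proposes premises for a hypothesis, and a verifier accepts the step only if the model believes each premise and the entailment. Both components are stubbed here with hypothetical lookup tables (`PREMISES`, `BELIEF`, `ENTAILMENT_SCORE`), whereas in Entailer they are queries to the same trained model; the 0.5 threshold and the depth limit are also assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical stand-ins for the model's premise generator and self-query scores.
PREMISES: Dict[str, List[str]] = {
    "a penguin cannot fly": ["a penguin is a bird", "penguins have flippers, not flight wings"],
}
BELIEF: Dict[str, float] = {
    "a penguin cannot fly": 0.6,
    "a penguin is a bird": 0.95,
    "penguins have flippers, not flight wings": 0.9,
}
ENTAILMENT_SCORE: Dict[str, float] = {"a penguin cannot fly": 0.85}
THRESHOLD = 0.5


def believes(statement: str) -> bool:
    """Verifier stub: does the 'model' believe this statement?"""
    return BELIEF.get(statement, 0.0) >= THRESHOLD


def prove(hypothesis: str, depth: int = 2) -> Optional[dict]:
    """Backward-chain: expand into believed premises, else accept as a believed leaf."""
    premises = PREMISES.get(hypothesis)
    if depth > 0 and premises and ENTAILMENT_SCORE.get(hypothesis, 0.0) >= THRESHOLD:
        subproofs = [prove(p, depth - 1) for p in premises]
        if all(subproofs):  # every premise must itself be proved or believed
            return {"hypothesis": hypothesis, "premises": subproofs}
    # Fall back to a leaf: accept the hypothesis directly if the model believes it.
    return {"hypothesis": hypothesis, "premises": []} if believes(hypothesis) else None


tree = prove("a penguin cannot fly")
print(tree)
```

The returned tree is the "chain of reasoning" a user would inspect; a `None` result means no chain the model believes could be found.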

Submitted 21 October, 2022; originally announced October 2022.

Comments: accepted at EMNLP 2022. arXiv admin note: substantial text overlap with arXiv:2204.13074

arXiv:2210.02406

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

Authors: Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal

Abstract: Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT-3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on long-context multi-hop QA, we can more effectively teach the sub-tasks via our separate sub-task prompts; and on open-domain multi-hop QA, we can incorporate symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks.
Datasets, code, and prompts are available at https://github.com/allenai/DecomP.
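Decomposed Prompting routes sub-tasks to a library of handlers, any of which may be an LLM prompt, a trained model, or a symbolic function. The sketch below illustrates only that control flow on a toy letter-concatenation task, with plain Python functions standing in for every handler; the task, the handler names (`split`, `letter`, `merge`), and the fixed decomposition plan are assumptions for illustration, not the authors' prompts, which are in the DecomP repository.

```python
from typing import Callable, Dict, List


# Hypothetical sub-task handlers. In Decomposed Prompting each could be a
# few-shot LLM prompt; here they are symbolic functions so the sketch runs.
def split_words(text: str) -> List[str]:
    return text.split()


def kth_letter(word: str, k: int) -> str:
    return word[k]


def merge(letters: List[str]) -> str:
    return "".join(letters)


HANDLERS: Dict[str, Callable] = {
    "split": split_words,
    "letter": kth_letter,
    "merge": merge,
}


def decomposer(text: str, k: int) -> str:
    """Fixed decomposition plan: split -> per-word letter lookup -> merge.

    A real decomposer would itself be a prompted LLM that emits this plan.
    """
    words = HANDLERS["split"](text)
    letters = [HANDLERS["letter"](w, k) for w in words]
    return HANDLERS["merge"](letters)


print(decomposer("Pat Lee Ann", -1))  # prints: ten
```

Because the decomposer only looks handlers up by name, replacing one (say, a weak `letter` prompt with a symbolic function) changes a single dictionary entry, which is the kind of modularity the abstract describes.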

Submitted 11 April, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: ICLR'23 Camera Ready

arXiv:2210.00720

Complexity-Based Prompting for Multi-Step Reasoning

Authors: Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot

Abstract: We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority answer among the complex reasoning chains (over simple chains). When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BigBenchHard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient.
Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.
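A minimal sketch of the two selection rules described above, assuming reasoning complexity is measured as the number of non-empty lines in a chain (one step per line). The function names, the line-count proxy, and the choice of `top_k` are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from typing import List, Tuple


def num_steps(chain: str) -> int:
    """Complexity proxy: count non-empty lines, treating each as one reasoning step."""
    return sum(1 for line in chain.splitlines() if line.strip())


def select_prompt_examples(examples: List[str], k: int) -> List[str]:
    """Complexity-based prompting: keep the k exemplars with the most steps."""
    return sorted(examples, key=num_steps, reverse=True)[:k]


def complexity_vote(samples: List[Tuple[str, str]], top_k: int) -> str:
    """Complexity-based decoding: majority vote over the answers of the
    top_k most complex sampled chains. Each sample is (chain, answer)."""
    ranked = sorted(samples, key=lambda s: num_steps(s[0]), reverse=True)
    votes = Counter(answer for _, answer in ranked[:top_k])
    return votes.most_common(1)[0][0]


samples = [
    ("a\nb", "5"),
    ("a\nb", "5"),
    ("a\nb\nc\nd", "7"),
    ("a\nb\nc\nd\ne", "7"),
    ("a", "5"),
]
print(complexity_vote(samples, top_k=2))  # prints: 7
```

In this toy input a plain majority vote over all five samples would return "5"; restricting the vote to the two longest chains returns "7", which is the behavioral difference the complexity-based criterion introduces.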

Submitted 30 January, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Preprint

arXiv:2209.14610

Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

Authors: Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, Ashwin Kalyan

Abstract: Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example.
Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples.
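PromptPG trains a policy that selects in-context examples with policy gradient, using the correctness of the resulting prediction as the reward. The toy below shows only the REINFORCE update for a softmax policy over a fixed pool of candidate examples. It is a deliberate simplification: it drops the problem-conditioned encoder the paper uses, and the `reward` function (candidate 2 is "helpful", the rest are not) is a made-up stand-in for querying GPT-3 and checking its answer.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(choice: int) -> float:
    """Hypothetical reward: +1 if the chosen example leads to a correct answer."""
    return 1.0 if choice == 2 else -1.0


def train(num_candidates: int = 4, steps: int = 500, lr: float = 0.1, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    logits = [0.0] * num_candidates
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one in-context example from the current policy.
        choice = rng.choices(range(num_candidates), weights=probs)[0]
        r = reward(choice)
        # REINFORCE: d log pi(choice) / d logit_j = 1[j == choice] - p_j.
        for j in range(num_candidates):
            logits[j] += lr * r * ((1.0 if j == choice else 0.0) - probs[j])
    return softmax(logits)


probs = train()
print(max(range(len(probs)), key=probs.__getitem__))  # prints: 2
```

The learned distribution concentrates on the rewarded candidate, which is the mechanism behind the reduced variance reported above; the real method additionally conditions the policy on the test problem.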

Submitted 2 March, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: ICLR 2023. 26 pages and 18 figures. The data and code are available at https://promptpg.github.io

arXiv:2209.09513

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Authors: Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan

Abstract: When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from less data and achieve the same performance with just 40% of the data.
The data and code are available at https://scienceqa.github.io.

Submitted 17 October, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

Comments: Accepted to NeurIPS 2022. 22 pages, 17 figures, 9 tables. Project: https://scienceqa.github.io

arXiv:2209.07662

NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning

Authors: Nathaniel Weir, Peter Clark, Benjamin Van Durme

Abstract: Our goal is a modern approach to answering questions via systematic reasoning where answers are supported by human interpretable proof trees grounded in an NL corpus of authoritative facts. Such a system would help alleviate the challenges of interpretability and hallucination with modern LMs, and the lack of grounding of current explanation methods (e.g., Chain-of-Thought). This paper proposes a new take on Prolog-based inference engines, where we replace handcrafted rules with a combination of neural language modeling, guided generation, and semiparametric dense retrieval. Our implementation, NELLIE, is the first system to demonstrate fully interpretable, end-to-end grounded QA as entailment tree proof search, going beyond earlier work explaining known-to-be-true facts from text. In experiments, NELLIE outperforms a similar-sized state-of-the-art reasoner [Tafjord et al., 2022] while producing knowledge-grounded explanations. We also find NELLIE can exploit both semi-structured and NL text corpora to guide reasoning. Together these suggest a new way to jointly reap the benefits of both modern neural methods and traditional symbolic reasoning.

Submitted 21 December, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2206.11919

doi 10.1093/mnras/stac2327

Primordial magnetic fields in Population III star formation: a magnetised resolution study

Authors: Lewis Prole, Paul Clark, Ralf Klessen, Simon Glover, Ruediger Pakmor

Abstract: Population III stars form in groups due to the fragmentation of primordial gas. While uniform magnetic fields have been shown to support against fragmentation in present-day star formation, it is unclear whether realistic $k^{3/2}$ primordial fields can have the same effect. We bypass the issues associated with simulating the turbulent dynamo by introducing a saturated magnetic field at equipartition with the velocity field when the central density reaches $10^{-13}$ g cm$^{-3}$. We test a range of sink particle creation densities from $10^{-10}$ to $10^{-8}$ g cm$^{-3}$. Within the range tested, the fields did not suppress fragmentation of the gas and hence could not prevent the degree of fragmentation from increasing with increased resolution. The number of sink particles formed and total mass in sink particles was unaffected by the magnetic field across all seed fields and resolutions. The magnetic pressure remained sub-dominant to the gas pressure except in the highest density regions of the simulation box, where it became equal to but never exceeded gas pressure. Our results suggest that the inclusion of magnetic fields in numerical simulations of Pop III star formation is largely unimportant.
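For orientation, and assuming Gaussian (CGS) units and that "equipartition with the velocity field" means equal magnetic and kinetic energy densities (the abstract does not spell out the convention), the saturated field strength follows from:

```latex
\frac{B^2}{8\pi} = \frac{1}{2}\rho v^2
\quad\Longrightarrow\quad
B = v\sqrt{4\pi\rho}
```

In the same units, $B^2/8\pi$ is also the magnetic pressure that the abstract compares with the gas pressure.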

Submitted 23 June, 2022; originally announced June 2022.

Comments: Submitted to MNRAS

arXiv:2206.00049

doi 10.1093/mnras/stac2673

The nuclear transient AT 2017gge: a tidal disruption event in a dusty and gas-rich environment and the awakening of a dormant SMBH

Authors: F. Onori, G. Cannizzaro, P. G. Jonker, M. Kim, M. Nicholl, S. Mattila, T. M. Reynolds, M. Fraser, T. Wevers, E. Brocato, J. P. Anderson, R. Carini, P. Charalampopoulos, P. Clark, M. Gromadzki, C. P. Gutiérrez, N. Ihanec, C. Inserra, A. Lawrence, G. Leloudas, P. Lundqvist, T. E. Müller-Bravo, S. Piranomonte, M. Pursiainen, K. A. Rybicki, et al. (6 additional authors not shown)

Abstract: We present the results from a dense multi-wavelength (optical/UV, near-infrared (IR), and X-ray) follow-up campaign of the nuclear transient AT2017gge, covering a total of 1698 days from the transient's discovery. The bolometric lightcurve, the black body temperature and radius, the broad H and He I $λ$5876 emission lines and their evolution with time, are all consistent with a tidal disruption event (TDE) nature. A soft X-ray flare is detected with a delay of $\sim$200 days with respect to the optical/UV peak and it is rapidly followed by the emergence of a broad He II $λ$4686 and by a number of long-lasting high ionization coronal emission lines. This indicates a clear connection between a TDE flare and the appearance of extreme coronal line emission (ECLEs). An IR echo, resulting from dust re-radiation of the optical/UV TDE light, is observed after the X-ray flare, and the associated near-IR spectra show a transient broad feature at the wavelength of He I $λ$10830 and, for the first time in a TDE, a transient high-ionization coronal NIR line (the [Fe XIII] $λ$10798) is also detected. The data are well explained by a scenario in which a TDE occurs in a gas- and dust-rich environment and its optical/UV, soft X-ray, and IR emission have different origins and locations. The optical emission may be produced by stellar debris stream collisions prior to the accretion disk formation, which is instead responsible for the soft X-ray flare, emitted after the end of the circularization process.

Submitted 9 September, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

Comments: Accepted for publication in MNRAS

arXiv:2204.13074

Towards Teachable Reasoning Systems: Using a Dynamic Memory of User Feedback for Continual System Improvement

Authors: Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Clark

Abstract: Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time. Our approach is to augment a QA model with a dynamic memory of user feedback, containing user-supplied corrections to erroneous model beliefs that users identify during interaction. Retrievals from memory are used as additional context for QA, to help avoid previous mistakes in similar new situations, a novel application of memory-based continuous learning. With simulated feedback, we find that our system (called TeachMe) continually improves with time, and without model retraining, requiring feedback on only 25% of training examples to reach within 1% of the upper-bound (feedback on all examples). Similarly, in experiments with real users, we observe a similar trend, with performance improving by over 15% on a hidden test set after teaching. This suggests new opportunities for using frozen language models in an interactive setting where users can inspect, debug, and correct the model's beliefs, leading to improved system performance over time.

Submitted 21 October, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

Comments: accepted at EMNLP 2022

arXiv:2204.09148

What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment

Authors: Matthew Finlayson, Kyle Richardson, Ashish Sabharwal, Peter Clark

Abstract: The instruction learning paradigm -- where a model learns to perform new tasks from task descriptions alone -- has become popular in general-purpose model research. The capabilities of large transformer models as instruction learners, however, remain poorly understood. We use a controlled synthetic environment to characterize such capabilities. Specifically, we use the task of deciding whether a given string matches a regular expression (viewed as an instruction) to identify properties of tasks, instructions, and instances that make instruction learning challenging. For instance, we find that our model, a fine-tuned T5-based text2text transformer, struggles with large regular languages, suggesting that less precise instructions are challenging for models. Additionally, instruction executions that require tracking longer contexts of prior steps are also more difficult. We use our findings to systematically construct a challenging instruction learning dataset, which we call Hard RegSet. Fine-tuning on Hard RegSet, our large transformer learns to correctly interpret only 65.6% of test instructions (with at least 90% accuracy), and 11%-24% of the instructions in out-of-distribution generalization settings. We propose Hard RegSet as a challenging instruction learning task, and a controlled environment for studying instruction learning.
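The underlying task is a binary decision: given a regular expression (the instruction) and a string (the instance), does the whole string match? The sketch below produces gold labels for such pairs with Python's `re.fullmatch`; the patterns and strings are made-up examples rather than Hard RegSet items, and the regex syntax used in the paper may differ from Python's.

```python
import re
from typing import List, Tuple


def label(pattern: str, string: str) -> bool:
    """Gold label: True iff the whole string is in the pattern's language."""
    return re.fullmatch(pattern, string) is not None


# Hypothetical (instruction, instance) pairs over the alphabet {a, b}.
pairs: List[Tuple[str, str]] = [
    ("(ab)*", "ababab"),
    ("(ab)*", "abba"),
    ("a*b+", "aaabb"),
    ("(a|b)*", "babba"),  # a very large language: accepts every string over {a, b}
]

for pattern, string in pairs:
    print(f"{pattern!r:10} {string!r:10} -> {label(pattern, string)}")
```

A pattern such as `(a|b)*` accepts everything, so it says little about which strings to reject; that is one way to read the abstract's observation that large regular languages correspond to less precise instructions.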

Submitted 24 May, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Typos corrected, rewordings

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2204.05660

NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

Authors: Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, Ashwin Kalyan

Abstract: Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when the same problems appear in a slightly different scenario. Drawing inspiration from GLUE, which was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that at their core require simple arithmetic understanding. We show that this benchmark is far from being solved, with neural models including state-of-the-art large-scale language models performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data, as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.

Submitted 12 April, 2022; originally announced April 2022.

Comments: ACL 2022

arXiv:2203.05839

doi 10.1016/j.astropartphys.2022.102695

Inter-Calibration of Atmospheric Cherenkov Telescopes with UAV-based Airborne Calibration System

Authors: A. M. Brown, J. Muller, M. de Naurois, P. Clark

Abstract: The recent advances in the flight capability of remotely piloted aerial vehicles (hereafter referred to as UAVs) have afforded the astronomical community the possibility of a new telescope calibration technique: UAV-based calibration. Building upon a feasibility study which characterised the potential that a UAV-based calibration system has for the future Cherenkov Telescope Array, we created a first-generation UAV-calibration prototype and undertook a field-campaign of inter-calibrating the sensitivity of the H.E.S.S. telescope array with two successful calibration flights. In this paper we report the key results of our first test campaign: firstly, by comparing the intensity of the UAV-calibration events, as recorded by the individual HESS-I cameras, we find that a UAV-based inter-calibration is consistent with the standard muon inter-calibration technique at the level of 5.4% and 5.8% for the two individual UAV-calibration runs. Secondly, by comparing the position of the UAV-calibration signal on the camera focal plane, for a variety of telescope pointing models, we were able to constrain the pointing accuracy of the HESS-I telescopes to the level of tens of arcseconds. This is consistent with the pointing accuracy derived from other pointing calibration methods.
Importantly, both the inter-calibration and pointing accuracy results were achieved with a first-generation UAV-calibration prototype, which alludes to the potential of the technique and highlights that a UAV-based system is a viable calibration technique for current and future ground-based $γ$-ray telescope arrays.

Submitted 11 March, 2022; originally announced March 2022.

Comments: 30 pages, accepted for publication in Astroparticle Physics

Journal ref: Astroparticle Physics, Volume 140, July 2022, 102695

arXiv:2201.11236

Restricted Variable Chevalley-Warning Theorems

Authors: Anurag Bishnoi, Pete L. Clark

Abstract: We pursue various restricted variable generalizations of the Chevalley-Warning theorem for low degree polynomial systems over a finite field. Our first such result involves variables restricted to Cartesian products of the Vandermonde subsets of $\mathbb{F}_q$ defined by Gács-Weiner and Sziklai-Takáts. We then define an invariant $\uomega(X)$ of a nonempty subset of $\mathbb{F}_q^n$. Our second result involves $X$-restricted variables when the degrees of the polynomials are small compared to $\uomega(X)$. We end by exploring various classes of subsets for which $\uomega(X)$ can be bounded from below.
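For context, the classical (unrestricted) statement that these results generalize is the Chevalley-Warning theorem, in which the variables range over all of $\mathbb{F}_q^n$ rather than a restricted subset $X$:

```latex
\textbf{Theorem (Chevalley-Warning).}
Let $\mathbb{F}_q$ have characteristic $p$, and let
$f_1, \dots, f_r \in \mathbb{F}_q[t_1, \dots, t_n]$ satisfy
$\sum_{i=1}^{r} \deg f_i < n$. Put
$Z = \{\, x \in \mathbb{F}_q^n : f_1(x) = \dots = f_r(x) = 0 \,\}$. Then
\[
  \#Z \equiv 0 \pmod{p}.
\]
In particular, if the $f_i$ have a common zero (for instance, when none of
them has a constant term, so that $0 \in Z$), then they have at least one other.
```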

Submitted 26 January, 2022; originally announced January 2022.

Comments: 13 pages

arXiv:2201.06009 [pdf, other]

Memory-assisted prompt editing to improve GPT-3 after deployment

Authors: Aman Madaan, Niket Tandon, Peter Clark, Yiming Yang

Abstract: Large LMs such as GPT-3 are powerful, but can commit mistakes that are obvious to humans. For example, GPT-3 would mistakenly interpret "What word is similar to good?" to mean a homophone, while the user intended a synonym. Our goal is to effectively correct such errors via user interactions with the system but without retraining, which would be prohibitively costly. We pair GPT-3 with a growing memory of recorded cases where the model misunderstood the user's intent, along with user feedback for clarification. Such a memory allows our system to produce enhanced prompts for any new query based on the user feedback for error correction on similar cases in the past. On four tasks (two lexical tasks, two advanced ethical reasoning tasks), we show how a (simulated) user can interactively teach a deployed GPT-3, substantially increasing its accuracy over the queries with different kinds of misunderstandings by GPT-3. Our approach is a step towards low-cost utility enhancement for very large pre-trained LMs. Code, data, and instructions to implement MEMPROMPT for a new task are available at https://www.memprompt.com/.
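The memory-plus-prompt loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the MEMPROMPT implementation: the `FeedbackMemory` class, the `build_prompt` helper, the string-similarity retrieval via Python's `difflib.SequenceMatcher`, and the 0.6 threshold are all hypothetical choices, and no language model is called.

```python
from difflib import SequenceMatcher


class FeedbackMemory:
    """Hypothetical memory of (past query, user clarification) pairs."""

    def __init__(self, threshold=0.6):
        self.entries = []           # list of (query, feedback) tuples
        self.threshold = threshold  # minimum similarity needed to reuse feedback

    def add(self, query, feedback):
        self.entries.append((query, feedback))

    def lookup(self, query):
        """Return feedback attached to the most similar past query, or None."""
        best, best_score = None, 0.0
        for past_query, feedback in self.entries:
            # Assumed retrieval: character-level similarity, case-insensitive.
            score = SequenceMatcher(None, query.lower(), past_query.lower()).ratio()
            if score > best_score:
                best, best_score = feedback, score
        return best if best_score >= self.threshold else None


def build_prompt(memory, query):
    """Append retrieved clarification (if any) to the new query."""
    feedback = memory.lookup(query)
    if feedback is None:
        return query
    return f"{query} (clarification: {feedback})"


memory = FeedbackMemory()
memory.add("What word is similar to good?", "similar to means synonym, not homophone")
print(build_prompt(memory, "What word is similar to happy?"))
```

A real system would more plausibly retrieve with learned embeddings; plain string similarity is used here only to keep the sketch dependency-free.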

Submitted 18 February, 2023; v1 submitted 16 January, 2022; originally announced January 2022.

Comments: EMNLP 2022. This version updates the title to be consistent with EMNLP camera ready

arXiv:2201.02763 [pdf, ps, other]

Functional Degrees And Arithmetic Applications, I: The Set Of Functional Degrees

Authors: P. L. Clark, U. Schauz

Abstract: We give a further development of the Aichinger-Moosbauer calculus of functional degrees of maps between commutative groups. For any given commutative groups $A$ and $B$, we compute the largest possible finite functional degree that a map $f: A \longrightarrow B$ can have. We also determine the set of all possible degrees of such maps. This also yields a solution to Aichinger and Moosbauer's problem of finding the nilpotency index of the augmentation ideal of group rings of the form $Z_{p^β}[Z_{p^{α_1}}\times Z_{p^{α_2}}\times\dotsm\times Z_{p^{α_n}}]$ with $p,β,n,α_1,\dotsc,α_n\in\mathbb{Z}^+$, $p$ prime.

Submitted 8 January, 2022; originally announced January 2022.

Comments: 21 pages

MSC Class: 20K01; 13F20; 20C05

arXiv:2112.10800 [pdf, other]

doi 10.1093/mnras/stab3697

Fragmentation induced starvation in Population III star formation: a resolution study

Authors: Lewis R. Prole, Paul C. Clark, Ralf S. Klessen, Simon C. O. Glover

Abstract: The Population III initial mass function (IMF) is currently unknown, but recent studies agree that fragmentation of primordial gas gives a broader IMF than the initially suggested single star per halo. In this study we introduce sink particle mergers into Arepo to perform the first resolution study for primordial star formation simulations, and present the first Population III simulations to run up to densities of $10^{-6}$ g cm$^{-3}$ for hundreds of years after the formation of sink particles. The total number of sinks formed increases with increasing sink particle creation density, without achieving numerical convergence. The total mass in sinks remains invariant to the maximum resolution and is safe to estimate using low resolution studies. This results in an IMF that shifts towards lower masses with increasing resolution. Greater numbers of sinks cause increased fragmentation-induced starvation of the most massive sink, yielding lower accretion rates, masses and ionising photons emitted per second. The lack of convergence up to densities two orders of magnitude higher than those of all relevant chemical reactions suggests that the number of sinks will continue to grow with increasing resolution until H$_2$ is fully dissociated and the collapse becomes almost adiabatic at $10^{-4}$ g cm$^{-3}$. These results imply that many Population III studies utilising sink particles have produced IMFs which have overestimated the masses of primordial stars, and underestimated the number of stars formed.
In the highest resolution runs, sinks with masses capable of surviving until the present day had an ejection fraction of 0.21.

Submitted 20 December, 2021; originally announced December 2021.

Comments: Accepted in MNRAS

arXiv:2112.09737 [pdf, other]

Learning to Repair: Repairing model output errors after deployment using a dynamic memory of feedback

Authors: Niket Tandon, Aman Madaan, Peter Clark, Yiming Yang

Abstract: Large language models (LMs), while powerful, are not immune to mistakes, but can be difficult to retrain. Our goal is for an LM to continue to improve after deployment, without retraining, using feedback from the user. Our approach pairs an LM with (i) a growing memory of cases where the user identified an output error and provided general feedback on how to correct it, and (ii) a corrector model, trained to translate this general feedback into specific edits to repair the model output. Given a new, unseen input, our model can then use feedback from similar, past cases to repair output errors that may occur. We instantiate our approach using an existing, fixed model for script generation that takes a goal (e.g., "bake a cake") and generates a partially ordered sequence of actions to achieve that goal, sometimes containing errors. Our memory-enhanced system, FBNet, learns to apply user feedback to repair such errors (up to 30 points improvement), while making a start at avoiding similar past mistakes on new, unseen examples (up to 7 points improvement in a controlled setting). This is a first step towards strengthening deployed models, potentially broadening their utility. Our code and data are available at https://github.com/allenai/interscript/.

Submitted 9 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: NAACL 2022 (Findings)

arXiv:2112.08656 [pdf, other]

DREAM: Improving Situational QA by First Elaborating the Situation

Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark

Abstract: When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests that they form a mental picture of that situation before answering. While we do not know how language models (LMs) answer such questions, we conjecture that they may answer more accurately if they are also provided with additional details about the question situation, elaborating the "scene". To test this conjecture, we train a new model, DREAM, to answer questions that elaborate the scenes that situated questions are about, and then provide those elaborations as additional context to a question-answering (QA) model. We find that DREAM is able to create better scene elaborations (more accurate, useful, and consistent) than a representative state-of-the-art, zero-shot model (Macaw). We also find that using the scene elaborations as additional context improves the answer accuracy of a downstream QA system, including beyond that obtainable by simply further finetuning the QA system on DREAM's training data. These results suggest that adding focused elaborations about a situation can improve a system's reasoning about it, and may serve as an effective way of injecting new scenario-based knowledge into QA models. Finally, our approach is dataset-neutral; we observe improved QA performance across different models, with even bigger gains on models with fewer parameters. We make our dataset and model publicly available at https://github.com/allenai/dream.

Submitted 5 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: to be published in NAACL 2022

arXiv:2112.07867 [pdf, other]

Interscript: A dataset for interactive learning of scripts through error feedback

Authors: Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, Yiming Yang

Abstract: How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, Interscript, containing user feedback on a deployed model that generates complex everyday tasks. Interscript contains 8,466 data points -- the input is a possibly erroneous script and user feedback, and the output is a modified script. We posit two use-cases of Interscript that might significantly advance the state-of-the-art in interactive learning. The dataset is available at: https://github.com/allenai/interscript.

Submitted 15 December, 2021; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: AAAI'22-Workshop on Interactive Machine Learning

arXiv:2112.05543 [pdf, other]

doi 10.1093/mnras/stad202

On the density regime probed by HCN emission

Authors: Gerwyn H. Jones, Paul C. Clark, Simon C. O. Glover, Alvaro Hacar

Abstract: HCN J$\, =\,$1$\, -\,$0 emission is commonly used as a dense gas tracer, thought to mainly arise from gas with densities $\mathrm{\sim 10^4\ -\ 10^5\ cm^{-3}}$. This has made it a popular tracer in star formation studies. However, there is increasing evidence from observational surveys of `resolved' molecular clouds that HCN can trace more diffuse gas. We investigate the relationship between gas density and HCN emission through post-processing of high resolution magnetohydrodynamical simulations of cloud-cloud collisions. We find that HCN emission traces gas with a mean volumetric density of $\mathrm{\sim 3 \times 10^3\ cm^{-3}}$ and a median visual extinction of $\mathrm{\sim 5\ mag}$. We therefore predict a characteristic density that is an order of magnitude less than the "standard" characteristic density of $\mathrm{n \sim 3 \times 10^4\ cm^{-3}}$. Indeed, we find in some cases that there is clear HCN emission from the cloud even though there is no gas denser than this standard critical density. We derive luminosity-to-mass conversion factors for the amount of gas at $A_{\rm V} > 8$ or at densities $n > 2.85 \times 10^{3} \: {\rm cm^{-3}}$ or $n > 3 \times 10^{4} \: {\rm cm^{-3}}$, finding values of $α_{\rm HCN} = 6.79, 8.62$ and $27.98 \: {\rm M_{\odot}} ({\rm K \, km \, s^{-1} \, pc^{2}})^{-1}$, respectively. In some cases, the luminosity-to-mass conversion factor predicted mass in regions that in fact contain none.
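Applying these conversion factors is a single multiplication, $M = α_{\rm HCN} L_{\rm HCN}$. The sketch below uses the three $α_{\rm HCN}$ values quoted above with a hypothetical luminosity of 100 K km s$^{-1}$ pc$^2$ (an assumed input, not a value from the paper) to show how strongly the inferred mass depends on the adopted gas definition: the $n > 3 \times 10^4$ cm$^{-3}$ factor gives a mass roughly four times larger than the $A_{\rm V} > 8$ factor.

```python
# Conversion factors quoted in the abstract, in M_sun (K km s^-1 pc^2)^-1,
# keyed by the definition of the gas being counted.
ALPHA_HCN = {
    "A_V > 8": 6.79,
    "n > 2.85e3 cm^-3": 8.62,
    "n > 3e4 cm^-3": 27.98,
}


def gas_mass(l_hcn, definition):
    """Mass (M_sun) of gas meeting `definition`, given L_HCN in K km s^-1 pc^2."""
    return ALPHA_HCN[definition] * l_hcn


l_hcn = 100.0  # hypothetical HCN luminosity in K km s^-1 pc^2
for definition in ALPHA_HCN:
    print(f"{definition}: {gas_mass(l_hcn, definition):.0f} M_sun")
```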

Submitted 18 January, 2023; v1 submitted 10 December, 2021; originally announced December 2021.

arXiv:2111.11805 [pdf, ps, other]

doi 10.1051/0004-6361/202140902

The Spatial Evolution of Young Massive Clusters III. Effect of the Gaia Filter on 2D Spatial Distribution Studies

Authors: Anne S. M. Buckner, Zeinab Khorrami, Marta González, Stuart L. Lumsden, Paul Clark, Estelle Moraux

Abstract: [Context.] Gaia is limited in the optical down to G~21 mag, so it is essential to understand the biases introduced by a magnitude-limited sample on spatial distribution studies. [Aims.] We ascertain how sample incompleteness in Gaia observations of young clusters affects the local spatial analysis tool INDICATE and subsequently the perceived spatial properties of these clusters. [Methods.] We created a mock Gaia cluster catalogue from a synthetic dataset using the observation generating tool MYOSOTIS. The effects of cluster distance, uniform and variable extinction, binary fraction, population masking by the point spread function wings of high-mass members, and contrast sensitivity limits on the trends identified by INDICATE are explored. A comparison of the typical index values derived by INDICATE for members of the synthetic dataset and their corresponding mock Gaia catalogue observations is made to identify any significant changes. [Results.] We typically find only small variations in the pre- and post-observation index values of cluster populations, which can increase as a function of incompleteness percentage and binarity. No significant strengthening, or false signatures, of stellar concentrations are found, but real signatures may be diluted.
Conclusions drawn about the spatial behaviour of Gaia-observed cluster populations that are, and are not, associated with their natal nebulosity are reliable for most clusters, but the perceived behaviours of individual members can change, so INDICATE should be used as a measure of spatial behaviours between members as a function of their intrinsic properties (e.g. mass, age, object type), rather than to draw conclusions about any specific observed member. [Conclusions.] INDICATE is a robust spatial analysis tool to reliably study Gaia-observed young cluster populations within 1 kpc, up to a sample incompleteness of 83.3% and binarity of 50%.

Submitted 15 December, 2021; v1 submitted 23 November, 2021; originally announced November 2021.

Comments: Accepted for publication in A&A. 20 pages, 4 figures, 2 manuscript tables, appendix with 9 reader reference tables

Journal ref: A&A 659, A72 (2022)

arXiv:2111.10356 [pdf, ps, other]

Investigation of generalised Fredholm equations and their direct solution subject to constraints

Authors: Peter Clark, Alastair Wood, Peter Olley

Abstract: This paper explores the solution of Fredholm-like equations with infinite-dimensional solution spaces. We set out to find a method for determining a particular solution to a Fredholm-like equation subject to a given constraint. The relevance and application come via a connection to certain dynamics of gas-like systems.

Submitted 19 November, 2021; originally announced November 2021.

arXiv:2110.14207 [pdf, other]

How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI

Authors: Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, Peter Clark

Abstract: Many real-world problems require the combined application of multiple reasoning abilities employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, "How much would the sea level rise if all ice in the world melted?" FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) a collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping with supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.

Submitted 20 December, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: Accepted for publication at EMNLP 2021, 11 pages, 5 tables, 4 figures

arXiv:2110.12349 [pdf, other]

Think about it! Improving defeasible reasoning by first modeling the question scenario

Authors: Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Peter Clark, Yiming Yang, Eduard Hovy

Abstract: Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. Existing cognitive science literature on defeasible reasoning suggests that a person forms a mental model of the problem scenario before answering questions. We ask whether neural models can similarly benefit from envisioning the question scenario before answering a defeasible query. Our approach is, given a question, to have a model first create a graph of relevant influences, and then leverage that graph as an additional input when answering the question. Our system, CURIOUS, achieves a new state-of-the-art on three different defeasible reasoning datasets. This result is significant as it illustrates that performance can be improved by guiding a system to "think about" a question and explicitly model the scenario, rather than answering reflexively. Code, data, and pre-trained models are located at https://github.com/madaan/thinkaboutit.

Submitted 24 October, 2021; originally announced October 2021.

Comments: EMNLP 2021

arXiv:2110.01398 [pdf]

Enabling Blockchain Scalability and Interoperability with Mobile Computing through LayerOne.X

Authors: Kevin Coutinho, Ponnie Clark, Ferdinand Azis, Norman Lip, Josh Hunt

Abstract: Interoperability and scalability are currently the bottlenecks preventing mass adoption of blockchain technology. The main goal of this project is to develop an interoperable and scalable network that promotes a truly decentralised, permissionless and secure blockchain and that also enables micro validation. Layer-One.X, a truly decentralised ledger which utilises para-sharding, Directed Acyclic Graphs, a Proof of Participation consensus mechanism, mobile computing, flash contracts and nucleus scripting, is introduced in this paper. The conceptual framework, including tokenomics, is also explained along with a number of use cases. The framework addresses the growing need for higher transactions per second, enabling micropayments and value transfer through tokenisation.

Submitted 30 September, 2021; originally announced October 2021.

Comments: 40 pages

arXiv:2109.14723 [pdf, other]

BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief

Authors: Nora Kassner, Oyvind Tafjord, Hinrich Schütze, Peter Clark

Abstract: Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually "believes" about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs -- a BeliefBank -- that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component -- a weighted MaxSAT solver -- revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.
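To illustrate what a weighted MaxSAT step does, the sketch below treats each raw answer as a soft clause weighted by the model's confidence and each consistency constraint as a weighted clause, then picks the truth assignment that maximizes the total weight of satisfied clauses. This is a hypothetical toy, not BeliefBank's actual encoding: the beliefs, weights, and the brute-force search (exponential in the number of beliefs; practical systems use a dedicated MaxSAT solver) are assumptions for illustration.

```python
from itertools import product


def weighted_maxsat(variables, clauses):
    """Brute-force weighted MaxSAT over boolean `variables`.

    `clauses` is a list of (weight, predicate) pairs; each predicate maps an
    assignment dict to True/False. Returns (best assignment, satisfied weight).
    Exponential in len(variables), so only suitable for toy examples.
    """
    best, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        score = sum(w for w, pred in clauses if pred(assignment))
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Hypothetical raw model answers (truth value, confidence) as soft unit clauses.
raw_beliefs = {"sparrow is a bird": (True, 0.9), "sparrow can fly": (False, 0.6)}
clauses = [
    (conf, lambda a, k=k, v=v: a[k] == v)  # default args bind k, v per clause
    for k, (v, conf) in raw_beliefs.items()
]
# Hypothetical constraint "bird -> can fly" with weight 0.8.
clauses.append((0.8, lambda a: (not a["sparrow is a bird"]) or a["sparrow can fly"]))

revised, score = weighted_maxsat(list(raw_beliefs), clauses)
print(revised, round(score, 2))
```

Keeping both raw answers satisfies weight 1.5, whereas flipping the lower-confidence "sparrow can fly" answer to True satisfies 1.7, so the solver revises that belief and leaves the higher-confidence one intact.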

Submitted 29 September, 2021; originally announced September 2021.

Comments: EMNLP 2021 Camera Ready. arXiv admin note: substantial text overlap with arXiv:2104.08401

arXiv:2109.06195 [pdf, other]

doi 10.1093/mnras/stac3751

Towards the impact of GMC collisions on the star formation rate

Authors: Glen H. Hunter, Paul C. Clark, Simon C. O. Glover, Ralf S. Klessen

Abstract: Collisions between giant molecular clouds (GMCs) are one of the pathways for massive star formation, due to the high densities created. However, the enhancement of the star formation rate (SFR) is not well constrained. In this study we perform a parameter study of cloud-cloud collisions, and investigate how the resulting SFR depends on the details of the set-up. Our parameter study explores variations in: collision speed; magnetic field inclination (with respect to the collisional axis); and resolution, as defined by the number of cells per Jeans length. In all our collision simulations we find a factor of 2-3 increase in the SFR compared to our no-collision simulation, with star formation beginning sooner with a) high collisional velocities, b) parallel orientation between the magnetic field and collision axis, and c) lower resolution. The mean virial parameter of high density (and thus possibly star-forming) gas increases with collisional velocity, but has little variation with magnetic field inclination. The alignment of the velocity and magnetic field remains uniform in low density environments but becomes more perpendicular with increasing density, indicating the compression of the magnetic field by collapsing gas. Comparing the trends in the SFR with other GMC collision studies, we find good agreement with studies that account for the gravitational boundedness of the gas in their star formation algorithm, but not with those that simply form stars above a prescribed density threshold.
This suggests that the latter approach should be used with caution when modelling star formation on resolved cloud scales.

Submitted 11 January, 2023; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: 20 pages, 15 figures, 3 tables; accepted for publication in MNRAS

arXiv:2109.02593 [pdf, other]

General-Purpose Question-Answering with Macaw

Authors: Oyvind Tafjord, Peter Clark

Abstract: Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available. In response, we present Macaw, a versatile, generative question-answering (QA) system that we are making available to the community. Macaw is built on UnifiedQA, itself built on T5, and exhibits strong performance, zero-shot, on a wide variety of topics, including outperforming GPT-3 by over 10% (absolute) on Challenge300, a suite of 300 challenge questions, despite being an order of magnitude smaller (11 billion vs. 175 billion parameters). In addition, Macaw allows different permutations ("angles") of its inputs and outputs to be used; for example, Macaw can take a question and produce an answer; or take an answer and produce a question; or take an answer and question, and produce multiple-choice options. We describe the system, and illustrate a variety of question types where it produces surprisingly good answers, well outside the training setup. We also identify question classes where it still appears to struggle, offering insights into the limitations of pretrained language models. Macaw is freely available at https://github.com/allenai/macaw, and we hope that it proves useful to the community.

Submitted 6 September, 2021; originally announced September 2021.

arXiv:2107.09034 [pdf, other]

doi 10.1093/mnras/stab2038

Probing the Progenitors of Type Ia Supernovae using Circumstellar Material Interaction Signatures

Authors: Peter Clark, Kate Maguire, Mattia Bulla, Lluís Galbany, Mark Sullivan, Joseph P. Anderson, Stephen J. Smartt

Abstract: This work aims to study different probes of Type Ia supernova progenitors that have been suggested to be linked to the presence of circumstellar material (CSM). In particular, we have investigated, for the first time, the link between narrow blueshifted NaID absorption profiles and the presence and strength of the broad high-velocity CaII near infrared triplet absorption features seen in Type Ia supernovae around maximum light. The two probes explore different distances from the supernova: NaID > 10$^{17}$ cm and high-velocity CaII features < 10$^{15}$ cm. For this, we have used a new intermediate-resolution X-shooter spectral sample of 15 Type Ia supernovae. We do not identify a link between these two probes, implying either that one (or both) is not physically related to the presence of CSM or that the occurrence of CSM at the distance explored by one probe is not linked to its presence at the distance probed by the other. However, the previously identified statistical excess in the presence of blueshifted (over redshifted) NaID absorption is confirmed in this sample at high significance and is found to be stronger in Type Ia supernovae hosted by late-type galaxies. This excess is difficult to explain as having an interstellar-medium origin, as has been suggested by some recent modelling, since such an origin is not expected to show a bias for blueshifted absorption. However, a circumstellar origin for these features also appears unsatisfactory based on our new results, given the lack of a link between the two probes of CSM investigated.

Submitted 19 July, 2021; originally announced July 2021.

Comments: This is a pre-copyedited, author-produced PDF of an article accepted for publication in Monthly Notices of the Royal Astronomical Society following peer review. 25 pages, 16 Figures

arXiv:2106.09386 [pdf, other]

Two-fluid single-column modelling of Rayleigh-Bénard convection as a step towards multi-fluid modelling of atmospheric convection

Authors: Daniel Shipley, Hilary Weller, Peter Clark, Will McIntyre

Abstract: Multi-fluid models have recently been proposed as an approach to improving the representation of convection in weather and climate models. This is an attractive framework as it is fundamentally dynamical, removing some of the assumptions of mass-flux convection schemes which are invalid at current model resolutions. However, it is still not understood how best to close the multi-fluid equations for atmospheric convection. In this paper we develop a simple two-fluid, single-column model with one rising and one falling fluid. No further modelling of sub-filter variability is included. We then apply this model to Rayleigh-Bénard convection, showing that, with minimal closures, the correct scaling of the heat flux (Nu) is predicted over six orders of magnitude of buoyancy forcing (Ra). This suggests that even a very simple two-fluid model can accurately capture the dominant coherent overturning structures of convection.

Submitted 17 June, 2021; originally announced June 2021.

Comments: 34 pages, 10 figures

arXiv:2105.07079 [pdf, ps, other]

Dynamic network analysis improves protein 3D structural classification

Authors: Khalique Newaz, Jacob Piland, Patricia L. Clark, Scott J. Emrich, Jun Li, Tijana Milenkovic

Abstract: Protein structural classification (PSC) is a supervised problem of assigning proteins to pre-defined structural (e.g., CATH or SCOPe) classes based on the proteins' sequence or 3D structural features. We recently proposed PSC approaches that model protein 3D structures as protein structure networks (PSNs) and analyze PSN-based protein features, which performed better than or comparably to state-of-the-art sequence or other 3D structure-based approaches in the task of PSC. However, existing PSN-based PSC approaches model the whole 3D structure of a protein as a static PSN. Because folding of a protein is a dynamic process, where some parts of a protein fold before others, modeling the 3D structure of a protein as a dynamic PSN might further help improve the existing PSC performance. Here, we propose for the first time a way to model 3D structures of proteins as dynamic PSNs, with the hypothesis that this will improve upon the current state-of-the-art PSC approaches that are based on static PSNs (and thus upon the existing state-of-the-art sequence and other 3D structural approaches). Indeed, we confirm this on 71 datasets spanning ~44,000 protein domains from CATH and SCOPe.

Submitted 14 May, 2021; originally announced May 2021.

arXiv:2104.12819 [pdf, other]

doi 10.1051/0004-6361/202140668

Extreme adaptive optics astrometry of R136. Searching for high proper motion stars

Authors: Zeinab Khorrami, M. Langlois, F. Vakili, P. C. Clark, A. S. M. Buckner, M. Gonzalez, P. Crowther, R. Wunsch, J. Palous, A. Boccaletti, S. Lumsden, E. Moraux

Abstract: We compared high-contrast near-infrared images of the core of R136 taken by VLT/SPHERE in two epochs separated by 3.06 years. For the first time, we monitored the dynamics of the detected sources in the core of R136 from a ground-based telescope with adaptive optics. The aim of these observations was to search for High prOper Motion cAndidates (HOMAs) in the central region of R136 (r<6"), where observing has been challenging for other instruments. Two bright sources (K<15 mag and V<16 mag), located near R136a1 and R136c (massive WR stars), have been identified as potential HOMAs. These sources have shifted significantly in the images, both with respect to the mean shift of all reliably detected sources and their neighbours, and by six times their own astrometric errors. We calculate their proper motions to be $1.36\pm0.22$ mas/yr ($321\pm52$ km/s) and $1.15\pm0.11$ mas/yr ($273\pm26$ km/s). We discuss different possible scenarios to explain the magnitude of such extreme proper motions, and argue for the necessity of future observations to determine the nature of HOMAs in the core of R136.

Submitted 26 April, 2021; originally announced April 2021.

Comments: Accepted for publication in A&A (Letter). A catalogue of the reliable-consistent sources is available online via CDS.

Journal ref: A&A 649, L8 (2021)
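The quoted velocities follow from the standard conversion between proper motion and tangential velocity, v_t ≈ 4.74 μ d km/s with μ in arcsec/yr and d in pc. The sketch below checks that arithmetic. It assumes a distance to R136 (the LMC) of about 50 kpc, which is not stated in this listing, and the helper name is hypothetical.

```python
# Proper motion -> tangential velocity: v_t [km/s] = 4.74047 * mu [arcsec/yr] * d [pc]
# (4.74047 km/s corresponds to 1 AU per year).
AU_PER_YR_IN_KM_S = 4.74047

# ASSUMPTION: distance to R136 / the LMC taken as ~50 kpc; not given in the listing.
ASSUMED_DISTANCE_PC = 50_000.0


def tangential_velocity_km_s(mu_mas_per_yr: float, distance_pc: float) -> float:
    """Hypothetical helper: convert a proper motion in mas/yr to km/s at a given distance."""
    mu_arcsec_per_yr = mu_mas_per_yr / 1000.0
    return AU_PER_YR_IN_KM_S * mu_arcsec_per_yr * distance_pc


# Proper motions and 1-sigma errors quoted in the abstract; the conversion is linear,
# so the error scales with the same factor.
for mu, err in ((1.36, 0.22), (1.15, 0.11)):
    v = tangential_velocity_km_s(mu, ASSUMED_DISTANCE_PC)
    dv = tangential_velocity_km_s(err, ASSUMED_DISTANCE_PC)
    print(f"{mu} +/- {err} mas/yr -> {v:.0f} +/- {dv:.0f} km/s")
```

With that assumed distance the conversion lands within a few km/s of the quoted 321 and 273 km/s; the small residual reflects the exact distance adopted by the authors, which is not given here.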

arXiv:2104.08765 [pdf, other]

Improving Neural Model Performance through Natural Language Feedback on Their Explanations

Authors: Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Yiming Yang, Peter Clark, Keisuke Sakaguchi, Ed Hovy

Abstract: A class of explainable NLP models for reasoning tasks supports its decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIE, an interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language. Our approach generates graphs that have 40% fewer inconsistencies compared with the off-the-shelf system. Further, simply appending the corrected explanation structures to the output leads to a gain of 1.2 accuracy points on defeasible reasoning across all three domains. We release a dataset of over 450k graphs for defeasible reasoning generated by our system at https://tinyurl.com/mercurie.

Submitted 18 April, 2021; originally announced April 2021.

arXiv:2104.08661 [pdf, other]

Explaining Answers with Entailment Trees

Authors: Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, Peter Clark

Abstract: Our goal, in the context of open-domain textual question-answering (QA), is to explain answers by showing the line of reasoning from what is known to the answer, rather than simply showing a fragment of textual evidence (a "rationale"). If this could be done, new opportunities for understanding and debugging the system's reasoning would become possible. Our approach is to generate explanations in the form of entailment trees, namely a tree of multipremise entailment steps from facts that are known, through intermediate conclusions, to the hypothesis of interest (namely the question + answer). To train a model with this skill, we created ENTAILMENTBANK, the first dataset to contain multistep entailment trees. Given a hypothesis (question + answer), we define three increasingly difficult explanation tasks: generate a valid entailment tree given (a) all relevant sentences, (b) all relevant and some irrelevant sentences, or (c) a corpus. We show that a strong language model can partially solve these tasks, in particular when the relevant sentences are included in the input (e.g., 35% of trees for (a) are perfect), and with indications of generalization to other domains. This work is significant as it provides a new type of dataset (multistep entailments) and baselines, offering a new avenue for the community to generate richer, more systematic explanations.

Submitted 28 May, 2022; v1 submitted 17 April, 2021; originally announced April 2021.

Comments: published in EMNLP 2021
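An entailment tree as described above can be held as a mapping from each conclusion to the premises that entail it, with the leaves being known facts. The sketch below is a hypothetical representation (not the EntailmentBank file format) that collects a tree's leaf facts and checks that they all come from a supplied set of sentences, as in setting (a). It assumes the tree is acyclic, and the example sentences are invented for illustration.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical representation: each conclusion maps to the premises that entail it.
# Any sentence with no entry in the mapping is treated as a leaf (a known fact).
EntailmentTree = Dict[str, List[str]]


def leaf_facts(tree: EntailmentTree, node: str) -> Set[str]:
    """Collect the leaf facts under `node` (assumes the tree is acyclic)."""
    if node not in tree:
        return {node}
    leaves: Set[str] = set()
    for premise in tree[node]:
        leaves |= leaf_facts(tree, premise)
    return leaves


def is_grounded(tree: EntailmentTree, hypothesis: str, sentences: Iterable[str]) -> bool:
    """True if every leaf of the tree is one of the supplied sentences."""
    return leaf_facts(tree, hypothesis) <= set(sentences)


# Invented example: hypothesis <- (intermediate conclusion, fact); intermediate <- two facts.
HYPOTHESIS = "the iron nail conducts electricity"
INTERMEDIATE = "the nail is made of a metal"
tree: EntailmentTree = {
    HYPOTHESIS: [INTERMEDIATE, "metals conduct electricity"],
    INTERMEDIATE: ["the nail is made of iron", "iron is a metal"],
}
print(sorted(leaf_facts(tree, HYPOTHESIS)))
```

Extra, irrelevant sentences in the supplied set (setting (b)) do not affect the check, since only the tree's own leaves must be covered.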

arXiv:2104.08401 [pdf, ps, other]

Enriching a Model's Notion of Belief using a Persistent Memory

Authors: Nora Kassner, Oyvind Tafjord, Hinrich Schutze, Peter Clark

Abstract: Although pretrained language models (PTLMs) have been shown to contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after using specialized training techniques to reduce inconsistency. As a result, it can be hard to identify what the model actually "believes" about the world. Our goal is to reduce this problem, so systems are more globally consistent and accurate in their answers. Our approach is to add a memory component, a BeliefBank, that records a model's answers, and two mechanisms that use it to improve consistency among beliefs. First, a reasoning component, a weighted SAT solver, improves consistency by flipping answers that significantly clash with others. Second, a feedback component re-queries the model but uses known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms improve both accuracy and consistency. This is significant as it is a first step towards endowing models with an evolving memory, allowing them to construct a more coherent picture of the world.

Submitted 7 October, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: This is an old and now obsolete draft. See arXiv:2109.14723 ("BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief") for the final paper

arXiv:2104.08251 [pdf, other]

proScript: Partially Ordered Scripts Generation via Pre-trained Language Models

Authors: Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, Yejin Choi

Abstract: Scripts, standardized event sequences describing typical everyday activities, have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models (LMs) can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collected a large (6.4k) crowdsourced dataset of partially ordered scripts (named proScript), which is substantially larger than prior datasets, and developed models that generate scripts by combining language generation and structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 in task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human-level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.

Submitted 16 April, 2021; originally announced April 2021.
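A partial-order script can be viewed as a directed acyclic graph over events, where an edge (x, y) means x must happen before y; a valid script admits at least one linear ordering of its events. The sketch below uses Kahn's topological sort to check this and return one compatible order. The event names are hypothetical, and this is not the proScript data format or evaluation code.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Tuple


def linear_order(events: Sequence[str], edges: Sequence[Tuple[str, str]]) -> Optional[List[str]]:
    """Kahn's algorithm: return one order consistent with the (before, after) edges,
    or None if the edges contain a cycle and so do not form a valid partial order."""
    indegree: Dict[str, int] = {e: 0 for e in events}
    successors: Dict[str, List[str]] = {e: [] for e in events}
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1

    ready = deque(e for e in events if indegree[e] == 0)  # no pending predecessors
    order: List[str] = []
    while ready:
        event = ready.popleft()
        order.append(event)
        for nxt in successors[event]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order if len(order) == len(events) else None


# Hypothetical "bake a cake" script: preheating and mixing may happen in either order.
events = ["preheat oven", "mix batter", "pour batter into pan", "bake"]
edges = [("mix batter", "pour batter into pan"),
         ("preheat oven", "bake"),
         ("pour batter into pan", "bake")]
print(linear_order(events, edges))
```

Because "preheat oven" and "mix batter" are unordered relative to each other, several linear orders are valid; the function returns just one of them.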

arXiv:2104.00814 [pdf, other]

CURIE: An Iterative Querying Approach for Reasoning About Situations

Authors: Dheeraj Rajagopal, Aman Madaan, Niket Tandon, Yiming Yang, Shrimai Prabhumoye, Abhilasha Ravichander, Peter Clark, Eduard Hovy

Abstract: Recently, models have been shown to predict the effects of unexpected situations, e.g., would cloudy skies help or hinder plant growth? Given a context, the goal of such situational reasoning is to elicit the consequences of a new situation (st) that arises in that context. We propose CURIE, a method to iteratively build a graph of relevant consequences explicitly in a structured situational graph (st-graph) using natural language queries over a finetuned language model (M). Across multiple domains, CURIE generates st-graphs that humans find relevant and meaningful in eliciting the consequences of a new situation. We show that st-graphs generated by CURIE improve a situational reasoning end task (WIQA-QA) by 3 accuracy points simply by augmenting its input with our generated situational graphs, especially for a hard subset that requires background knowledge and multi-hop reasoning.

Submitted 5 April, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: This paper builds upon EIGEN (arXiv:2010.11764) and proposes a general framework for situational reasoning

arXiv:2102.05972 [pdf, other]

doi 10.1093/mnras/stab388

High contrast and resolution near infrared photometry of the core of R136

Authors: Zeinab Khorrami, Maud Langlois, Paul C. Clark, Farrokh Vakili, Anne S. M. Buckner, Marta Gonzalez, Paul Crowther, Richard Wunsch, Jan Palous, Stuart Lumsden, Estelle Moraux

Abstract: We present the sharpest and deepest near infrared photometric analysis of the core of R136, a newly formed massive star cluster at the centre of the 30 Doradus star forming region in the Large Magellanic Cloud. We used the extreme adaptive optics of the SPHERE focal instrument implemented on the ESO Very Large Telescope and operated in its IRDIS imaging mode, for the second time, with a longer exposure time in the H and K filters. Our aim was (i) to increase the number of resolved sources in the core of R136, and (ii) to compare with the first epoch to classify the properties of the sources detected in both epochs. Within the field of view (FOV) of 10.8"x12.1" (2.7 pc x 3.0 pc), we detected 1499 sources in both the H and K filters, 76% of which have visual companions closer than 0.2". The larger number of detected sources enabled us to better sample the mass function (MF). The MF slopes are estimated at ages of 1, 1.5 and 2 Myr, at different radii, and for different mass ranges. The MF slopes for the mass range of 10-300 solar masses are about 0.3 dex steeper than for the mass range of 3-300 solar masses, for the whole FOV and at different radii. Comparing the JHK colours of 790 sources common to the two epochs, we find that 67% of the sources detected in the outer region (r > 3") are not consistent with evolutionary models at 1-2 Myr and with extinctions similar to the average cluster value, suggesting an origin from ongoing star formation within 30 Doradus, unrelated to R136.

Submitted 11 February, 2021; originally announced February 2021.

Comments: 24 pages, 20 figures, 6 tables. Accepted for publication in MNRAS

arXiv:2102.03315 [pdf, other]

Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge

Authors: Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Peter Clark

Abstract: We present the ARC-DA dataset, a direct-answer ("open response", "freeform") version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple-choice formats can be particularly susceptible to artifacts. The ARC-DA dataset addresses these concerns by converting questions to direct-answer format using a combination of crowdsourcing and expert review. The resulting dataset contains 2985 questions with a total of 8436 valid answers (questions typically have more than one valid answer). ARC-DA is one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves. We describe the conversion approach taken, appropriate evaluation metrics, and several strong models. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We hope the dataset spurs further advances in complex question-answering by the community. ARC-DA is available at https://allenai.org/data/arc-da.

Submitted 5 February, 2021; originally announced February 2021.
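Because ARC-DA questions usually have several valid answers, a token-overlap F1 can be computed against each reference answer and the best match kept. The sketch below follows that common SQuAD-style convention; whether ARC-DA's reported F1 uses exactly this tokenization and max-over-references rule is an assumption, and the example answers are invented.

```python
from collections import Counter
from typing import Iterable


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def multi_reference_f1(prediction: str, references: Iterable[str]) -> float:
    """Score against every valid answer and keep the best match (assumed convention)."""
    return max((token_f1(prediction, ref) for ref in references), default=0.0)


# Invented example with two valid answers.
print(multi_reference_f1("light from the sun", ["sunlight", "energy from the sun"]))
```

Taking the maximum means a prediction is not penalized for failing to match the other valid phrasings of the same answer.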

arXiv:2012.13048 [pdf, other]

ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

Authors: Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark

Abstract: Transformers have been shown to emulate logical deduction over natural language theories (logical rules expressed in natural language), reliably assigning true/false labels to candidate implications. However, their ability to generate implications of a theory has not yet been demonstrated, and methods for reconstructing proofs of answers are imperfect. In this work we show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them. In particular, iterating a 1-step implication generator results in proofs that are highly reliable and represent actual model decisions (rather than post-hoc rationalizations). On the RuleTaker dataset, the accuracy of ProofWriter's proofs exceeds that of previous methods by +9% absolute, in a way that generalizes to proof depths unseen in training and to out-of-domain problems. We also show that generative techniques can perform a type of abduction with high precision: given a theory and an unprovable conclusion, identify a missing fact that allows the conclusion to be proved, along with a proof. These results significantly improve the viability of neural methods for systematically reasoning over natural language.

Submitted 3 June, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

Comments: Findings of ACL 2021
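The idea of iterating a 1-step implication generator until nothing new is produced can be illustrated with a purely symbolic stand-in. In the sketch below, the hypothetical one_step function plays the role that ProofWriter's neural generator plays (it is simple rule matching, not a language model), and each derived fact carries the proof that produced it. The rules are invented for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A rule is (premises, conclusion); facts map each known statement to its proof string.
Rule = Tuple[FrozenSet[str], str]


def one_step(facts: Dict[str, str], rules: List[Rule]) -> Dict[str, str]:
    """Hypothetical 1-step generator: every conclusion reachable by a single rule
    application from the current facts, together with its proof."""
    new: Dict[str, str] = {}
    for premises, conclusion in rules:
        if conclusion not in facts and all(p in facts for p in premises):
            support = " & ".join(f"({facts[p]})" for p in sorted(premises))
            new[conclusion] = f"{support} -> {conclusion}"
    return new


def forward_chain(base_facts: Iterable[str], rules: List[Rule]) -> Tuple[Dict[str, str], int]:
    """Iterate the 1-step generator until no new facts appear; return facts and depth."""
    facts = {fact: fact for fact in base_facts}  # a base fact is its own proof
    depth = 0
    while True:
        new = one_step(facts, rules)
        if not new:
            return facts, depth
        facts.update(new)
        depth += 1


# Invented theory: one base fact and two rules that chain together.
rules: List[Rule] = [
    (frozenset({"Bob is big"}), "Bob is strong"),
    (frozenset({"Bob is strong"}), "Bob can lift the box"),
]
facts, depth = forward_chain(["Bob is big"], rules)
print(depth, facts["Bob can lift the box"])
```

Since every proof is assembled from the steps actually taken, it reflects the decisions made rather than a post-hoc justification, which is the property the abstract highlights for iterated 1-step generation.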

arXiv:2012.05919 [pdf, other]

doi 10.1093/mnras/stab1683

Simulations of the star-forming molecular gas in an interacting M51-like galaxy: cloud population statistics

Authors: Robin G. Tress, Mattia C. Sormani, Rowan J. Smith, Simon C. O. Glover, Ralf S. Klessen, Mordecai-Mark Mac Low, Paul Clark, Ana Duarte-Cabral

Abstract: To investigate how molecular clouds react to different environmental conditions at a galactic scale, we present a catalogue of giant molecular clouds resolved down to masses of $\sim 10$ M$_{\odot}$ from a simulation of the entire disc of an interacting M51-like galaxy and a comparable isolated galaxy. Our model includes time-dependent gas chemistry, sink particles for star formation and supernova feedback, meaning we are not reliant on star formation recipes based on threshold densities and can follow the physics of the cold molecular phase. We extract giant molecular clouds at a given timestep of the simulations and analyse their properties. In the disc of our simulated galaxies, spiral arms seem to act merely as snowplows, gathering gas and clouds without dramatically affecting their properties. In the centre of the galaxy, on the other hand, environmental conditions lead to larger, more massive clouds. While the galaxy interaction has little effect on cloud masses and sizes, it does promote the formation of counter-rotating clouds. We find that the identified clouds seem to be largely gravitationally unbound at first glance, but a closer analysis of the hierarchical structure of the molecular interstellar medium shows that there is a large range of virial parameters with a smooth transition from unbound to mostly bound for the densest structures. The common observation that clouds appear to be virialised entities may therefore be due to bright CO emission highlighting a specific level in this hierarchical binding sequence.
The small fraction of gravitationally bound structures found suggests that low galactic star formation efficiencies may be set by the process of cloud formation and initial collapse.

Submitted 28 June, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

Comments: 22 pages, 26 figures, 2 tables. Properties of the clouds in the catalog are provided as a supplementary file

arXiv:2011.10574 [pdf, other]

doi 10.1051/0004-6361/202038123

S2D2: Small-scale Significant substructure DBSCAN Detection I. NESTs detection in 2D star-forming regions

Authors: Marta González, Isabelle Joncour, Anne S. M. Buckner, Zeinhab Khorrami, Estelle Moraux, Stuart L. Lumsden, Paul Clark, René D. Oudmaijer, José Manuel Blanco, Ignacio de la Calle, José María Herrera-Fernandez, Jesús J. Salgado, Luis Valero-Martín, Zoe Torres, Álvaro Hacar, Ana Ulla

Abstract: The spatial and dynamical structure of star-forming regions can help provide insights into stellar formation patterns. The amount of data from current and upcoming surveys calls for robust and objective procedures to detect structure, so that the results can be statistically analysed and different regions compared. We provide the community with a tool able to detect small-scale significant structure, above random expectation, in star-forming regions, which could be the imprint of the stellar formation process. The tool makes use of the one-point correlation function and of nearest-neighbour statistics to determine the parameters for the DBSCAN algorithm. The procedure successfully detects significant small-scale substructures in heterogeneous regions, fulfilling the goals it was designed for and providing very reliable structures. The analysis of regions close to complete spatial randomness ($Q \in [0.7,0.87]$) shows that, even when some structure is present and recovered, it is hardly distinguishable from spurious detection in homogeneous regions due to projection effects. Interpretation should thus be done with care. For concentrated regions, we detect a main structure surrounded by smaller ones, corresponding to the core plus some Poisson fluctuations around it. We argue that these structures do not correspond to the small compact regions we are looking for. In some realistic cases, a more complete hierarchical, multi-scale analysis would be needed to capture the complexity of the region.
We have developed implementations of our procedure and a catalogue of the NESTs (Nested Elementary STructures) it detects in four star-forming regions (Taurus, IC 348, Upper Scorpius, and Carina); both are publicly available to the community. Implementations of the 3D and up to 6D versions of the procedure, including proper motions, are in progress and will be provided as future work.

Submitted 20 November, 2020; originally announced November 2020.

Comments: 24 pages, 21 figures. Accepted for publication in A&A. Abstract abridged

Journal ref: A&A 647, A14 (2021)
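For readers unfamiliar with DBSCAN, the sketch below is a minimal, unoptimised version of the standard algorithm on 2D points. S2D2 derives its eps and min_pts parameters from the one-point correlation function and nearest-neighbour statistics; here they are simply fixed by hand (an assumption), and the points are a made-up toy example rather than stellar positions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
NOISE = -1


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Minimal O(n^2) DBSCAN. Returns a cluster label per point, or NOISE (-1).
    min_pts counts the point itself when deciding whether a point is a core point."""

    def neighbours(i: int) -> List[int]:
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels: List[Optional[int]] = [None] * len(points)  # None = not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        k = 0
        while k < len(seeds):  # grow the cluster outward from core points
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is itself a core point
                seeds.extend(j_neighbours)
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


# Toy data: two tight groups and one isolated point (hypothetical parameters).
toy = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (10.0, 10.0)]
print(dbscan(toy, eps=0.2, min_pts=3))
```

The result is very sensitive to eps and min_pts, which is why an objective, data-driven way of choosing them, as S2D2 proposes, matters for comparing regions.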

arXiv:2011.08092 [pdf, other]

A Dataset for Tracking Entities in Open Domain Procedural Text

Authors: Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi Mishra, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard Hovy

Abstract: We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and entities involved, and ask how those entities change for just a small, pre-defined set of attributes (e.g., location), limiting their fidelity. Our solution is a new task formulation where, given just a procedural text as input, the task is to generate a set of state change tuples (entity, attribute, before-state, after-state) for each step, where the entity, attribute, and state values must be predicted from an open vocabulary. Using crowdsourcing, we create OPENPI, a high-quality (91.5% coverage as judged by humans and completely vetted) and large-scale dataset comprising 29,928 state changes over 4,050 sentences from 810 procedural real-world paragraphs from WikiHow.com. A current state-of-the-art generation model on this task achieves 16.1% F1 based on the BLEU metric, leaving ample room for novel model architectures.

Submitted 30 October, 2020; originally announced November 2020.

Comments: To appear in EMNLP 2020
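The task output described above is a set of (entity, attribute, before-state, after-state) tuples per step. The sketch below shows one hypothetical way to hold such tuples and to compare predictions with gold annotations by exact-match F1. The paper's reported F1 is BLEU-based, so exact matching here is a deliberate simplification, and the attribute names in the example are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)  # frozen makes instances hashable, so they can go in sets
class StateChange:
    """One (entity, attribute, before-state, after-state) tuple for a step."""
    entity: str
    attribute: str
    before: str
    after: str


def exact_match_f1(predicted: Iterable[StateChange], gold: Iterable[StateChange]) -> float:
    """Set-level F1 with exact tuple matching (a simplification of BLEU-based matching)."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one step of the fog-removal example.
gold = [StateChange("car window", "appearance", "foggy", "sticky"),
        StateChange("car window", "transparency", "opaque", "clear")]
predicted = [StateChange("car window", "appearance", "foggy", "sticky")]
print(exact_match_f1(predicted, gold))
```

With an open vocabulary, exact matching is harsh (a paraphrased attribute such as "look" versus "appearance" scores zero), which is one motivation for a softer, BLEU-based comparison.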

arXiv:2011.02582 [pdf, other]

doi 10.1093/mnras/staa3470

The Cloud Factory II: Gravoturbulent Kinematics of Resolved Molecular Clouds in a Galactic Potential

Authors: Andres F. Izquierdo, Rowan J. Smith, Simon C. O. Glover, Ralf S. Klessen, Robin G. Tress, Mattia C. Sormani, Paul C. Clark, Ana Duarte-Cabral, Catherine Zucker

Abstract: We present a statistical analysis of the gravoturbulent velocity fluctuations in molecular cloud complexes extracted from our "Cloud Factory" galactic-scale ISM simulation suite. For this purpose, we produce non-LTE $^{12}$CO J=1-0 synthetic observations and apply the Principal Component Analysis (PCA) reduction technique to a representative sample of cloud complexes. The velocity fluctuations are self-consistently generated by different physical mechanisms at play in our simulations, which include galactic-scale forces, gas self-gravity, and supernova feedback. The statistical analysis suggests that, even though purely gravitational effects are necessary to reproduce standard observational laws, they are not sufficient in most cases. We show that the extra injection of energy from supernova explosions plays a key role in establishing the global turbulent field and the local dynamics and morphology of molecular clouds. Additionally, we characterise structure function scaling parameters as a result of cloud environmental conditions: some of the complexes are immersed in diffuse (inter-arm) or dense (spiral-arm) environments, and others are influenced by embedded or external supernovae. In quiescent regions, we obtain time-evolving trajectories of scaling parameters driven by gravitational collapse and supersonic turbulent flows. Our findings suggest that a PCA-based statistical study is a robust method to diagnose the physical mechanisms that drive the gravoturbulent properties of molecular clouds.
Also, we present a new open-source module, PCAFACTORY, which performs PCA to extract velocity structure functions from simulated or real ISM data in a user-friendly way. Software DOI: 10.5281/zenodo.3822718

Submitted 4 November, 2020; originally announced November 2020.

Comments: 29 pages, 15 figures, 4 tables. Accepted for publication in MNRAS

arXiv:2010.15446 [pdf, other]

Progressive Voice Trigger Detection: Accuracy vs Latency

Authors: Siddharth Sigtia, John Bridle, Hywel Richards, Pascal Clark, Erik Marchi, Vineet Garg

Abstract: We present an architecture for voice trigger detection for virtual assistants. The main idea in this work is to exploit information in words that immediately follow the trigger phrase. We first demonstrate that by including more audio context after a detected trigger phrase, we can indeed get a more accurate decision. However, waiting to listen to more audio each time incurs a latency increase. Progressive Voice Trigger Detection allows us to trade off latency and accuracy by accepting clear trigger candidates quickly, but waiting for more context to decide whether to accept more marginal examples. Using a two-stage architecture, we show that by delaying the decision for just 3% of detected true triggers in the test set, we are able to obtain a relative improvement of 66% in false rejection rate, while incurring only a negligible increase in latency.

Submitted 2 March, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

Comments: Camera Ready Version: ICASSP 2021

arXiv:2010.05145

DOI: 10.3847/1538-3881/abbffb

Optical night sky brightness measurements from the stratosphere

Authors: Ajay Gill, Steven J. Benton, Anthony M. Brown, Paul Clark, Christopher J. Damaren, Tim Eifler, Aurelien A. Fraisse, Mathew N. Galloway, John W. Hartley, Bradley Holder, Eric M. Huff, Mathilde Jauzac, William C. Jones, David Lagattuta, Jason S.-Y. Leung, Lun Li, Thuy Vy T. Luu, Richard J. Massey, Jacqueline McCleary, James Mullaney, Johanna M. Nagy, C. Barth Netterfield, Susan Redmond, Jason D. Rhodes, L. Javier Romualdez, et al. (5 additional authors not shown)

Abstract: This paper presents optical night sky brightness measurements from the stratosphere using CCD images taken with the Super-pressure Balloon-borne Imaging Telescope (SuperBIT). The data used for estimating the backgrounds were obtained during three commissioning flights in 2016, 2018, and 2019 at altitudes ranging from 28 km to 34 km above sea level. For a valid comparison of the brightness measurements from the stratosphere with measurements from mountain-top ground-based observatories (taken at zenith on the darkest moonless night at high Galactic and high ecliptic latitudes), zodiacal light and diffuse Galactic light were subtracted from the stratospheric brightness levels, and the airglow brightness was projected to zenith. The stratospheric brightness was measured around 5.5 hours, 3 hours, and 2 hours before local sunrise in 2016, 2018, and 2019, respectively. The $B$, $V$, $R$, and $I$ brightness levels in 2016 were 2.7, 1.0, 1.1, and 0.6 mag arcsec$^{-2}$ darker than the darkest ground-based measurements. The $B$, $V$, and $R$ brightness levels in 2018 were 1.3, 1.0, and 1.3 mag arcsec$^{-2}$ darker than the darkest ground-based measurements. The $U$ and $I$ brightness levels in 2019 were 0.1 mag arcsec$^{-2}$ brighter than the darkest ground-based measurements, whereas the $B$ and $V$ brightness levels were 0.8 and 0.6 mag arcsec$^{-2}$ darker.
The lower sky brightness levels, stable photometry, and lower atmospheric absorption make stratospheric observations from a balloon-borne platform a unique tool for astronomy. We plan to continue this work in a future mid-latitude long-duration balloon flight with SuperBIT.
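Magnitude offsets are logarithmic, so it can help to convert them into linear brightness factors. The sketch below applies the standard Pogson relation, F1/F2 = 10**(0.4 * (m2 - m1)), which holds per unit solid angle for surface brightness in mag arcsec^-2. The helper name `surface_brightness_ratio` is hypothetical, and the 2016 offsets are taken from the abstract above.

```python
def surface_brightness_ratio(delta_mag: float) -> float:
    """Linear factor by which a sky `delta_mag` mag arcsec^-2 darker is fainter.

    Pogson relation: F_ref / F = 10**(0.4 * delta_mag). A negative delta_mag
    (a brighter sky) gives a factor below 1.
    """
    return 10 ** (0.4 * delta_mag)


# 2016 SuperBIT offsets relative to the darkest ground-based sites (mag arcsec^-2)
offsets_2016 = {"B": 2.7, "V": 1.0, "R": 1.1, "I": 0.6}
for band, dm in offsets_2016.items():
    print(f"{band}: {surface_brightness_ratio(dm):.1f}x darker")
```

For example, the 2.7 mag arcsec^-2 offset in $B$ corresponds to a sky about 12 times fainter than the darkest ground-based measurement, while the 0.1 mag arcsec^-2 brighter $U$ and $I$ levels in 2019 correspond to a factor of roughly 0.91.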

Submitted 10 October, 2020; originally announced October 2020.

Comments: 17 pages, 7 figures. Accepted for publication in the Astronomical Journal

Journal ref: The Astronomical Journal, Volume 160, Number 6, 2020

arXiv:2010.03274

Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering

Authors: Harsh Jhamtani, Peter Clark

Abstract: Despite the rapid progress in multihop question-answering (QA), models still have trouble explaining why an answer is correct, and little explanation training data is available to learn from. To address this, we introduce three explanation datasets in which explanations formed from corpus facts are annotated. Our first dataset, eQASC, contains over 98K explanation annotations for the multihop question answering dataset QASC, and is the first that annotates multiple candidate explanations for each answer. The second dataset, eQASC-perturbed, is constructed by crowd-sourcing perturbations (while preserving their validity) of a subset of explanations in QASC, to test the consistency and generalization of explanation prediction models. The third dataset, eOBQA, is constructed by adding explanation annotations to the OBQA dataset to test the generalization of models trained on eQASC. We show that this data can be used to significantly improve explanation quality (+14% absolute F1 over a strong retrieval baseline) using a BERT-based classifier, although performance remains below the upper bound, offering a new challenge for future research. We also explore a delexicalized chain representation in which repeated noun phrases are replaced by variables, thus turning them into generalized reasoning chains (for example: "X is a Y" AND "Y has Z" IMPLIES "X has Z"). We find that generalized chains maintain performance while also being more robust to certain perturbations.
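The delexicalized chain representation can be illustrated with a small sketch. It assumes the noun phrases are already identified (noun-phrase chunking is out of scope) and are plain word sequences. It replaces every phrase that occurs in at least two statements of the chain with a variable, naming variables in order of first appearance. The functions `delexicalize` and `format_chain` are hypothetical and are not the authors' code.

```python
import re
from typing import Dict, List, Tuple


def delexicalize(facts: List[str], conclusion: str,
                 noun_phrases: List[str]) -> Tuple[List[str], str, Dict[str, str]]:
    """Replace noun phrases repeated across a reasoning chain with variables.

    Hypothetical sketch: `noun_phrases` is assumed to be given, and phrases are
    assumed to start and end with word characters so that \\b anchors work.
    """
    statements = facts + [conclusion]

    def pattern_for(phrase: str):
        return re.compile(r"\b" + re.escape(phrase) + r"\b")

    # Keep only phrases that appear in at least two statements of the chain.
    repeated = [p for p in noun_phrases
                if sum(1 for s in statements if pattern_for(p).search(s)) >= 2]
    if not repeated:
        return list(facts), conclusion, {}

    # Name variables X, Y, Z, ... in order of first appearance in the chain.
    joined = " | ".join(statements)
    repeated.sort(key=lambda p: pattern_for(p).search(joined).start())
    letters = "XYZWUV"
    mapping = {p: (letters[i] if i < len(letters) else f"V{i}")
               for i, p in enumerate(repeated)}

    # One regex pass, longest phrase first, so overlapping phrases are handled.
    alternation = "|".join(re.escape(p)
                           for p in sorted(mapping, key=len, reverse=True))
    combined = re.compile(r"\b(" + alternation + r")\b")

    def substitute(text: str) -> str:
        return combined.sub(lambda m: mapping[m.group(1)], text)

    return [substitute(f) for f in facts], substitute(conclusion), mapping


def format_chain(facts: List[str], conclusion: str) -> str:
    """Render a chain in the '"A" AND "B" IMPLIES "C"' style used above."""
    premises = " AND ".join(f'"{f}"' for f in facts)
    return f'{premises} IMPLIES "{conclusion}"'
```

For example, delexicalizing the facts "dog is a mammal" and "mammal has fur" with the conclusion "dog has fur" yields the generalized chain "X is a Y" AND "Y has Z" IMPLIES "X has Z". A phrase that occurs only once is left unchanged, because it does not link two statements.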

Submitted 7 October, 2020; originally announced October 2020.

Comments: EMNLP 2020

arXiv:2009.10509

DOI: 10.1093/mnras/staa2947

SN 2018gjx reveals that some SNe Ibn are SNe IIb exploding in dense circumstellar material

Authors: S. J. Prentice, K. Maguire, I. Boian, J. Groh, J. Anderson, C. Barbarino, K. A. Bostroem, J. Burke, P. Clark, Y. Dong, M. Fraser, L. Galbany, M. Gromadzki, C. P. Gutiérrez, D. A. Howell, D. Hiramatsu, C. Inserra, P. A. James, E. Kankare, H. Kuncarayakti, P. A. Mazzali, C. McCully, T. E. Müller-Bravo, M. Nichol, C. Pellegrino, et al. (5 additional authors not shown)

Abstract: We present the data and analysis of SN 2018gjx, an unusual low-luminosity transient with three distinct spectroscopic phases. Phase I shows a hot blue spectrum with signatures of ionised circumstellar material (CSM), Phase II has the appearance of broad SN features, consistent with those seen in a Type IIb supernova at maximum light, and Phase III is that of a supernova interacting with helium-rich CSM, similar to a Type Ibn supernova. This event provides an apparently rare opportunity to view the inner workings of an interacting supernova. The observed properties can be explained by the explosion of a star in an aspherical CSM. The initial light is emitted from an extended CSM (~4000 R$_\odot$), which ionises the exterior unshocked material. Some days later, the SN photosphere envelops this region, leading to the appearance of a SN IIb. Over time, the photosphere recedes in velocity space, revealing interaction between the supernova ejecta and the CSM that partially obscures the supernova nebular phase. Modelling of the initial spectrum reveals a surface composition consistent with compact H-deficient Wolf-Rayet and LBV stars. Such configurations may not be unusual: SNe IIb are known to show signs of interaction, so at least some SNe IIb and SNe Ibn may be the same phenomenon viewed from different angles or, possibly, with differing CSM configurations.

Submitted 22 September, 2020; originally announced September 2020.

Comments: Accepted for publication in MNRAS
