Showing 1–50 of 93 results for author: Clark, P

Searching in archive cs.
  1. arXiv:2407.01725  [pdf, other]

    cs.CL cs.AI cs.LG

    DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

    Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

    Abstract: Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systemat…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Website: https://github.com/allenai/discoverybench

  2. arXiv:2406.06769  [pdf, other]

    cs.AI cs.CL

    DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

    Authors: Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark

    Abstract: Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's abil…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 9 pages, 4 figures. Preprint, under review

  3. arXiv:2406.06485  [pdf, other]

    cs.CL cs.AI

    Can Language Models Serve as Text-Based World Simulators?

    Authors: Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, Peter Jansen

    Abstract: Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of tex…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: ACL 2024

  4. arXiv:2405.19793  [pdf, other]

    cs.CL

    PDDLEGO: Iterative Planning in Textual Environments

    Authors: Li Zhang, Peter Jansen, Tianyi Zhang, Peter Clark, Chris Callison-Burch, Niket Tandon

    Abstract: Planning in textual environments has been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: In *SEM 2024

  5. arXiv:2405.16337  [pdf, other]

    cs.CL cs.AI

    Learning to Reason via Program Generation, Emulation, and Search

    Authors: Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

    Abstract: Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understand…

    Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

    Comments: 16 pages, 10 figures

  6. arXiv:2403.00092  [pdf, other]

    cs.CL

    PROC2PDDL: Open-Domain Planning Representations from Texts

    Authors: Tianyi Zhang, Li Zhang, Zhaoyi Hou, Ziyu Wang, Yuling Gu, Peter Clark, Chris Callison-Burch, Niket Tandon

    Abstract: Planning in a text-based environment continues to be a major challenge for AI systems. Recent approaches have used language models to predict a planning domain definition (e.g., PDDL) but have only been evaluated in closed-domain simulated environments. To address this, we present Proc2PDDL, the first dataset containing open-domain procedural texts paired with expert-annotated PDDL representation…

    Submitted 2 July, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

    Comments: In NLRSE 2024, the 2nd Natural Language Reasoning and Structured Explanations Workshop

  7. arXiv:2402.14798  [pdf, other]

    cs.CL cs.AI

    Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic

    Authors: Nathaniel Weir, Kate Sanders, Orion Weller, Shreya Sharma, Dongwei Jiang, Zhengping Jiang, Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Jansen, Peter Clark, Benjamin Van Durme

    Abstract: Contemporary language models enable new opportunities for structured reasoning with text, such as the construction and evaluation of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy…

    Submitted 27 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  8. arXiv:2402.13610  [pdf, other]

    cs.CL cs.AI cs.LG

    Data-driven Discovery with Large Generative Models

    Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark

    Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a se…

    Submitted 21 February, 2024; originally announced February 2024.

  9. arXiv:2402.03244  [pdf, other]

    cs.LG cs.CL

    Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills

    Authors: Kolby Nottingham, Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Sameer Singh, Peter Clark, Roy Fox

    Abstract: Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting commo…

    Submitted 22 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

  10. arXiv:2401.06751  [pdf, other]

    cs.CL cs.AI cs.LG

    The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

    Authors: Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe

    Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current pretrained language models often generalize relatively well from…

    Submitted 5 June, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: ACL 2024. 23 pages, 20 figures

  11. arXiv:2312.07527  [pdf, other]

    cs.CL cs.AI

    BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability

    Authors: Peter Clark, Bhavana Dalvi Mishra, Oyvind Tafjord

    Abstract: While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a colle…

    Submitted 23 March, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: Added note about how dataset sampling was performed

  12. arXiv:2311.09613  [pdf, other]

    cs.CL cs.AI

    Digital Socrates: Evaluating LLMs through Explanation Critiques

    Authors: Yuling Gu, Oyvind Tafjord, Peter Clark

    Abstract: While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on…

    Submitted 16 February, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

  13. arXiv:2311.09519  [pdf, other]

    cs.CL

    Leveraging Code to Improve In-context Learning for Semantic Parsing

    Authors: Ben Bogin, Shivanshu Gupta, Peter Clark, Ashish Sabharwal

    Abstract: In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we improve the effectiveness of ICL for semantic parsing by (1) using general-purpose p…

    Submitted 27 March, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  14. arXiv:2311.09510  [pdf, other]

    cs.CL

    Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization

    Authors: Yash Kumar Lal, Li Zhang, Faeze Brahman, Bodhisattwa Prasad Majumder, Peter Clark, Niket Tandon

    Abstract: How-to procedures, such as how to plant a garden, are now used by millions of users, but sometimes need customizing to meet a user's specific needs, e.g., planting a garden without pesticides. Our goal is to measure and improve an LLM's ability to perform such customization. Our approach is to test several simple multi-LLM-agent architectures for customization, as well as an end-to-end LLM, using…

    Submitted 30 May, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Camera ready version accepted to Findings of ACL 2024

  15. arXiv:2311.05772  [pdf, other]

    cs.AI cs.CL cs.LG

    ADaPT: As-Needed Decomposition and Planning with Language Models

    Authors: Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, Tushar Khot

    Abstract: Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment. Recent works employ LLMs-as-agents in broadly two ways: iteratively determining the next action (iterative executors) or generating plans and executing sub-tasks using LLMs (plan-and-execute). However, these methods struggle with task complexity, as the…

    Submitted 8 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 (findings) camera-ready. Project Page: https://allenai.github.io/adaptllm

  16. arXiv:2311.04892  [pdf, other]

    cs.CL

    Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

    Authors: Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot

    Abstract: Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of…

    Submitted 27 January, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: Project page: https://allenai.github.io/persona-bias. Paper to appear at ICLR 2024. Added results for other LLMs in v2 (similar findings)

  17. arXiv:2311.02807  [pdf, other]

    cs.LG cs.AI cs.CL

    QualEval: Qualitative Evaluation for Model Improvement

    Authors: Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, Ashwin Kalyan

    Abstract: Quantitative evaluation metrics have traditionally been pivotal in gauging the advancements of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar to quantify and compare is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a…

    Submitted 5 May, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

    Comments: NAACL 2024

  18. arXiv:2310.10134  [pdf, other]

    cs.CL cs.AI cs.LG

    CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization

    Authors: Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark

    Abstract: Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present C…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Project page: https://allenai.github.io/clin/

  19. arXiv:2305.14596  [pdf, other]

    cs.CL cs.LG

    Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy

    Authors: Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal

    Abstract: When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as "bath" and "bathtub") is thought to cause an underestimation of a model's true performance, referred to as th…

    Submitted 31 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  20. arXiv:2305.14386  [pdf, other]

    cs.LG cs.AI cs.CL

    Let GPT be a Math Tutor: Teaching Math Word Problem Solvers with Customized Exercise Generation

    Authors: Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, Ashwin Kaylan

    Abstract: In this paper, we present a novel approach for distilling math word problem solving capabilities from large language models (LLMs) into smaller, more efficient student models. Our approach is designed to consider the student model's weaknesses and foster a tailored learning experience by generating targeted exercises aligned with educational science principles, such as knowledge tracing and person…

    Submitted 22 May, 2023; originally announced May 2023.

  21. arXiv:2305.14250  [pdf, other]

    cs.CL cs.AI

    Language Models with Rationality

    Authors: Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schuetze, Peter Clark

    Abstract: While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent "beliefs". This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that…

    Submitted 29 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

  22. arXiv:2305.14010  [pdf, other]

    cs.CL

    IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions

    Authors: Wenhao Yu, Meng Jiang, Peter Clark, Ashish Sabharwal

    Abstract: Although counterfactual reasoning is a fundamental aspect of intelligence, the lack of large-scale counterfactual open-domain question-answering (QA) benchmarks makes it difficult to evaluate and improve models on this ability. To address this void, we introduce the first such dataset, named IfQA, where each question is based on a counterfactual presupposition via an "if" clause. For example, if L…

    Submitted 23 May, 2023; originally announced May 2023.

  23. arXiv:2305.08844  [pdf, other]

    cs.CL

    RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

    Authors: Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, Niket Tandon

    Abstract: Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics…

    Submitted 11 July, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  24. arXiv:2303.17651  [pdf, other]

    cs.CL cs.AI cs.LG

    Self-Refine: Iterative Refinement with Self-Feedback

    Authors: Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark

    Abstract: Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it…

    Submitted 25 May, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Code, data, and demo at https://selfrefine.info/

  25. arXiv:2212.10029  [pdf, other]

    cs.CL cs.AI

    Do language models have coherent mental models of everyday things?

    Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark

    Abstract: When people think of everyday things like an egg, they typically have a mental image associated with it. This allows them to correctly judge, for example, that "the yolk surrounds the shell" is a false statement. Do language models similarly have a coherent picture of such everyday things? To investigate this, we propose a benchmark dataset consisting of 100 everyday things, their parts, and the r…

    Submitted 8 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  26. arXiv:2210.17517  [pdf, other]

    cs.CL cs.AI

    Lila: A Unified Benchmark for Mathematical Reasoning

    Authors: Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan

    Abstract: Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities e.g., arithmetic, calculus (ii) language format e.g., q…

    Submitted 8 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

    MSC Class: 68T50; ACM Class: I.2.7

  27. arXiv:2210.16407  [pdf, other]

    cs.CL

    Just-DREAM-about-it: Figurative Language Understanding with DREAM-FLUTE

    Authors: Yuling Gu, Yao Fu, Valentina Pyatkin, Ian Magnusson, Bhavana Dalvi Mishra, Peter Clark

    Abstract: Figurative language (e.g., "he flew like the wind") is challenging to understand, as it is hard to tell what implicit information is being conveyed from the surface form alone. We hypothesize that to perform this task well, the reader needs to mentally elaborate the scene being described to identify a sensible meaning of the language. We present DREAM-FLUTE, a figurative language understanding sys…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted at The Third Workshop on Figurative Language Processing @ EMNLP 2022

  28. arXiv:2210.12217  [pdf, other]

    cs.AI cs.CL

    Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning

    Authors: Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark

    Abstract: Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning. Such a capability would allow better understanding of why a model produced the answer it did. Our approach is to recursively combine a trained backward-chaining model, capable of generating a set of premises entailing an answer hypothesis, with a v…

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: accepted at EMNLP 2022. arXiv admin note: substantial text overlap with arXiv:2204.13074

  29. arXiv:2210.02406  [pdf, other]

    cs.CL

    Decomposed Prompting: A Modular Approach for Solving Complex Tasks

    Authors: Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal

    Abstract: Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by deco…

    Submitted 11 April, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: ICLR'23 Camera Ready

  30. arXiv:2210.00720  [pdf, other]

    cs.CL cs.AI cs.LG

    Complexity-Based Prompting for Multi-Step Reasoning

    Authors: Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot

    Abstract: We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make…

    Submitted 30 January, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: Preprint

  31. arXiv:2209.14610  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

    Authors: Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, Ashwin Kalyan

    Abstract: Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that in…

    Submitted 2 March, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

    Comments: ICLR 2023. 26 pages and 18 figures. The data and code are available at https://promptpg.github.io

  32. arXiv:2209.09513  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.MM

    Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

    Authors: Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan

    Abstract: When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system…

    Submitted 17 October, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: Accepted to NeurIPS 2022. 22 pages, 17 figures, 9 tables. Project: https://scienceqa.github.io

  33. arXiv:2209.07662  [pdf, other]

    cs.CL

    NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning

    Authors: Nathaniel Weir, Peter Clark, Benjamin Van Durme

    Abstract: Our goal is a modern approach to answering questions via systematic reasoning where answers are supported by human interpretable proof trees grounded in an NL corpus of authoritative facts. Such a system would help alleviate the challenges of interpretability and hallucination with modern LMs, and the lack of grounding of current explanation methods (e.g., Chain-of-Thought). This paper proposes a…

    Submitted 21 December, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

  34. arXiv:2204.13074  [pdf, other]

    cs.CL cs.AI

    Towards Teachable Reasoning Systems: Using a Dynamic Memory of User Feedback for Continual System Improvement

    Authors: Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Clark

    Abstract: Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time. Our approach is to augment a QA model with a dynamic memory of user feedback, containing user-supplied corrections to erroneous model beliefs that users identify during interaction. Retrievals from memory ar…

    Submitted 21 October, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

    Comments: accepted at EMNLP 2022

  35. arXiv:2204.09148  [pdf, other]

    cs.CL cs.AI

    What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment

    Authors: Matthew Finlayson, Kyle Richardson, Ashish Sabharwal, Peter Clark

    Abstract: The instruction learning paradigm -- where a model learns to perform new tasks from task descriptions alone -- has become popular in general-purpose model research. The capabilities of large transformer models as instruction learners, however, remain poorly understood. We use a controlled synthetic environment to characterize such capabilities. Specifically, we use the task of deciding whether a g…

    Submitted 24 May, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Typos corrected, rewordings

    MSC Class: 68T50; ACM Class: I.2.7

  36. arXiv:2204.05660  [pdf, other]

    cs.CL cs.AI cs.LG

    NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

    Authors: Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, Ashwin Kalyan

    Abstract: Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when they appear in a slightly different scenario. Drawing inspiration from GLUE that was proposed…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: ACL 2022

  37. arXiv:2201.06009  [pdf, other]

    cs.CL

    Memory-assisted prompt editing to improve GPT-3 after deployment

    Authors: Aman Madaan, Niket Tandon, Peter Clark, Yiming Yang

    Abstract: Large LMs such as GPT-3 are powerful, but can commit mistakes that are obvious to humans. For example, GPT-3 would mistakenly interpret "What word is similar to good?" to mean a homophone, while the user intended a synonym. Our goal is to effectively correct such errors via user interactions with the system but without retraining, which will be prohibitively costly. We pair GPT-3 with a growing me…

    Submitted 18 February, 2023; v1 submitted 16 January, 2022; originally announced January 2022.

    Comments: EMNLP 2022. This version updates the title to be consistent with EMNLP camera ready

  38. arXiv:2112.09737  [pdf, other]

    cs.CL cs.AI

    Learning to Repair: Repairing model output errors after deployment using a dynamic memory of feedback

    Authors: Niket Tandon, Aman Madaan, Peter Clark, Yiming Yang

    Abstract: Large language models (LMs), while powerful, are not immune to mistakes, but can be difficult to retrain. Our goal is for an LM to continue to improve after deployment, without retraining, using feedback from the user. Our approach pairs an LM with (i) a growing memory of cases where the user identified an output error and provided general feedback on how to correct it (ii) a corrector model, trai…

    Submitted 9 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: NAACL 2022 (Findings)

  39. arXiv:2112.08656  [pdf, other]

    cs.CL cs.AI

    DREAM: Improving Situational QA by First Elaborating the Situation

    Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark

    Abstract: When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests that they form a mental picture of that situation before answering. While we do not know how language models (LMs) answer such questions, we conjecture that they may answer more accurately if they are also provided with additional details about the q…

    Submitted 5 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: to be published in NAACL 2022

  40. arXiv:2112.07867  [pdf, other]

    cs.AI

    Interscript: A dataset for interactive learning of scripts through error feedback

    Authors: Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, Yiming Yang

    Abstract: How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, Interscript, containing user feedback o…

    Submitted 15 December, 2021; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: AAAI'22-Workshop on Interactive Machine Learning

  41. arXiv:2110.14207  [pdf, other]

    cs.CL cs.AI

    How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI

    Authors: Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, Peter Clark

    Abstract: Many real-world problems require the combined application of multiple reasoning abilities employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because t…

    Submitted 20 December, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted for publication at EMNLP 2021, 11 pages, 5 tables, 4 figures

  42. arXiv:2110.12349  [pdf, other]

    cs.AI cs.CL

    Think about it! Improving defeasible reasoning by first modeling the question scenario

    Authors: Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Peter Clark, Yiming Yang, Eduard Hovy

    Abstract: Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. Existing cognitive science literature on defeasible reasoning suggests that a person forms a mental model of the problem scenario before answering questions. Our research goal asks whether neural models can similarly benefit from envisioning the question scenario before answering…

    Submitted 24 October, 2021; originally announced October 2021.

    Comments: EMNLP 2021

  43. arXiv:2110.01398  [pdf]

    cs.DC cs.CR

    Enabling Blockchain Scalability and Interoperability with Mobile Computing through LayerOne.X

    Authors: Kevin Coutinho, Ponnie Clark, Ferdinand Azis, Norman Lip, Josh Hunt

    Abstract: Interoperability and scalability are currently the bottlenecks preventing mass adoption of blockchain technology. Development of an interoperable and scalable network that promotes a truly decentralised, permissionless and secure blockchain as well as one that enables micro validation is the main goal of this project. Layer-One.X, a truly decentralised ledger which utilises para-sharding, Directed…

    Submitted 30 September, 2021; originally announced October 2021.

    Comments: 40 pages

  44. arXiv:2109.14723  [pdf, other]

    cs.CL

    BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief

    Authors: Nora Kassner, Oyvind Tafjord, Hinrich Schütze, Peter Clark

    Abstract: Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually "believes" about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our appro…

    Submitted 29 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 Camera Ready. arXiv admin note: substantial text overlap with arXiv:2104.08401

  45. arXiv:2109.02593  [pdf, other]

    cs.CL cs.AI

    General-Purpose Question-Answering with Macaw

    Authors: Oyvind Tafjord, Peter Clark

    Abstract: Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available. In response, we present Macaw, a versatile, generative question-answering (QA) system that we are making available to the community. Macaw is built on UnifiedQA, itself built on T5, and exhibits strong performance, zero-shot, on a wide variety of topics, incl…

    Submitted 6 September, 2021; originally announced September 2021.

  46. arXiv:2104.08765  [pdf, other]

    cs.CL

    Improving Neural Model Performance through Natural Language Feedback on Their Explanations

    Authors: Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Yiming Yang, Peter Clark, Keisuke Sakaguchi, Ed Hovy

    Abstract: A class of explainable NLP models for reasoning tasks support their decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIE - an interactive system that refines its explanations for a given reason…

    Submitted 18 April, 2021; originally announced April 2021.

  47. arXiv:2104.08661  [pdf, other]

    cs.CL cs.AI

    Explaining Answers with Entailment Trees

    Authors: Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, Peter Clark

    Abstract: Our goal, in the context of open-domain textual question-answering (QA), is to explain answers by showing the line of reasoning from what is known to the answer, rather than simply showing a fragment of textual evidence (a "rationale"). If this could be done, new opportunities for understanding and debugging the system's reasoning become possible. Our approach is to generate explanations in the f…

    Submitted 28 May, 2022; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: published in EMNLP 2021

  48. arXiv:2104.08401  [pdf, ps, other]

    cs.CL cs.AI

    Enriching a Model's Notion of Belief using a Persistent Memory

    Authors: Nora Kassner, Oyvind Tafjord, Hinrich Schutze, Peter Clark

    Abstract: Although pretrained language models (PTLMs) have been shown to contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after using specialized training techniques to reduce inconsistency. As a result, it can be hard to identify what the model actually "believes" about the world. Our goal is to reduce this problem, so systems are mo…

    Submitted 7 October, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: This is an old and now obsolete draft. See arXiv:2109.14723 ("BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief") for the final paper

  49. arXiv:2104.08251  [pdf, other]

    cs.CL

    proScript: Partially Ordered Scripts Generation via Pre-trained Language Models

    Authors: Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, Yejin Choi

    Abstract: Scripts - standardized event sequences describing typical everyday activities - have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models (LMs) can be finetuned to g…

    Submitted 16 April, 2021; originally announced April 2021.

  50. arXiv:2104.00814  [pdf, other]

    cs.CL

    CURIE: An Iterative Querying Approach for Reasoning About Situations

    Authors: Dheeraj Rajagopal, Aman Madaan, Niket Tandon, Yiming Yang, Shrimai Prabhumoye, Abhilasha Ravichander, Peter Clark, Eduard Hovy

    Abstract: Recently, models have been shown to predict the effects of unexpected situations, e.g., would cloudy skies help or hinder plant growth? Given a context, the goal of such situational reasoning is to elicit the consequences of a new situation (st) that arises in that context. We propose a method to iteratively build a graph of relevant consequences explicitly in a structured situational graph (st-gr…

    Submitted 5 April, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: This paper builds upon EIGEN (arXiv:2010.11764) and proposes a general framework for situational reasoning