-
On scalable oversight with weak LLMs judging strong LLMs
Authors:
Zachary Kenton,
Noah Y. Siegel,
János Kramár,
Jonah Brown-Cohen,
Samuel Albanie,
Jannis Bulian,
Rishabh Agarwal,
David Lindner,
Yunhao Tang,
Noah D. Goodman,
Rohin Shah
Abstract:
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
Submitted 5 July, 2024;
originally announced July 2024.
-
Assessing Large Language Models on Climate Information
Authors:
Jannis Bulian,
Mike S. Schäfer,
Afra Amini,
Heidi Lam,
Massimiliano Ciaramita,
Ben Gaiarin,
Michelle Chen Hübscher,
Christian Buck,
Niels G. Mede,
Markus Leippold,
Nadine Strauß
Abstract:
As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.
Submitted 28 May, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$
Authors:
Adam Roberts,
Hyung Won Chung,
Anselm Levskaya,
Gaurav Mishra,
James Bradbury,
Daniel Andor,
Sharan Narang,
Brian Lester,
Colin Gaffney,
Afroz Mohiuddin,
Curtis Hawthorne,
Aitor Lewkowycz,
Alex Salcianu,
Marc van Zee,
Jacob Austin,
Sebastian Goodman,
Livio Baldini Soares,
Haitang Hu,
Sasha Tsvyashchenko,
Aakanksha Chowdhery,
Jasmijn Bastings,
Jannis Bulian,
Xavier Garcia,
Jianmo Ni,
Andrew Chen
, et al. (18 additional authors not shown)
Abstract:
Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: $\texttt{t5x}$ simplifies the process of building and training large language models at scale while maintaining ease of use, and $\texttt{seqio}$ provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
Along with the libraries, we release configurations and instructions for T5-like encoder-decoder models as well as GPT-like decoder-only architectures.
$\texttt{t5x}$ and $\texttt{seqio}$ are open source and available at https://github.com/google-research/t5x and https://github.com/google/seqio, respectively.
Submitted 31 March, 2022;
originally announced March 2022.
-
Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation
Authors:
Jannis Bulian,
Christian Buck,
Wojciech Gajewski,
Benjamin Boerschinger,
Tal Schuster
Abstract:
The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with pre-defined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures.
To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question.
Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and to more accurately reflect the performance of systems.
Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to 2.6x.
Submitted 26 October, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
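The exact match (EM) and token-level F1 measures that the abstract above critiques can be sketched as follows. This is a minimal illustration assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the function names `normalize`, `exact_match`, and `token_f1` are hypothetical and are not taken from the paper or its released code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the normalization commonly used for SQuAD-style
    evaluation; the paper's exact preprocessing may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the bag of tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Note that `token_f1` never looks at the question: a prediction of "Paris France" against the reference "Paris" receives the same partial credit regardless of whether the extra token makes the answer better or worse, which is the kind of limitation the abstract describes.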
-
Fool Me Twice: Entailment from Wikipedia Gamification
Authors:
Julian Martin Eisenschlos,
Bhuwan Dhingra,
Jannis Bulian,
Benjamin Börschinger,
Jordan Boyd-Graber
Abstract:
We release FoolMeTwice (FM2 for short), a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players "pay" to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks. We open source the dataset and the game code.
Submitted 10 April, 2021;
originally announced April 2021.
-
CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims
Authors:
Thomas Diggelmann,
Jordan Boyd-Graber,
Jannis Bulian,
Massimiliano Ciaramita,
Markus Leippold
Abstract:
We introduce CLIMATE-FEVER, a new publicly available dataset for verification of climate change-related claims. By providing a dataset for the research community, we aim to facilitate and encourage work on improving algorithms for retrieving evidential support for climate-specific claims, addressing the underlying language understanding challenges, and ultimately helping to alleviate the impact of misinformation on climate change. We adapt the methodology of FEVER [1], the largest dataset of artificially designed claims, to real-life claims collected from the Internet. Although we could rely on the expertise of renowned climate scientists during this process, it turned out to be no easy task. We discuss the surprising, subtle complexity of modeling real-world climate-related claims within the FEVER framework, which we believe provides a valuable challenge for general natural language understanding. We hope that our work will mark the beginning of a new exciting long-term joint effort by the climate science and AI community.
Submitted 2 January, 2021; v1 submitted 1 December, 2020;
originally announced December 2020.
-
Meta Answering for Machine Reading
Authors:
Benjamin Borschinger,
Jordan Boyd-Graber,
Christian Buck,
Jannis Bulian,
Massimiliano Ciaramita,
Michelle Chen Huebscher,
Wojciech Gajewski,
Yannic Kilcher,
Rodrigo Nogueira,
Lierni Sestorain Saralegu
Abstract:
We investigate a framework for machine reading, inspired by real world information-seeking problems, where a meta question answering system interacts with a black box environment. The environment encapsulates a competitive machine reader based on BERT, providing candidate answers to questions, and possibly some context. To validate the realism of our formulation, we ask humans to play the role of a meta-answerer. With just a small snippet of text around an answer, humans can outperform the machine reader, improving recall. Similarly, a simple machine meta-answerer outperforms the environment, improving both precision and recall on the Natural Questions dataset. The system relies on joint training of answer scoring and the selection of conditioning information.
Submitted 30 April, 2020; v1 submitted 11 November, 2019;
originally announced November 2019.
-
Learning to Coordinate Multiple Reinforcement Learning Agents for Diverse Query Reformulation
Authors:
Rodrigo Nogueira,
Jannis Bulian,
Massimiliano Ciaramita
Abstract:
We propose a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering. In the proposed framework an agent consists of multiple specialized sub-agents and a meta-agent that learns to aggregate the answers from sub-agents to produce a final answer. Sub-agents are trained on disjoint partitions of the training data, while the meta-agent is trained on the full training set. Our method makes learning faster, because it is highly parallelizable, and has better generalization performance than strong baselines, such as an ensemble of agents trained on the full data. We show that the improved performance is due to the increased diversity of reformulation strategies.
Submitted 25 December, 2018; v1 submitted 27 September, 2018;
originally announced September 2018.
-
Analyzing Language Learned by an Active Question Answering Agent
Authors:
Christian Buck,
Jannis Bulian,
Massimiliano Ciaramita,
Wojciech Gajewski,
Andrea Gesmundo,
Neil Houlsby,
Wei Wang
Abstract:
We analyze the language learned by an agent trained with reinforcement learning as a component of the ActiveQA system [Buck et al., 2017]. In ActiveQA, question answering is framed as a reinforcement learning task in which an agent sits between the user and a black box question-answering system. The agent learns to reformulate the user's questions to elicit the optimal answers. It probes the system with many versions of a question that are generated via a sequence-to-sequence question reformulation model, then aggregates the returned evidence to find the best answer. This process is an instance of machine-machine communication. The question reformulation model must adapt its language to increase the quality of the answers returned, matching the language of the question answering system. We find that the agent does not learn transformations that align with semantic intuitions but instead discovers, through learning, classical information retrieval techniques such as tf-idf re-weighting and stemming.
Submitted 23 January, 2018;
originally announced January 2018.
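For readers unfamiliar with the tf-idf re-weighting mentioned in the abstract above, the idea can be sketched in a few lines. This is a generic textbook illustration, not the agent's learned behaviour or any code from the ActiveQA system; the function name `tf_idf_weights` and the choice of a smoothed idf variant are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_weights(query: str, corpus: list[str]) -> dict[str, float]:
    """Weight each query term by term frequency times inverse document frequency.

    Assumption: a smoothed idf, log((1 + N) / (1 + df)) + 1, is used so that
    terms absent from the corpus do not cause a division by zero.
    """
    documents = [set(doc.lower().split()) for doc in corpus]
    num_docs = len(documents)
    term_counts = Counter(query.lower().split())
    weights = {}
    for term, count in term_counts.items():
        # Document frequency: how many documents contain the term at all.
        doc_freq = sum(1 for doc in documents if term in doc)
        idf = math.log((1 + num_docs) / (1 + doc_freq)) + 1.0
        weights[term] = count * idf
    return weights
```

With a corpus in which "the" appears in every document and "cat" in only one, the query "the cat" gives "cat" a larger weight than "the", so rare and informative terms dominate the reformulated query.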
-
Ask the Right Questions: Active Question Reformulation with Reinforcement Learning
Authors:
Christian Buck,
Jannis Bulian,
Massimiliano Ciaramita,
Wojciech Gajewski,
Andrea Gesmundo,
Neil Houlsby,
Wei Wang
Abstract:
We frame Question Answering (QA) as a Reinforcement Learning task, an approach that we call Active Question Answering. We propose an agent that sits between the user and a black box QA system and learns to reformulate questions to elicit the best possible answers. The agent probes the system with potentially many natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer. The reformulation system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. The agent outperforms a state-of-the-art base model, playing the role of the environment, and other benchmarks. We also analyze the language that the agent has learned while interacting with the question answering system. We find that successful question reformulations look quite different from natural language paraphrases. The agent is able to discover non-trivial reformulation strategies that resemble classic information retrieval techniques such as term re-weighting (tf-idf) and stemming.
Submitted 2 March, 2018; v1 submitted 22 May, 2017;
originally announced May 2017.
-
Fixed-parameter Tractable Distances to Sparse Graph Classes
Authors:
Jannis Bulian,
Anuj Dawar
Abstract:
We show that for various classes C of sparse graphs, and several measures of distance to such classes (such as edit distance and elimination distance), the problem of determining the distance of a given graph G to C is fixed-parameter tractable. The results are based on two general techniques. The first of these, building on recent work of Grohe et al., establishes that any class of graphs that is slicewise nowhere dense and slicewise first-order definable is FPT. The second shows that determining the elimination distance of a graph G to a minor-closed class C is FPT.
Submitted 20 February, 2015;
originally announced February 2015.
-
Graph Isomorphism Parameterized by Elimination Distance to Bounded Degree
Authors:
Jannis Bulian,
Anuj Dawar
Abstract:
A commonly studied means of parameterizing graph problems is the deletion distance from triviality (Guo et al. 2004), which counts vertices that need to be deleted from a graph to place it in some class for which efficient algorithms are known. In the context of graph isomorphism, we define triviality to mean a graph with maximum degree bounded by a constant, as such graph classes admit polynomial-time isomorphism tests. We generalise deletion distance to a measure we call elimination distance to triviality, based on elimination trees or tree-depth decompositions. We establish that graph canonisation, and thus graph isomorphism, is FPT when parameterized by elimination distance to bounded degree, extending results of Bouland et al. (2012).
Submitted 16 October, 2014; v1 submitted 18 June, 2014;
originally announced June 2014.
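The elimination distance named in the abstract above is usually stated recursively. The following is a sketch of that definition as it is commonly given, with the notation $\mathrm{ed}_{\mathcal{C}}$ chosen here for illustration; the paper itself is the authoritative source for the exact formulation.

```latex
\[
\mathrm{ed}_{\mathcal{C}}(G) =
\begin{cases}
0 & \text{if } G \in \mathcal{C},\\[2pt]
1 + \min_{v \in V(G)} \mathrm{ed}_{\mathcal{C}}(G - v)
  & \text{if } G \notin \mathcal{C} \text{ and } G \text{ is connected},\\[2pt]
\max \{\, \mathrm{ed}_{\mathcal{C}}(H) : H \text{ a connected component of } G \,\}
  & \text{otherwise.}
\end{cases}
\]
```

Read this way, the measure generalises deletion distance in the same way that tree-depth generalises vertex count: after each deleted vertex, the remaining components are handled independently, and only the worst component contributes to the total.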
-
Bare canonicity of representable cylindric and polyadic algebras
Authors:
Jannis Bulian,
Ian Hodkinson
Abstract:
We show that for finite n at least 3, every first-order axiomatisation of the varieties of representable n-dimensional cylindric algebras, diagonal-free cylindric algebras, polyadic algebras, and polyadic equality algebras contains an infinite number of non-canonical formulas. We also show that the class of structures for each of these varieties is non-elementary. The proofs employ algebras derived from random graphs.
Submitted 27 February, 2012;
originally announced February 2012.