Skip to main content

Showing 1–50 of 96 results for author: Bowman, S R

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.15518  [pdf, other

    cs.CL cs.LG

    Steering Without Side Effects: Improving Post-Deployment Control of Language Models

    Authors: Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

    Abstract: Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such m… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  2. arXiv:2406.10162  [pdf, other

    cs.AI cs.CL

    Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

    Authors: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

    Abstract: In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be to… ▽ More

    Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: Make it easier to find samples from the model, and highlight that our operational definition of reward tampering has false positives where the model attempts to complete the task honestly but edits the reward. Add paragraph to conclusion to this effect, and add sentence to figure 1 to this effect

  3. arXiv:2404.15758  [pdf, other

    cs.CL cs.AI

    Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

    Authors: Jacob Pfau, William Merrill, Samuel R. Bowman

    Abstract: Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard… ▽ More

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 17 pages, 10 figures

    ACM Class: I.2.6

  4. arXiv:2404.13076  [pdf, other

    cs.CL cs.AI

    LLM Evaluators Recognize and Favor Their Own Generations

    Authors: Arjun Panickssery, Samuel R. Bowman, Shi Feng

    Abstract: Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators cons… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

  5. arXiv:2403.05518  [pdf, other

    cs.CL cs.AI

    Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

    Authors: James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin

    Abstract: While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models' behavior--for example, rationalizing answers in line with a user's opinion without mentioning this bias. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervis… ▽ More

    Submitted 8 March, 2024; originally announced March 2024.

  6. arXiv:2402.06782  [pdf, other

    cs.AI cs.CL

    Debating with More Persuasive LLMs Leads to More Truthful Answers

    Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

    Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this… ▽ More

    Submitted 30 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: For code please check: https://github.com/ucl-dark/llm_debate

  7. arXiv:2401.05566  [pdf, other

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec , et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa… ▽ More

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  8. arXiv:2311.12022  [pdf, other

    cs.AI cs.CL

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

    Abstract: We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert v… ▽ More

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: 28 pages, 5 figures, 7 tables

  9. arXiv:2311.08702  [pdf, other

    cs.AI cs.CL

    Debate Helps Supervise Unreliable Experts

    Authors: Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman

    Abstract: As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts, which have access to the truth but may not accurately report it, to give answers that are systematically true and don't just superficially seem true, when the supervisor can… ▽ More

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: 84 pages, 13 footnotes, 5 figures, 4 tables, 28 debate transcripts; data and code at https://github.com/julianmichael/debate/tree/2023-nyu-experiments

    ACM Class: I.2.0

  10. arXiv:2310.13548  [pdf, other

    cs.CL cs.AI cs.LG stat.ML

    Towards Understanding Sycophancy in Language Models

    Authors: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

    Abstract: Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that… ▽ More

    Submitted 27 October, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

    Comments: 32 pages, 20 figures

    ACM Class: I.2.6

  11. arXiv:2308.03296  [pdf, other

    cs.LG cs.CL stat.ML

    Studying Large Language Model Generalization with Influence Functions

    Authors: Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

    Abstract: When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set?… ▽ More

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: 119 pages, 47 figures, 22 tables

  12. arXiv:2307.13702  [pdf, other

    cs.AI cs.CL cs.LG

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Authors: Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume , et al. (5 additional authors not shown)

    Abstract: Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change… ▽ More

    Submitted 16 July, 2023; originally announced July 2023.

  13. arXiv:2307.11768  [pdf, other

    cs.CL cs.AI cs.LG

    Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

    Authors: Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam McCandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkatesa Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez

    Abstract: As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perfo… ▽ More

    Submitted 25 July, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

    Comments: For few-shot examples and prompts, see https://github.com/anthropics/DecompositionFaithfulnessPaper

  14. arXiv:2306.09479  [pdf, other

    cs.CL cs.AI cs.CY

    Inverse Scaling: When Bigger Isn't Better

    Authors: Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zheng** Zhou, Najoung Kim , et al. (2 additional authors not shown)

    Abstract: Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling… ▽ More

    Submitted 12 May, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: Published in TMLR (2023), 39 pages

    Journal ref: Transactions on Machine Learning Research (TMLR), 10/2023, https://openreview.net/forum?id=DwgRm72GQF

  15. ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

    Authors: **gyuan Selena She, Christopher Potts, Samuel R. Bowman, Atticus Geiger

    Abstract: A number of recent benchmarks seek to assess how well models handle natural language negation. However, these benchmarks lack the controlled example paradigms that would allow us to infer whether a model had learned how negation morphemes semantically scope. To fill these analytical gaps, we present the Scoped Negation NLI (ScoNe-NLI) benchmark, which contains contrast sets of six examples with up… ▽ More

    Submitted 30 May, 2023; originally announced May 2023.

  16. arXiv:2305.14279  [pdf, other

    cs.CL

    Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

    Authors: Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

    Abstract: Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criteria for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are… ▽ More

    Submitted 2 February, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to TMLR: https://openreview.net/forum?id=5nBqY1y96B

    Journal ref: Transactions on Machine Learning Research (2024)

  17. arXiv:2305.04388  [pdf, other

    cs.CL cs.AI

    Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    Authors: Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman

    Abstract: Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that… ▽ More

    Submitted 9 December, 2023; v1 submitted 7 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  18. arXiv:2304.00612  [pdf, other

    cs.CL cs.AI

    Eight Things to Know about Large Language Models

    Authors: Samuel R. Bowman

    Abstract: The widespread public deployment of large language models (LLMs) in recent months has prompted a wave of new attention and engagement from advocates, policymakers, and scholars from many fields. This attention is a timely response to the many urgent questions that this technology raises, but it can sometimes miss important considerations. This paper surveys the evidence for eight potentially surpr… ▽ More

    Submitted 2 April, 2023; originally announced April 2023.

  19. arXiv:2303.16749  [pdf, other

    cs.SE cs.AI cs.CL cs.LG

    Improving Code Generation by Training with Natural Language Feedback

    Authors: Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez

    Abstract: The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedbac… ▽ More

    Submitted 22 February, 2024; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: Published in (and superceded by) TMLR: https://openreview.net/forum?id=xo3hI5MwvU

  20. arXiv:2302.08582  [pdf, other

    cs.CL cs.LG

    Pretraining Language Models with Human Preferences

    Authors: Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez

    Abstract: Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark… ▽ More

    Submitted 14 June, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: ICML 2023

  21. arXiv:2302.07459  [pdf, other

    cs.CL

    The Capacity for Moral Self-Correction in Large Language Models

    Authors: Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma , et al. (24 additional authors not shown)

    Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability… ▽ More

    Submitted 18 February, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

  22. arXiv:2212.10003  [pdf, other

    cs.CL

    (QA)$^2$: Question Answering with Questionable Assumptions

    Authors: Najoung Kim, Phu Mon Htut, Samuel R. Bowman, Jackson Petty

    Abstract: Naturally occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers for information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answer… ▽ More

    Submitted 29 August, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023 camera-ready

  23. arXiv:2212.09251  [pdf, other

    cs.CL cs.AI cs.LG

    Discovering Language Model Behaviors with Model-Written Evaluations

    Authors: Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion , et al. (38 additional authors not shown)

    Abstract: As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from inst… ▽ More

    Submitted 19 December, 2022; originally announced December 2022.

    Comments: for associated data visualizations, see https://www.evals.anthropic.com/model-written/ for full datasets, see https://github.com/anthropics/evals

  24. arXiv:2212.08073  [pdf, other

    cs.CL cs.AI

    Constitutional AI: Harmlessness from AI Feedback

    Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite , et al. (26 additional authors not shown)

    Abstract: As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supe… ▽ More

    Submitted 15 December, 2022; originally announced December 2022.

  25. arXiv:2211.03540  [pdf, other

    cs.HC cs.AI cs.CL

    Measuring Progress on Scalable Oversight for Large Language Models

    Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse , et al. (21 additional authors not shown)

    Abstract: Develo** safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think abou… ▽ More

    Submitted 11 November, 2022; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: v2 fixes a few typos from v1

  26. arXiv:2210.10860  [pdf, other

    cs.CL

    Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions

    Authors: Alicia Parrish, Harsh Trivedi, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Amanpreet Singh Saimbhi, Samuel R. Bowman

    Abstract: The use of language-model-based question-answering systems to aid humans in completing difficult tasks is limited, in part, by the unreliability of the text these systems generate. Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human… ▽ More

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: 12 pages, 6 figures, 7 tables

  27. arXiv:2208.12852  [pdf, other

    cs.CL cs.AI

    What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

    Authors: Julian Michael, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, Nikita Nangia, Richard Yuanzhe Pang, Jason Phang, Samuel R. Bowman

    Abstract: We present the results of the NLP Community Metasurvey. Run from May to June 2022, the survey elicited opinions on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies: For example, respondents are split almost exactly in half on questions about the importance of artificial general intelligence, w… ▽ More

    Submitted 26 August, 2022; originally announced August 2022.

    Comments: 31 pages, 19 figures, 3 tables; more information at https://nlpsurvey.net

    ACM Class: I.2.7

  28. arXiv:2208.07998  [pdf, other

    cs.CL

    What Artificial Neural Networks Can Tell Us About Human Language Acquisition

    Authors: Alex Warstadt, Samuel R. Bowman

    Abstract: Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today's most effective neural language models are trained on roughly… ▽ More

    Submitted 11 February, 2024; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: Please cite the published version with the following information: @incollection{warstadt2022artificial, title={What artificial neural networks can tell us about human language acquisition}, author={Warstadt, Alex and Bowman, Samuel R.}, booktitle={Algebraic Structures in Natural Language}, pages={17--60}, year={2022}, publisher={CRC Press} }

  29. arXiv:2206.04615  [pdf, other

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur… ▽ More

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  30. arXiv:2205.11465  [pdf, ps, other

    cs.CL cs.AI

    SQuALITY: Building a Long-Document Summarization Dataset the Hard Way

    Authors: Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman

    Abstract: Summarization datasets are often assembled either by scra** naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to develo** summarization bench… ▽ More

    Submitted 23 May, 2022; originally announced May 2022.

  31. arXiv:2205.10782  [pdf, other

    cs.CL

    Instruction Induction: From Few Examples to Natural Language Task Descriptions

    Authors: Or Honovich, Uri Shaham, Samuel R. Bowman, Omer Levy

    Abstract: Large language models are able to perform a task by conditioning on a few input-output demonstrations - a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge,… ▽ More

    Submitted 22 May, 2022; originally announced May 2022.

  32. arXiv:2204.05212  [pdf, other

    cs.CL

    Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions

    Authors: Alicia Parrish, Harsh Trivedi, Ethan Perez, Angelica Chen, Nikita Nangia, Jason Phang, Samuel R. Bowman

    Abstract: Current QA systems can generate reasonable-sounding yet false answers without explanation or evidence for the generated answer, which is especially problematic when humans cannot readily check the model's answers. This presents a challenge for building trust in machine learning systems. We take inspiration from real-world situations where difficult questions are answered by considering opposing si… ▽ More

    Submitted 13 April, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to the 2022 ACL Workshop on Learning with Natural Language Supervision. 12 pages total, 9 figures, 2 tables

  33. arXiv:2203.06342  [pdf, other

    cs.CL cs.AI

    What Makes Reading Comprehension Questions Difficult?

    Authors: Saku Sugawara, Nikita Nangia, Alex Warstadt, Samuel R. Bowman

    Abstract: For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select text sources to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for… ▽ More

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  34. arXiv:2112.08608  [pdf, other

    cs.CL

    QuALITY: Question Answering with Long Input Texts, Yes!

    Authors: Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman

    Abstract: To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than rely… ▽ More

    Submitted 11 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: NAACL 2022

  35. arXiv:2111.08181  [pdf, other

    cs.CL

    Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

    Authors: Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman

    Abstract: More capable language models increasingly saturate existing task benchmarks, in some cases outperforming humans. This has left little headroom with which to measure further progress. Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets, and two common approaches are: (1) filtering out easy examples and (2) model-in-the-loop data collection. In this wo… ▽ More

    Submitted 15 November, 2021; originally announced November 2021.

  36. arXiv:2110.08355  [pdf, other

    cs.CL

    Clean or Annotate: How to Spend a Limited Data Collection Budget

    Authors: Derek Chen, Zhou Yu, Samuel R. Bowman

    Abstract: Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of such noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. Secondly, prior works have also consider… ▽ More

    Submitted 12 June, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: 17 pages, 3 figures, 6 tables. Accepted to NAACL 2022 workshop

  37. arXiv:2110.08300  [pdf, other

    cs.CL cs.AI

    The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail

    Authors: Samuel R. Bowman

    Abstract: Researchers in NLP often frame and discuss research results in ways that serve to deemphasize the field's successes, often in response to the field's widespread hype. Though well-meaning, this has yielded many misleading or false claims about the limits of our best technology. This is a problem, and it may be more serious than it looks: It harms our credibility in ways that can make it harder to m… ▽ More

    Submitted 10 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Proceedings of ACL 2022

  38. arXiv:2110.08193  [pdf, other

    cs.CL

    BBQ: A Hand-Built Bias Benchmark for Question Answering

    Authors: Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman

    Abstract: It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions rele… ▽ More

    Submitted 15 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Accepted to ACL 2022 Findings. 20 pages, 10 figures

  39. arXiv:2109.08406  [pdf, other

    cs.CL

    Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

    Authors: Jason Phang, Haokun Liu, Samuel R. Bowman

    Abstract: Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In exper… ▽ More

    Submitted 20 September, 2021; v1 submitted 17 September, 2021; originally announced September 2021.

    Comments: BlackboxNLP 2021

  40. arXiv:2109.06987  [pdf, other

    cs.CL

    NOPE: A Corpus of Naturally-Occurring Presuppositions in English

    Authors: Alicia Parrish, Sebastian Schuster, Alex Warstadt, Omar Agha, Soo-Hwan Lee, Zhuoye Zhao, Samuel R. Bowman, Tal Linzen

    Abstract: Understanding language requires gras** not only the overtly stated content, but also making inferences about things that were left unsaid. These inferences include presuppositions, a phenomenon by which a listener learns about new information through reasoning about what a speaker takes as given. Presuppositions require complex understanding of the lexical and syntactic properties that trigger t… ▽ More

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: CoNLL 2021. Data and code available at https://github.com/nyu-mll/nope

  41. arXiv:2106.00840  [pdf, other

    cs.CL

    Comparing Test Sets with Item Response Theory

    Authors: Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Samuel R. Bowman

    Abstract: Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind… ▽ More

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  42. arXiv:2106.00794  [pdf, other

    cs.CL cs.AI cs.HC

    What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

    Authors: Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara Vania, Samuel R. Bowman

    Abstract: Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving… ▽ More

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  43. arXiv:2104.07179  [pdf, other

    cs.CL

    Does Putting a Linguist in the Loop Improve NLU Data Collection?

    Authors: Alicia Parrish, William Huang, Omar Agha, Soo-Hwan Lee, Nikita Nangia, Alex Warstadt, Karmanya Aggarwal, Emily Allaway, Tal Linzen, Samuel R. Bowman

    Abstract: Many crowdsourced NLP datasets contain systematic gaps and biases that are identified only after data collection is complete. Identifying these issues from early data samples during crowdsourcing should make mitigation more efficient, especially when done iteratively. We take natural language inference as a test case and ask whether it is beneficial to put a linguist `in the loop' during data coll… ▽ More

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: 14 pages, 10 figures

  44. arXiv:2104.02145  [pdf, other

    cs.CL

    What Will it Take to Fix Benchmarking in Natural Language Understanding?

    Authors: Samuel R. Bowman, George E. Dahl

    Abstract: Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform… ▽ More

    Submitted 15 October, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Proceedings of NAACL 2020. This revision adds a missing acknowledgment

  45. arXiv:2011.04946  [pdf, other

    cs.CL

    When Do You Need Billions of Words of Pretraining Data?

    Authors: Yian Zhang, Alex Warstadt, Haau-Sing Li, Samuel R. Bowman

    Abstract: NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods---classifier probing, information-theoretic probing, unsupervised r… ▽ More

    Submitted 10 November, 2020; originally announced November 2020.

    Comments: 10 pages, 6 figures

  46. arXiv:2010.06122  [pdf, other

    cs.CL

    Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options

    Authors: Clara Vania, Ruijie Chen, Samuel R. Bowman

    Abstract: Large-scale natural language inference (NLI) datasets such as SNLI or MNLI have been created by asking crowdworkers to read a premise and write three new hypotheses, one for each possible semantic relationships (entailment, contradiction, and neutral). While this protocol has been used to create useful benchmark data, it remains unclear whether the writing-based annotation protocol is optimal for… ▽ More

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: AACL 2020

  47. arXiv:2010.05358  [pdf, other

    cs.CL

    Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

    Authors: Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman

    Abstract: One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-turning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mi… ▽ More

    Submitted 11 October, 2020; originally announced October 2020.

    Comments: accepted at EMNLP 2020

  48. arXiv:2010.04762  [pdf, other

    cs.CL

    Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

    Authors: William Huang, Haokun Liu, Samuel R. Bowman

    Abstract: A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks---datasets collected from crowdworkers to create an evaluation task---while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data---data built by minimally editing a set of seed exa… ▽ More

    Submitted 9 October, 2020; originally announced October 2020.

  49. arXiv:2010.04043  [pdf, other

    cs.CL

    Precise Task Formalization Matters in Winograd Schema Evaluations

    Authors: Haokun Liu, William Huang, Dhara A. Mungra, Samuel R. Bowman

    Abstract: Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization---the combination o… ▽ More

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Accepted to the EMNLP 2020 conference

  50. arXiv:2010.00133  [pdf, other

    cs.CL cs.AI

    CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

    Authors: Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman

    Abstract: Pretrained language models, especially masked language models (MLMs) have seen success across many NLP tasks. However, there is ample evidence that they use the cultural biases that are undoubtedly present in the corpora they are trained on, implicitly creating harm with biased representations. To measure some forms of social bias in language models against protected demographic groups in the US,… ▽ More

    Submitted 30 September, 2020; originally announced October 2020.

    Comments: EMNLP 2020