Search | arXiv e-print repository

Spontaneous Reward Hacking in Iterative Self-Refinement

Authors: Jane Pan, He He, Samuel R. Bowman, Shi Feng

Abstract: Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator, providing feedback along with numerical ratings which the generator attempts to optimize. However, because the evaluator is an imperfect proxy of user preference, this optimization can lead to reward hacking, where the evaluator's ratings improve while the generation quality remains stagnant or even decreases as judged by actual user preference. The concern of reward hacking is heightened in iterative self-refinement where the generator and the evaluator use the same underlying language model, in which case the optimization pressure can drive them to exploit shared vulnerabilities. Using an essay editing task, we show that iterative self-refinement leads to deviation between the language model evaluator and human judgment, demonstrating that reward hacking can occur spontaneously in-context with the use of iterative self-refinement. In addition, we study conditions under which reward hacking occurs and observe two factors that affect reward hacking severity: model size and context sharing between the generator and the evaluator.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.15518

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Authors: Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

Abstract: Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue on cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.10162

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Authors: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

Abstract: In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.

Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Make it easier to find samples from the model, and highlight that our operational definition of reward tampering has false positives where the model attempts to complete the task honestly but edits the reward. Add paragraph to conclusion to this effect, and add sentence to figure 1 to this effect

arXiv:2404.15758

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Authors: Jacob Pfau, William Merrill, Samuel R. Bowman

Abstract: Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 17 pages, 10 figures

ACM Class: I.2.6

arXiv:2404.13076

LLM Evaluators Recognize and Favor Their Own Generations

Authors: Arjun Panickssery, Samuel R. Bowman, Shi Feng

Abstract: Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2403.05518

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Authors: James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin

Abstract: While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models' behavior--for example, rationalizing answers in line with a user's opinion without mentioning this bias. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where supervision for ground truth reasoning is unavailable.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.06782

Debating with More Persuasive LLMs Leads to More Truthful Answers

Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.

Submitted 30 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: For code please check: https://github.com/ucl-dark/llm_debate

arXiv:2401.05566

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: updated to add missing acknowledgements

arXiv:2311.12022

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

Abstract: We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 28 pages, 5 figures, 7 tables

arXiv:2311.08702

Debate Helps Supervise Unreliable Experts

Authors: Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman

Abstract: As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts, which have access to the truth but may not accurately report it, to give answers that are systematically true and don't just superficially seem true, when the supervisor can't tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth. We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by 'expert' debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy's 74%. Debates are also more efficient, being 68% of the length of consultancies. By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.

Submitted 15 November, 2023; originally announced November 2023.

Comments: 84 pages, 13 footnotes, 5 figures, 4 tables, 28 debate transcripts; data and code at https://github.com/julianmichael/debate/tree/2023-nyu-experiments

ACM Class: I.2.0

arXiv:2311.08200

Using Old Laboratory Equipment with Modern Web-of-Things Standards: a Smart Laboratory with LabThings Retro

Authors: Samuel McDermott, Jurij Kotar, Joel Collins, Leonardo Mancini, Richard Bowman, Pietro Cicuta

Abstract: There has been an increasing, and welcome, Open Hardware trend towards science teams building and sharing their designs for new instruments. These devices, often built upon low-cost microprocessors and micro-controllers, can be readily connected to enable complex, automated, and smart experiments. When designed to use open communication web standards, devices from different laboratories and manufacturers can be controlled using a single protocol, and even communicate with each other. However, science labs still have a majority of old, perfectly functional, equipment which tends to use older, and sometimes proprietary, standards for communications. In order to encourage the continued and integrated use of this equipment in modern automated experiments, we develop and demonstrate LabThings Retro. This allows us to retrofit old instruments to use modern web-of-things standards, which we demonstrate with closed-loop feedback involving an optical microscope, digital imaging and fluid pumping.

Submitted 14 November, 2023; originally announced November 2023.

Comments: Supplementary material - video demonstration available at https://zenodo.org/doi/10.5281/zenodo.10123735

arXiv:2310.13548

Towards Understanding Sycophancy in Language Models

Authors: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Abstract: Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Submitted 27 October, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: 32 pages, 20 figures

ACM Class: I.2.6

arXiv:2310.06496

Spatially resolved photoluminescence analysis of Se passivation and defect formation in CdSe$_{x}$Te$_{1-x}$ thin films

Authors: Alan R Bowman, Jacob J Leaver, Kyle Frohna, Samuel D Stranks, Giulia Tagliabue, Jon D Major

Abstract: CdTe is the most commercially successful thin-film photovoltaic technology to date. The recent development of Se-alloyed CdSe$_{x}$Te$_{1-x}$ layers in CdTe solar cells has led to higher device efficiencies, due to a lowered bandgap improving the photocurrent, improved voltage characteristics and longer carrier lifetimes. Evidence from cross-sectional electron microscopy is widely believed to indicate that Se passivates defects in CdSe$_{x}$Te$_{1-x}$ solar cells, and that this is the reason for better lifetimes and voltages in these devices. Here, we utilise spatially resolved photoluminescence measurements of CdSe$_{x}$Te$_{1-x}$ thin films on glass to study the effects of Se on carrier recombination in the material, isolated from the impact of conductive interfaces and without the need to prepare cross-sections through the samples. We find further evidence to support Se passivation of grain boundaries, but also identify an associated increase in below-bandgap photoluminescence that indicates the presence of Se-enhanced luminescent defects. Our results show that Se treatment, in tandem with Cl passivation, does increase radiative efficiencies. However, the simultaneous enhancement of defects within the grain interiors suggests that although it is overall beneficial, Se incorporation may still ultimately limit the maximum attainable efficiency of CdSe$_{x}$Te$_{1-x}$ solar cells.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2308.03296

Studying Large Language Model Generalization with Influence Functions

Authors: Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

Abstract: When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.

Submitted 7 August, 2023; originally announced August 2023.

Comments: 119 pages, 47 figures, 22 tables

arXiv:2307.13702

Measuring Faithfulness in Chain-of-Thought Reasoning

Authors: Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, et al. (5 additional authors not shown)

Abstract: Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.

Submitted 16 July, 2023; originally announced July 2023.

arXiv:2307.11768 [pdf, other]

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Authors: Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam McCandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkatesa Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez

Abstract: As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.

Submitted 25 July, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

Comments: For few-shot examples and prompts, see https://github.com/anthropics/DecompositionFaithfulnessPaper

arXiv:2307.09324 [pdf]

doi 10.1021/acsenergylett.3c01505

Interfacial Hot Carrier Collection Controls Plasmonic Chemistry

Authors: Fatemeh Kiani, Alan R. Bowman, Milad Sabzehparvar, Can O. Karaman, Ravishankar Sundararaman, Giulia Tagliabue

Abstract: Harnessing non-equilibrium hot carriers from plasmonic metal nanostructures constitutes a vibrant research field. It promises to enable control of activity and selectivity of photochemical reactions, especially for solar fuel generation. However, a comprehensive understanding of the interplay of plasmonic hot carrier-driven processes in metal/semiconducting heterostructures has remained elusive. In this work, we reveal the complex interdependence between plasmon excitation, hot carrier generation, transport and interfacial collection in plasmonic photocatalytic devices, uniquely determining the charge injection efficiencies at the solid/solid and solid/liquid interfaces. Interestingly, by measuring the internal quantum efficiency of ultrathin (14 to 33 nm) single-crystalline plasmonic gold (Au) nanoantenna arrays on titanium dioxide substrates, we find that the performance of the device is governed by hot hole collection at the metal/electrolyte interface. In particular, by combining a solid- and liquid-state experimental approach with ab initio simulations, we show a more efficient collection of high-energy d-band holes traveling in [111] orientation, resulting in a stronger oxidation reaction at the {111} surfaces of the nanoantenna. These results thus establish new guidelines for the design and optimization of plasmonic photocatalytic systems and optoelectronic devices.

Submitted 18 July, 2023; originally announced July 2023.

Journal ref: ACS Energy Lett. 2023, 8, 10, 4242-4250

arXiv:2307.08477 [pdf]

doi 10.1038/s41377-024-01408-2

Quantum-mechanical effects in photoluminescence from thin crystalline gold films

Authors: Alan R. Bowman, Álvaro Rodríguez Echarri, Fatemeh Kiani, Fadil Iyikanat, Ted V. Tsoulos, Joel D. Cox, Ravishankar Sundararaman, F. Javier García de Abajo, Giulia Tagliabue

Abstract: Luminescence constitutes a unique source of insight into hot carrier processes in metals, including those in plasmonic nanostructures used for sensing and energy applications. However, being weak in nature, metal luminescence remains poorly understood, its microscopic origin strongly debated, and its potential for unravelling nanoscale carrier dynamics largely unexploited. Here, we reveal quantum-mechanical effects emanating in the luminescence from thin monocrystalline gold flakes. Specifically, we present experimental evidence, supported by first-principles simulations, to demonstrate its photoluminescence origin when exciting in the interband regime. Our model allows us to identify changes to the measured gold luminescence due to quantum-mechanical effects as the gold film thickness is reduced. Excitingly, such effects are observable in the luminescence signal from flakes up to 40 nm in thickness, associated with the out-of-plane discreteness of the electronic band structure near the Fermi level. We qualitatively reproduce the observations with first-principles modelling, thus establishing a unified description of luminescence in gold and enabling its widespread application as a probe of carrier dynamics and light-matter interactions in this material. Our study paves the way for future explorations of hot-carriers and charge-transfer dynamics in a multitude of material systems.

Submitted 25 September, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Main text 21 pages and 4 figures. Supplemental Information 33 pages and 17 figures

Journal ref: Light. Sci. Appl. 13, 91 (2024)

arXiv:2306.09479 [pdf, other]

Inverse Scaling: When Bigger Isn't Better

Authors: Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, et al. (2 additional authors not shown)

Abstract: Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

Submitted 12 May, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: Published in TMLR (2023), 39 pages

Journal ref: Transactions on Machine Learning Research (TMLR), 10/2023, https://openreview.net/forum?id=DwgRm72GQF

arXiv:2305.19426 [pdf, other]

doi 10.18653/v1/2023.acl-short.154

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

Authors: Jingyuan Selena She, Christopher Potts, Samuel R. Bowman, Atticus Geiger

Abstract: A number of recent benchmarks seek to assess how well models handle natural language negation. However, these benchmarks lack the controlled example paradigms that would allow us to infer whether a model had learned how negation morphemes semantically scope. To fill these analytical gaps, we present the Scoped Negation NLI (ScoNe-NLI) benchmark, which contains contrast sets of six examples with up to two negations where either zero, one, or both negative morphemes affect the NLI label. We use ScoNe-NLI to assess fine-tuning and in-context learning strategies. We find that RoBERTa and DeBERTa models solve ScoNe-NLI after many shot fine-tuning. For in-context learning, we test InstructGPT models and find that most prompt strategies are not successful, including those using step-by-step reasoning. To better understand this result, we extend ScoNe with ScoNe-NLG, a sentence completion test set that embeds negation reasoning in short narratives. Here, InstructGPT is successful, which reveals the model can correctly reason about negation, but struggles to do so on prompt-adapted NLI examples outside of its core pretraining regime.

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.14279 [pdf, other]

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Authors: Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

Abstract: Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criteria for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.

Submitted 2 February, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted to TMLR: https://openreview.net/forum?id=5nBqY1y96B

Journal ref: Transactions on Machine Learning Research (2024)

arXiv:2305.04388 [pdf, other]

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Authors: Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman

Abstract: Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"--which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.

Submitted 9 December, 2023; v1 submitted 7 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023

arXiv:2304.00612 [pdf, other]

Eight Things to Know about Large Language Models

Authors: Samuel R. Bowman

Abstract: The widespread public deployment of large language models (LLMs) in recent months has prompted a wave of new attention and engagement from advocates, policymakers, and scholars from many fields. This attention is a timely response to the many urgent questions that this technology raises, but it can sometimes miss important considerations. This paper surveys the evidence for eight potentially surprising such points: 1. LLMs predictably get more capable with increasing investment, even without targeted innovation. 2. Many important LLM behaviors emerge unpredictably as a byproduct of increasing investment. 3. LLMs often appear to learn and use representations of the outside world. 4. There are no reliable techniques for steering the behavior of LLMs. 5. Experts are not yet able to interpret the inner workings of LLMs. 6. Human performance on a task isn't an upper bound on LLM performance. 7. LLMs need not express the values of their creators nor the values encoded in web text. 8. Brief interactions with LLMs are often misleading.

Submitted 2 April, 2023; originally announced April 2023.

arXiv:2303.16749 [pdf, other]

Improving Code Generation by Training with Natural Language Feedback

Authors: Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez

Abstract: The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the ground truth distribution and demonstrate a proof-of-concept on a neural program synthesis task. We use ILF to improve a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on code generation tasks.

Submitted 22 February, 2024; v1 submitted 28 March, 2023; originally announced March 2023.

Comments: Published in (and superseded by) TMLR: https://openreview.net/forum?id=xo3hI5MwvU

arXiv:2303.08993 [pdf]

doi 10.1016/j.bpj.2023.03.028

Folding@home: achievements from over twenty years of citizen science herald the exascale era

Authors: Vincent A. Voelz, Vijay S. Pande, Gregory R. Bowman

Abstract: Simulations of biomolecules have enormous potential to inform our understanding of biology but require extremely demanding calculations. For over twenty years, the Folding@home distributed computing project has pioneered a massively parallel approach to biomolecular simulation, harnessing the resources of citizen scientists across the globe. Here, we summarize the scientific and technical advances this perspective has enabled. As the project's name implies, the early years of Folding@home focused on driving advances in our understanding of protein folding by developing statistical methods for capturing long-timescale processes and facilitating insight into complex dynamical processes. Success laid a foundation for broadening the scope of Folding@home to address other functionally relevant conformational changes, such as receptor signaling, enzyme dynamics, and ligand binding. Continued algorithmic advances, hardware developments such as GPU-based computing, and the growing scale of Folding@home have enabled the project to focus on new areas where massively parallel sampling can be impactful. While previous work sought to expand toward larger proteins with slower conformational changes, new work focuses on large-scale comparative studies of different protein sequences and chemical compounds to better understand biology and inform the development of small molecule drugs. Progress on these fronts enabled the community to pivot quickly in response to the COVID-19 pandemic, expanding to become the world's first exascale computer and deploying this massive resource to provide insight into the inner workings of the SARS-CoV-2 virus and aid the development of new antivirals. This success provides a glimpse of what's to come as exascale supercomputers come online, and Folding@home continues its work.

Submitted 15 March, 2023; originally announced March 2023.

Comments: 24 pages, 6 figures

arXiv:2302.13806 [pdf, other]

doi 10.1146/annurev-physchem-101422-030127

Remembering the work of Phillip L. Geissler: A coda to his scientific trajectory

Authors: Gregory R. Bowman, Stephen J. Cox, Christoph Dellago, Kateri H. DuBay, Joel D. Eaves, Daniel A. Fletcher, Layne B. Frechette, Michael Grünwald, Katherine Klymko, JiYeon Ku, Ahmad K. Omar, Eran Rabani, David R. Reichman, Julia R. Rogers, Andreana M. Rosnik, Grant M. Rotskoff, Anna R. Schneider, Nadine Schwierz, David A. Sivak, Suriyanarayanan Vaikuntanathan, Stephen Whitelam, Asaph Widmer-Cooper

Abstract: Phillip L. Geissler made important contributions to the statistical mechanics of biological polymers, heterogeneous materials, and chemical dynamics in aqueous environments. He devised analytical and computational methods that revealed the underlying organization of complex systems at the frontiers of biology, chemistry, and materials science. In this retrospective, we celebrate his work at these frontiers.

Submitted 24 February, 2023; originally announced February 2023.

Journal ref: Ann. Rev. Phys. Chem. 74, 11.1-11.27 (2023)

arXiv:2302.08582 [pdf, other]

Pretraining Language Models with Human Preferences

Authors: Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez

Abstract: Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.

Submitted 14 June, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: ICML 2023

arXiv:2302.07459 [pdf, other]

The Capacity for Moral Self-Correction in Large Language Models

Authors: Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma, et al. (24 additional authors not shown)

Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

Submitted 18 February, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2212.13273 [pdf]

doi 10.1021/acsami.2c21984

Discontinuous metric programming in liquid crystalline elastomers

Authors: Tayler S. Hebner, Riley G. A. Bowman, Daniel Duffy, Cyrus Mostajeran, Itay Griniasty, Itai Cohen, Mark Warner, Christopher N. Bowman, Timothy J. White

Abstract: Liquid crystalline elastomers (LCEs) are shape-changing materials that exhibit large deformations in response to applied stimuli. Local control of the orientation of LCEs spatially directs the deformation of these materials to realize spontaneous shape change in response to stimuli. Prior approaches to shape programming in LCEs utilize patterning techniques that involve the detailed inscription of spatially varying nematic fields to produce sheets. These patterned sheets deform into elaborate geometries with complex Gaussian curvatures. Here, we present an alternative approach to realize shape-morphing in LCEs where spatial patterning of the crosslink density locally regulates the material deformation magnitude on either side of a prescribed interface curve. We also present a simple mathematical model describing the behavior of these materials. Further experiments coupled with the mathematical model demonstrate the control of the sign of Gaussian curvature, which is used in combination with heat transfer effects to design LCEs that self-clean as a result of temperature-dependent actuation properties.

Submitted 26 December, 2022; originally announced December 2022.

arXiv:2212.10003 [pdf, other]

(QA)$^2$: Question Answering with Questionable Assumptions

Authors: Najoung Kim, Phu Mon Htut, Samuel R. Bowman, Jackson Petty

Abstract: Naturally occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers for information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical "when" question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)$^2$ (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)$^2$, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. Through human rater acceptability on end-to-end QA with (QA)$^2$, we find that current models do struggle with handling questionable assumptions, leaving substantial headroom for progress.

Submitted 29 August, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: ACL 2023 camera-ready

arXiv:2212.09251 [pdf, other]

Discovering Language Model Behaviors with Model-Written Evaluations

Authors: Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, et al. (38 additional authors not shown)

Abstract: As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.

Submitted 19 December, 2022; originally announced December 2022.

Comments: for associated data visualizations, see https://www.evals.anthropic.com/model-written/ and for full datasets, see https://github.com/anthropics/evals

arXiv:2212.08073 [pdf, other]

Constitutional AI: Harmlessness from AI Feedback

Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, et al. (26 additional authors not shown)

Abstract: As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

Submitted 15 December, 2022; originally announced December 2022.

arXiv:2211.03540 [pdf, other]

Measuring Progress on Scalable Oversight for Large Language Models

Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, et al. (21 additional authors not shown)

Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Submitted 11 November, 2022; v1 submitted 4 November, 2022; originally announced November 2022.

Comments: v2 fixes a few typos from v1

arXiv:2210.10860 [pdf, other]

Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions

Authors: Alicia Parrish, Harsh Trivedi, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Amanpreet Singh Saimbhi, Samuel R. Bowman

Abstract: The use of language-model-based question-answering systems to aid humans in completing difficult tasks is limited, in part, by the unreliability of the text these systems generate. Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown that just a single turn of arguments in this format is not helpful to humans. However, as debate settings are characterized by a back-and-forth dialogue, we follow up on previous results to test whether adding a second round of counter-arguments is helpful to humans. We find that, regardless of whether they have access to arguments or not, humans perform similarly on our task. These findings suggest that, in the case of answering reading comprehension questions, debate is not a helpful format.

Submitted 19 October, 2022; originally announced October 2022.

Comments: 12 pages, 6 figures, 7 tables

arXiv:2209.14947 [pdf, ps, other]

doi 10.1098/rsos.221236

Controlling and scripting laboratory hardware with open-source, intuitive interfaces: OpenFlexure Voice Control and OpenFlexure Blockly

Authors: Samuel McDermott, Richard Bowman, Kerrianne Harrington, William Wadsworth, Pietro Cicuta

Abstract: Making user interaction with laboratory equipment more convenient and intuitive should promote experimental work and help researchers to complete their tasks efficiently. The most common form of interaction in current instrumentation is either direct tactile, with buttons and knobs, or interfaced through a computer, using a mouse and keyboard. Scripting is another function typical of smart and automated laboratory equipment, yet users are currently required to learn bespoke programming languages and libraries for individual pieces of equipment. In this paper, we present two open-source, novel and intuitive ways of interacting with and scripting laboratory equipment. We choose the OpenFlexure family of microscopes as our exemplar, due to their open-source nature and smart control system. Firstly, we demonstrate 'OpenFlexure Voice Control' to enable users to control the microscope hands-free. Secondly, we present 'OpenFlexure Blockly' which uses the Blockly Visual Programming Language to enable users to easily create scripts for the microscope, using a drag and drop web interface. We explain the design choices when developing these tools, and discuss more typical use cases and more general applications.

Submitted 2 February, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

arXiv:2208.12852 [pdf, other]

What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

Authors: Julian Michael, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, Nikita Nangia, Richard Yuanzhe Pang, Jason Phang, Samuel R. Bowman

Abstract: We present the results of the NLP Community Metasurvey. Run from May to June 2022, the survey elicited opinions on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies: For example, respondents are split almost exactly in half on questions about the importance of artificial general intelligence, whether language models understand language, and the necessity of linguistic structure and inductive bias for solving NLP problems. In addition, the survey posed meta-questions, asking respondents to predict the distribution of survey responses. This allows us not only to gain insight on the spectrum of beliefs held by NLP researchers, but also to uncover false sociological beliefs where the community's predictions don't match reality. We find such mismatches on a wide range of issues. Among other results, the community greatly overestimates its own belief in the usefulness of benchmarks and the potential for scaling to solve real-world problems, while underestimating its own belief in the importance of linguistic structure, inductive bias, and interdisciplinary science.

Submitted 26 August, 2022; originally announced August 2022.

Comments: 31 pages, 19 figures, 3 tables; more information at https://nlpsurvey.net

ACM Class: I.2.7

arXiv:2208.07998 [pdf, other]

What Artificial Neural Networks Can Tell Us About Human Language Acquisition

Authors: Alex Warstadt, Samuel R. Bowman

Abstract: Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today's most effective neural language models are trained on roughly one thousand times the amount of linguistic data available to a typical child. To increase the relevance of learnability results from computational models, we need to train model learners without significant advantages over humans. If an appropriate model successfully acquires some target linguistic knowledge, it can provide a proof of concept that the target is learnable in a hypothesized human learning scenario. Plausible model learners will enable us to carry out experimental manipulations to make causal inferences about variables in the learning environment, and to rigorously test poverty-of-the-stimulus-style claims arguing for innate linguistic knowledge in humans on the basis of speculations about learnability. Comparable experiments will never be possible with human subjects due to practical and ethical considerations, making model learners an indispensable resource. So far, attempts to deprive current models of unfair advantages obtain sub-human results for key grammatical behaviors such as acceptability judgments.
But before we can justifiably conclude that language learning requires more prior domain-specific knowledge than current models possess, we must first explore non-linguistic inputs in the form of multimodal stimuli and multi-agent interaction as ways to make our learners more efficient at learning from limited linguistic input.

Submitted 11 February, 2024; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: Please cite the published version with the following information: @incollection{warstadt2022artificial, title={What artificial neural networks can tell us about human language acquisition}, author={Warstadt, Alex and Bowman, Samuel R.}, booktitle={Algebraic Structures in Natural Language}, pages={17--60}, year={2022}, publisher={CRC Press} }

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2205.11465 [pdf, ps, other]

SQuALITY: Building a Long-Document Summarization Dataset the Hard Way

Authors: Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman

Abstract: Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to developing summarization benchmark data: We hire highly-qualified contractors to read stories and write original summaries from scratch. To amortize reading time, we collect five summaries per document, with the first giving an overview and the subsequent four addressing specific questions. We use this protocol to collect SQuALITY, a dataset of question-focused summaries built on the same public-domain short stories as the multiple-choice dataset QuALITY (Pang et al., 2021). Experiments with state-of-the-art summarization systems show that our dataset is challenging and that existing automatic evaluation metrics are weak indicators of quality.

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.10782 [pdf, other]

Instruction Induction: From Few Examples to Natural Language Task Descriptions

Authors: Or Honovich, Uri Shaham, Samuel R. Bowman, Omer Levy

Abstract: Large language models are able to perform a task by conditioning on a few input-output demonstrations - a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.

Submitted 22 May, 2022; originally announced May 2022.

arXiv:2204.09997 [pdf, other]

Computational ghost imaging for transmission electron microscopy

Authors: Akhil Kallepalli, Lorenzo Viani, Daan Stellinga, Enzo Rotunno, Ming-Jie Sun, Richard Bowman, Paolo Rosi, Stefano Frabboni, Roberto Balboni, Andrea Migliori, Vincenzo Grillo, Miles Padgett

Abstract: While transmission electron microscopes (TEM) can achieve a much higher resolution than optical microscopes, they face challenges of damage to samples during the high energy processes involved. Here, we explore using computational ghost imaging techniques in electron microscopy to reduce the total required intensity. The technological lack of the equivalent high-resolution, optical spatial light modulator for electrons means that a different approach needs to be pursued. To this end, we show a beam shaping technique based on the use of a distribution of electrically charged metal needles to structure the beam, alongside a novel reconstruction method to handle the resulting highly non-orthogonal patterns. Second, we illustrate the application of this ghost imaging approach in electron microscopy. To test the full extent of the capabilities of this technique, we realised an analogous optical setup method. In both regimes, the ability to reduce the amount of total illumination intensity is evident in comparison to raster scanning.

Submitted 21 April, 2022; originally announced April 2022.

Comments: 6 figures, 10 pages

arXiv:2204.05212 [pdf, other]

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions

Authors: Alicia Parrish, Harsh Trivedi, Ethan Perez, Angelica Chen, Nikita Nangia, Jason Phang, Samuel R. Bowman

Abstract: Current QA systems can generate reasonable-sounding yet false answers without explanation or evidence for the generated answer, which is especially problematic when humans cannot readily check the model's answers. This presents a challenge for building trust in machine learning systems. We take inspiration from real-world situations where difficult questions are answered by considering opposing sides (see Irving et al., 2018). For multiple-choice QA examples, we build a dataset of single arguments for both a correct and incorrect answer option in a debate-style set-up as an initial step in training models to produce explanations for two candidate answers. We use long contexts -- humans familiar with the context write convincing explanations for pre-selected correct and incorrect answers, and we test if those explanations allow humans who have not read the full context to more accurately determine the correct answer. We do not find that explanations in our set-up improve human accuracy, but a baseline condition shows that providing human-selected text snippets does improve accuracy. We use these findings to suggest ways of improving the debate set up for future data collection efforts.

Submitted 13 April, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted to the 2022 ACL Workshop on Learning with Natural Language Supervision. 12 pages total, 9 figures, 2 tables

arXiv:2203.06342 [pdf, other]

What Makes Reading Comprehension Questions Difficult?

Authors: Saku Sugawara, Nikita Nangia, Alex Warstadt, Samuel R. Bowman

Abstract: For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select text sources to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for passages taken from seven qualitatively distinct sources, analyzing what attributes of passages contribute to the difficulty and question types of the collected examples. To our surprise, we find that passage source, length, and readability measures do not significantly affect question difficulty. Through our manual annotation of seven reasoning types, we observe several trends between passage sources and reasoning types, e.g., logical reasoning is more often required in questions written for technical passages. These results suggest that when creating a new benchmark dataset, selecting a diverse set of passages can help ensure a diverse range of question types, but that passage difficulty need not be a priority.

Submitted 11 March, 2022; originally announced March 2022.

Comments: ACL 2022

arXiv:2201.02163 [pdf]

Optical properties of Au-Hf thin films

Authors: Hugh Littlehailes, William R. Hendren, Robert M. Bowman, Fumin Huang

Abstract: The optical properties of thin films of intermetallic Au$_{3}$Hf were experimentally investigated for the first time, which display clear plasmonic properties in the optical and near infrared region with negative permittivity. In contrast to similar alloys, such as films of Au$_{3}$Zr, the films express more negative $ε'$ values and lower $ε''$ values across most of the wavelengths (370-1570 nm) investigated. The Au$_{3}$Hf films were fabricated by DC magnetron sputtering at a range of deposition temperatures, from room temperature to 415$^{o}$C, and annealed at different vacuum levels. The films mostly formed as a combination of Au$_{3}$Hf, Au$_{2}$Hf and Au$_{4}$Hf phases when deposited below 400$^{o}$C, and exclusively Au$_{3}$Hf phase at above 400$^{o}$C, indicating key conditions for isolating this phase. The films were stable when annealed at 10$^{-8}$ Torr, but when annealed again at 10$^{-6}$ Torr the films oxidised and changed into a mix of Au-Hf phases, suggesting resistance to oxidization may be an issue for un-encapsulated applications at elevated temperatures.

Submitted 6 January, 2022; originally announced January 2022.

Comments: 19 pages, including references, plus 3 pages of supplementary material. 8 figures and 1 table in main text, 1 figure and 1 table in supplementary material

arXiv:2112.08608 [pdf, other]

QuALITY: Question Answering with Long Input Texts, Yes!

Authors: Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman

Abstract: To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).

Submitted 11 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: NAACL 2022

arXiv:2112.05804 [pdf, other]

doi 10.1364/OE.450211

Multi-modal microscopy imaging with the OpenFlexure Delta Stage

Authors: Samuel McDermott, Filip Ayazi, Joel Collins, Joe Knapper, Julian Stirling, Richard Bowman, Pietro Cicuta

Abstract: Microscopes are vital pieces of equipment in much of biological research and medical diagnostics. However, access to a microscope can represent a bottleneck in research, especially in lower-income countries. 'Smart' computer controlled motorized microscopes, which can perform automated routines or acquire images in a range of modalities, are even more expensive and inaccessible. Developing low-cost, open-source, smart microscopes enables more researchers to conceive and execute optimized or more complex experiments. Here we present the OpenFlexure Delta Stage, a 3D-printed microscope designed for researchers. Powered by the OpenFlexure software stack, it is capable of performing automated experiments. The design files and assembly instructions are freely available under an open licence. Its intuitive and modular design -- along with detailed documentation -- allows researchers to implement a variety of imaging modes with ease. The versatility of this microscope is demonstrated by imaging biological and non-biological samples (red blood cells with Plasmodium parasites and colloidal particles) in brightfield, epi-fluorescence, darkfield, Rheinberg and differential phase contrast. We present the design strategy and choice of tools to develop devices accessible to researchers from lower-income countries, as well as the advantages of an open-source project in this context.
This microscope, having been open-source since its conception, has already been built and tested by researchers around the world, promoting a community of expertise and an environment of reproducibility in science.

Submitted 30 June, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

arXiv:2112.05631 [pdf, other]

doi 10.1063/5.0076901

autohaem: 3D printed devices for automated preparation of blood smears

Authors: Samuel McDermott, Jaehyeon Kim, Aikaterini Anna Leledaki, Duncan Parry, Louis Lee, Alexandre Kabla, Catherine Mkindi, Richard Bowman, Pietro Cicuta

Abstract: The process of making blood smears is common in both research and clinical settings, for investigating the health of blood cells and the presence of blood-borne parasites. It is very often carried out manually. We focus here on smears for malaria diagnosis and research which are frequently analyzed by optical microscopy and require a high quality. Automating the smear preparation promises to increase throughput and to improve the quality and consistency of the smears. We present here two devices (manual and motorized) designed to aid in the making of blood smears. These are fully documented, open-source hardware, and an important principle was to make them easily fabricated locally anywhere. Designs and assembly instructions are freely available under an open license. We also describe an image analysis pipeline for characterizing the quality of smears, and use it to optimize the settings and tunable parameters in the two devices. The devices perform as well as expert human operators, while not requiring a trained operator and offering potential advantages in reproducibility and standardization across facilities.

Submitted 18 January, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

arXiv:2112.03078 [pdf, other]

A study of singlet fission-halide perovskite interfaces

Authors: Alan R. Bowman, Samuel D. Stranks, Bartomeu Monserrat

Abstract: A method for improving the efficiency of solar cells is combining a low-bandgap semiconductor with a singlet fission material (which converts one high energy singlet into two low energy triplets following photoexcitation). Here we present a study of the interface between singlet fission molecules and low-bandgap halide perovskites. We briefly show a range of experiments screening for triplet transfer into a halide perovskite. However, triplet transfer was not observed in any case. This motivated us to understand the halide perovskite/singlet fission interface better by carrying out first-principles calculations using tetracene and cesium lead iodide. We found that tetracene molecules/thin films preferentially orient themselves parallel to/perpendicular to the halide perovskite's surface, in a similar way to their orientation on other inorganic semiconductors. We present formation energies of all interfaces, which are significantly less favourable than for bulk tetracene, indicative of weak interaction at the interface. It was not possible to calculate excitonic states at the full interface due to computational limitations, so we instead present highly speculative toy interfaces between tetracene and a halide-perovskite-like structure. In these models we focus on replicating tetracene's electronic states correctly. We find that tetracene's singlet and triplet energies are comparable to those of bulk tetracene, and the triplet is strongly localised on a single tetracene molecule, even at an interface. Our work provides new understanding of the interface between tetracene and halide perovskites, explores the potential for modelling excitons at interfaces, and begins to explain the difficulties in extracting triplets directly into inorganic semiconductors.

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.08181 [pdf, other]

Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

Authors: Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman

Abstract: More capable language models increasingly saturate existing task benchmarks, in some cases outperforming humans. This has left little headroom with which to measure further progress. Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets, and two common approaches are: (1) filtering out easy examples and (2) model-in-the-loop data collection. In this work, we study the impact of applying each approach to create more challenging evaluation datasets. We adapt the AFLite algorithm to filter evaluation data, and run experiments against 18 different adversary models. We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used. However, the resulting ranking of models can also be unstable and highly sensitive to the choice of adversary model used. Moreover, AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the most contentiously labeled examples. Smaller-scale experiments on the adversarially collected datasets ANLI and AdversarialQA show similar findings, broadly lowering performance with stronger adversaries while disproportionately affecting the adversary model.

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2110.08355 [pdf, other]

Clean or Annotate: How to Spend a Limited Data Collection Budget

Authors: Derek Chen, Zhou Yu, Samuel R. Bowman

Abstract: Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of such noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. The second, considered in prior work, uses the entire annotation budget to label as many examples as possible and subsequently applies denoising algorithms to implicitly clean the dataset. We find a middle ground and propose an approach which reserves a fraction of annotations to explicitly clean up highly probable error samples to optimize the annotation process. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify specific examples that appear most likely to be incorrect, which we spend the remaining budget to relabel. Experiments across three model variations and four natural language processing tasks show our approach outperforms or matches both label aggregation and advanced denoising methods designed to handle noisy labels when allocated the same finite annotation budget.

Submitted 12 June, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

Comments: 17 pages, 3 figures, 6 tables. Accepted to NAACL 2022 workshop

Showing 1–50 of 152 results for author: Bowman, R