-
Anomalous spin precession systematic effects in the search for a muon EDM using the frozen-spin technique
Authors:
G. Cavoto,
R. Chakraborty,
A. Doinaki,
C. Dutsov,
M. Giovannozzi,
T. Hume,
K. Kirch,
K. Michielsen,
L. Morvaj,
A. Papa,
F. Renga,
M. Sakurai,
P. Schmidt-Wellenburg
Abstract:
At the Paul Scherrer Institut (PSI), we are currently working on the development of a high-precision apparatus with the aim of searching for the muon electric dipole moment (EDM) with unprecedented sensitivity. The underpinning principle of this experiment is the frozen-spin technique, a method that suppresses the spin precession due to the anomalous magnetic moment, thereby enhancing the signal-to-noise ratio for EDM signals. This increased sensitivity facilitates measurements that would be difficult to achieve with conventional $g - 2$ muon storage rings. Given the availability of the $p = 125$ MeV/$c$ muon beam at PSI, the anticipated statistical sensitivity for the EDM after a year of data collection is $6\times 10^{-23}e\cdot$cm. To achieve this goal, it is imperative to meticulously analyse and mitigate any potential spurious effects that could mimic EDM signals. In this study, we present a quantitative methodology to evaluate the systematic effects that might arise in the context of employing the frozen-spin technique within a compact storage ring. Our approach entails the analytical derivation of equations governing the motion of the muon spin in the electromagnetic (EM) fields intrinsic to the experimental setup, validated through subsequent numerical simulations. We also illustrate a method to calculate the cumulative geometric (Berry's) phase. This work complements ongoing experimental efforts to detect a muon EDM at PSI and contributes to a broader understanding of spin-precession systematic effects.
Submitted 17 November, 2023;
originally announced November 2023.
-
Specific versus General Principles for Constitutional AI
Authors:
Sandipan Kundu,
Yuntao Bai,
Saurav Kadavath,
Amanda Askell,
Andrew Callahan,
Anna Chen,
Anna Goldie,
Avital Balwit,
Azalia Mirhoseini,
Brayden McLean,
Catherine Olsson,
Cassie Evraets,
Eli Tran-Johnson,
Esin Durmus,
Ethan Perez,
Jackson Kernion,
Jamie Kerr,
Kamal Ndousse,
Karina Nguyen,
Nelson Elhage,
Newton Cheng,
Nicholas Schiefer,
Nova DasSarma,
Oliver Rausch,
Robin Larson
, et al. (11 additional authors not shown)
Abstract:
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
Submitted 20 October, 2023;
originally announced October 2023.
-
Measuring Faithfulness in Chain-of-Thought Reasoning
Authors:
Tamera Lanham,
Anna Chen,
Ansh Radhakrishnan,
Benoit Steiner,
Carson Denison,
Danny Hernandez,
Dustin Li,
Esin Durmus,
Evan Hubinger,
Jackson Kernion,
Kamilė Lukošiūtė,
Karina Nguyen,
Newton Cheng,
Nicholas Joseph,
Nicholas Schiefer,
Oliver Rausch,
Robin Larson,
Sam McCandlish,
Sandipan Kundu,
Saurav Kadavath,
Shannon Yang,
Thomas Henighan,
Timothy Maxwell,
Timothy Telleen-Lawton,
Tristan Hume
, et al. (5 additional authors not shown)
Abstract:
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
Submitted 16 July, 2023;
originally announced July 2023.
-
Operating the GridPix detector with helium-isobutane gas mixtures for a high-precision, low-mass Time Projection Chamber
Authors:
G. Cavoto,
C. Dutsov,
M. Gruber,
M. Hildebrandt,
T. D. Hume,
J. Kaminski,
F. Neuhaus,
A. Papa,
F. Renga,
P. Schmidt-Wellenburg,
M. Schott,
B. Vitali,
C. Voena
Abstract:
High precision experiments with muons and pions often require tracking charged particles with $O(100~μ\mathrm{m})$ single-hit resolution, possibly with particle identification capabilities, down to very low momenta ($p \lesssim 100$~MeV/$c$). In such conditions, the particle trajectories are strongly affected by the interaction with the detector material, and the reconstruction of the kinematic observables consequently deteriorates. A good compromise between resolution and material budget can be obtained with a Time Projection Chamber (TPC), if very light gases and a high-granularity readout are used. In this paper, we present a characterization of the GridPix detector in helium-isobutane gas mixtures, within a TPC with 9~cm maximum drift. Measurements of the main electron drift properties for these gas mixtures are also presented.
Submitted 8 September, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
The Capacity for Moral Self-Correction in Large Language Models
Authors:
Deep Ganguli,
Amanda Askell,
Nicholas Schiefer,
Thomas I. Liao,
Kamilė Lukošiūtė,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Catherine Olsson,
Danny Hernandez,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Ethan Perez,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Joshua Landau,
Kamal Ndousse,
Karina Nguyen,
Liane Lovitt,
Michael Sellitto,
Nelson Elhage,
Noemi Mercado,
Nova DasSarma
, et al. (24 additional authors not shown)
Abstract:
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
Submitted 18 February, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Discovering Language Model Behaviors with Model-Written Evaluations
Authors:
Ethan Perez,
Sam Ringer,
Kamilė Lukošiūtė,
Karina Nguyen,
Edwin Chen,
Scott Heiner,
Craig Pettit,
Catherine Olsson,
Sandipan Kundu,
Saurav Kadavath,
Andy Jones,
Anna Chen,
Ben Mann,
Brian Israel,
Bryan Seethor,
Cameron McKinnon,
Christopher Olah,
Da Yan,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Guro Khundadze,
Jackson Kernion
, et al. (38 additional authors not shown)
Abstract:
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Submitted 19 December, 2022;
originally announced December 2022.
-
Constitutional AI: Harmlessness from AI Feedback
Authors:
Yuntao Bai,
Saurav Kadavath,
Sandipan Kundu,
Amanda Askell,
Jackson Kernion,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Carol Chen,
Catherine Olsson,
Christopher Olah,
Danny Hernandez,
Dawn Drain,
Deep Ganguli,
Dustin Li,
Eli Tran-Johnson,
Ethan Perez,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse,
Kamile Lukosuite
, et al. (26 additional authors not shown)
Abstract:
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
Submitted 15 December, 2022;
originally announced December 2022.
-
Systematic effects in the search for the muon electric dipole moment using the frozen-spin technique
Authors:
Chavdar Dutsov,
Timothy Hume,
Philipp Schmidt-Wellenburg
Abstract:
At the Paul Scherrer Institute (PSI) we are developing a high precision instrument to measure the muon electric dipole moment (EDM). The experiment is based on the frozen-spin method in which the spin precession induced by the anomalous magnetic moment is suppressed, thus increasing the signal-to-noise ratio for EDM signals to achieve a sensitivity otherwise unattainable using conventional $g-2$ muon storage rings. The expected statistical sensitivity for the EDM after a year of data taking is $6\times 10^{-23} e\cdot$cm with the $p = 125$ MeV/c muon beam available at the PSI. Reaching this goal necessitates a comprehensive analysis of spurious effects that mimic the EDM signal. This work discusses a quantitative approach to study systematic effects for the frozen-spin method when searching for the muon EDM. Equations for the motion of the muon spin in the electromagnetic fields of the experimental system are analytically derived and validated by simulation.
Submitted 24 November, 2022;
originally announced November 2022.
-
Measuring Progress on Scalable Oversight for Large Language Models
Authors:
Samuel R. Bowman,
Jeeyoon Hyun,
Ethan Perez,
Edwin Chen,
Craig Pettit,
Scott Heiner,
Kamilė Lukošiūtė,
Amanda Askell,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Christopher Olah,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse
, et al. (21 additional authors not shown)
Abstract:
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
Submitted 11 November, 2022; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Toy Models of Superposition
Authors:
Nelson Elhage,
Tristan Hume,
Catherine Olsson,
Nicholas Schiefer,
Tom Henighan,
Shauna Kravec,
Zac Hatfield-Dodds,
Robert Lasenby,
Dawn Drain,
Carol Chen,
Roger Grosse,
Sam McCandlish,
Jared Kaplan,
Dario Amodei,
Martin Wattenberg,
Christopher Olah
Abstract:
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
Submitted 21 September, 2022;
originally announced September 2022.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Authors:
Deep Ganguli,
Liane Lovitt,
Jackson Kernion,
Amanda Askell,
Yuntao Bai,
Saurav Kadavath,
Ben Mann,
Ethan Perez,
Nicholas Schiefer,
Kamal Ndousse,
Andy Jones,
Sam Bowman,
Anna Chen,
Tom Conerly,
Nova DasSarma,
Dawn Drain,
Nelson Elhage,
Sheer El-Showk,
Stanislav Fort,
Zac Hatfield-Dodds,
Tom Henighan,
Danny Hernandez,
Tristan Hume,
Josh Jacobson,
Scott Johnston
, et al. (11 additional authors not shown)
Abstract:
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
Submitted 22 November, 2022; v1 submitted 23 August, 2022;
originally announced September 2022.
-
Room-temperature emission of muonium from aerogel and zeolite targets
Authors:
A. Antognini,
P. Crivelli,
L. Gerchow,
T. D. Hume,
K. Kirch,
A. Knecht,
J. Nuber,
A. Papa,
N. Ritjoho,
M. Sakurai,
A. Soter,
D. Taqqu,
S. M. Vogiatzi,
J. Zhang,
L. Ziegler
Abstract:
A low-emittance, high-intensity atomic beam of muonium ($\mathrm{M}=μ^+ + \mathrm{e}^-$) using superfluid helium as muon-to-muonium converter is being developed at the Paul Scherrer Institute (PSI). This beam could advance laser spectroscopy of muonium and allow the first atomic interferometry experiments for the direct observation of the M gravitational interaction. In this paper, we describe the development of compact detection schemes which resulted in the background-suppressed observation of atomic muonium in vacuum, and can be adapted for cryogenic measurements. Using these setups, we compared the emission characteristics of various muonium production targets using low momentum ($p_μ = 11$-$13~$MeV/c) muons, and observed muonium emission from zeolite targets into vacuum for the first time. For a specific laser-ablated aerogel target, we determined a muon-to-vacuum-muonium conversion efficiency of $7.23 \pm 0.05 \text{(stat)} ^{+1.06}_{-0.76}\text{(sys)}\,\text{%}$, assuming thermal emission of muonium. Moreover, we investigated muonium-helium collisions and from these we determined an upper temperature limit of 0.3 K for the superfluid helium converter.
Submitted 24 August, 2022;
originally announced August 2022.
-
Language Models (Mostly) Know What They Know
Authors:
Saurav Kadavath,
Tom Conerly,
Amanda Askell,
Tom Henighan,
Dawn Drain,
Ethan Perez,
Nicholas Schiefer,
Zac Hatfield-Dodds,
Nova DasSarma,
Eli Tran-Johnson,
Scott Johnston,
Sheer El-Showk,
Andy Jones,
Nelson Elhage,
Tristan Hume,
Anna Chen,
Yuntao Bai,
Sam Bowman,
Stanislav Fort,
Deep Ganguli,
Danny Hernandez,
Josh Jacobson,
Jackson Kernion,
Shauna Kravec,
Liane Lovitt
, et al. (11 additional authors not shown)
Abstract:
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
Submitted 21 November, 2022; v1 submitted 11 July, 2022;
originally announced July 2022.
-
Scaling Laws and Interpretability of Learning from Repeated Data
Authors:
Danny Hernandez,
Tom Brown,
Tom Conerly,
Nova DasSarma,
Dawn Drain,
Sheer El-Showk,
Nelson Elhage,
Zac Hatfield-Dodds,
Tom Henighan,
Tristan Hume,
Scott Johnston,
Ben Mann,
Chris Olah,
Catherine Olsson,
Dario Amodei,
Nicholas Joseph,
Jared Kaplan,
Sam McCandlish
Abstract:
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.
Submitted 20 May, 2022;
originally announced May 2022.
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Authors:
Yuntao Bai,
Andy Jones,
Kamal Ndousse,
Amanda Askell,
Anna Chen,
Nova DasSarma,
Dawn Drain,
Stanislav Fort,
Deep Ganguli,
Tom Henighan,
Nicholas Joseph,
Saurav Kadavath,
Jackson Kernion,
Tom Conerly,
Sheer El-Showk,
Nelson Elhage,
Zac Hatfield-Dodds,
Danny Hernandez,
Tristan Hume,
Scott Johnston,
Shauna Kravec,
Liane Lovitt,
Neel Nanda,
Catherine Olsson,
Dario Amodei
, et al. (6 additional authors not shown)
Abstract:
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
Submitted 12 April, 2022;
originally announced April 2022.