Showing 1–5 of 5 results for author: Davies, X

  1. arXiv:2309.05973  [pdf, other]

    cs.CL cs.LG

    Circuit Breaking: Removing Model Behaviors with Targeted Ablation

    Authors: Maximilian Li, Xander Davies, Max Nadeau

    Abstract: Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where t…

    Submitted 29 January, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

    Journal ref: Workshop on Challenges in Deployable Generative AI at International Conference on Machine Learning (ICML), Honolulu, Hawaii, USA. 2023

  2. arXiv:2307.15217  [pdf, other]

    cs.AI cs.CL cs.LG

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, et al. (7 additional authors not shown)

    Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and rel…

    Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

  3. arXiv:2307.03637  [pdf, other]

    cs.AI

    Discovering Variable Binding Circuitry with Desiderata

    Authors: Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

    Abstract: Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of \textit{deside…

    Submitted 7 July, 2023; originally announced July 2023.

  4. arXiv:2303.11934  [pdf, other]

    cs.NE cond-mat.dis-nn cs.AI cs.LG q-bio.NC

    Sparse Distributed Memory is a Continual Learner

    Authors: Trenton Bricken, Xander Davies, Deepak Singh, Dmitry Krotov, Gabriel Kreiman

    Abstract: Continual learning is a problem for artificial neural networks that their biological counterparts are adept at solving. Building on work using Sparse Distributed Memory (SDM) to connect a core neural circuit with the powerful Transformer model, we create a modified Multi-Layered Perceptron (MLP) that is a strong continual learner. We find that every component of our MLP variant translated from bio…

    Submitted 20 March, 2023; originally announced March 2023.

    Comments: 9 Pages. ICLR Acceptance

    Journal ref: ICLR 2023

  5. arXiv:2303.06173  [pdf, other]

    cs.LG cs.AI

    Unifying Grokking and Double Descent

    Authors: Xander Davies, Lauro Langosco, David Krueger

    Abstract: A principled understanding of generalization in deep learning may require unifying disparate observations under a single conceptual framework. Previous work has studied \emph{grokking}, a training dynamic in which a sustained period of near-perfect training performance and near-chance test performance is eventually followed by generalization, as well as the superficially similar \emph{double desce…

    Submitted 10 March, 2023; originally announced March 2023.

    Comments: ML Safety Workshop, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)