
Showing 1–7 of 7 results for author: Belrose, N

  1. arXiv:2404.05971  [pdf, other]

    cs.LG cs.AI cs.CL

    Does Transformer Interpretability Transfer to RNNs?

    Authors: Gonçalo Paulo, Thomas Marshall, Nora Belrose

    Abstract: Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for tran…

    Submitted 8 April, 2024; originally announced April 2024.

  2. arXiv:2402.04362  [pdf, other]

    cs.LG

    Neural Networks Learn Statistics of Increasing Complexity

    Authors: Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, Xiaoli Fern

    Abstract: The distributional simplicity bias (DSB) posits that neural networks learn low-order moments of the data distribution first, before moving on to higher-order correlations. In this work, we present compelling new evidence for the DSB by showing that networks automatically learn to perform well on maximum-entropy distributions whose low-order statistics match those of the training set early in train…

    Submitted 13 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  3. arXiv:2312.01037  [pdf, other]

    cs.LG cs.AI cs.CL

    Eliciting Latent Knowledge from Quirky Language Models

    Authors: Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose

    Abstract: Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions…

    Submitted 3 April, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

    Comments: Preprint

  4. arXiv:2306.03819  [pdf, other]

    cs.LG cs.CL cs.CY

    LEACE: Perfect linear concept erasure in closed form

    Authors: Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman

    Abstract: Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing t…

    Submitted 29 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

  5. arXiv:2303.08112  [pdf, other]

    cs.LG

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Authors: Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

    Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique…

    Submitted 26 November, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

  6. arXiv:2211.11972  [pdf, other]

    cs.LG cs.AI

    imitation: Clean Imitation Learning Implementations

    Authors: Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell

    Abstract: imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch. We include three inverse reinforcement learning (IRL) algorithms, three imitation learning algorithms and a preference comparison algorithm. The implementations have been benchmarked against previous results, and automated tests cover 98% of the code. Moreover, the algorithms are implemented in a…

    Submitted 21 November, 2022; originally announced November 2022.

  7. arXiv:2211.00241  [pdf, other]

    cs.LG cs.AI cs.CR stat.ML

    Adversarial Policies Beat Superhuman Go AIs

    Authors: Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

    Abstract: We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human exper…

    Submitted 13 July, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

    Comments: Accepted to ICML 2023, see paper for changelog

    ACM Class: I.2.6