Showing 1–4 of 4 results for author: Goldowsky-Dill, N

  1. arXiv:2405.12241  [pdf, other]

    cs.LG cs.AI

    Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

    Authors: Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

    Abstract: Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is the…

    Submitted 24 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  2. arXiv:2405.10928  [pdf, other]

    cs.LG

    The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

    Authors: Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

    Abstract: Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functi…

    Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  3. arXiv:2405.10927  [pdf, other]

    cs.LG

    Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

    Authors: Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn

    Abstract: Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory t…

    Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  4. arXiv:2304.05969  [pdf, other]

    cs.LG

    Localizing Model Behavior with Path Patching

    Authors: Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, Aryaman Arora

    Abstract: Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing…

    Submitted 16 May, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

    Comments: 20 pages, 16 figures