
Showing 1–1 of 1 results for author: Stöhler, J

  1. arXiv:2405.10928 [pdf, other]

    cs.LG

    The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

    Authors: Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

    Abstract: Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functi…

    Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.