Showing 1–1 of 1 results for author: Rajamanoharan, S

  1. arXiv:2404.16014

    cs.LG cs.AI

    Improving Dictionary Learning with Gated Sparse Autoencoders

    Authors: Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

    Abstract: Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to enco…

    Submitted 30 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: 15 main text pages, 22 appendix pages