Showing 1–1 of 1 results for author: Obeso, O

  1. arXiv:2406.11717 [pdf, other]

    cs.LG cs.AI cs.CL

    Refusal in Language Models Is Mediated by a Single Direction

    Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, Neel Nanda

    Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models…

    Submitted 17 June, 2024; originally announced June 2024.