Showing 1–1 of 1 results for author: Vazquez, J J

  1. arXiv:2308.10248

    cs.CL cs.LG

    Activation Addition: Steering Language Models Without Optimization

    Authors: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

    Abstract: Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implici…

    Submitted 4 June, 2024; v1 submitted 20 August, 2023; originally announced August 2023.