Showing 1–1 of 1 results for author: Getzen, D

  1. arXiv:2406.02577  [pdf, other]

    cs.CL cs.CR cs.LG

    Are PPO-ed Language Models Hackable?

    Authors: Suraj Anand, David Getzen

    Abstract: Numerous algorithms have been proposed to $\textit{align}$ language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various jailbreaks. Our paper aims to examine this effect of reward in the controlled setting of positive sentiment language generation. Instead of online training of a rewa…

    Submitted 28 May, 2024; originally announced June 2024.

    Comments: 8 pages, 4 figures