
Showing 1–1 of 1 results for author: Rosenblatt, J

  1. arXiv:2406.19552  [pdf, other]

    cs.CL cs.AI cs.LG

    Rethinking harmless refusals when fine-tuning foundation models

    Authors: Florin Pop, Judd Rosenblatt, Diogo Schwerz de Lucena, Michael Vaiana

    Abstract: In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reas…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: ICLR 2024 AGI Workshop Poster