Jailbreaking as a Reward Misspecification Problem

Xie, Zhihui; Gao, Jiahui; Li, Lei; Li, Zhenguo; Liu, Qi; Kong, Lingpeng

arXiv:2406.14393 (cs)

[Submitted on 20 Jun 2024]

Abstract: The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective over previous methods.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2406.14393 [cs.LG] (or arXiv:2406.14393v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2406.14393

Submission history

From: Zhihui Xie
[v1] Thu, 20 Jun 2024 15:12:27 UTC (284 KB)
