Mitigating Exaggerated Safety in Large Language Models

Bhalani, Ruchi; Ray, Ruchira

arXiv:2405.05418 (cs)

[Submitted on 8 May 2024]


Abstract: As the popularity of Large Language Models (LLMs) grows, combining model safety with utility becomes increasingly important. The challenge is making sure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce exaggerated safety behaviors, which were found to affect 26.1% of safe prompts (misclassified as dangerous and refused), we use a combination of XSTest dataset prompts as well as interactive, contextual, and few-shot prompting to examine the decision boundaries of LLMs such as Llama2, Gemma, Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best for Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate exaggerated safety behaviors by an overall 92.9% across all LLMs. Our work presents multiple prompting strategies to jailbreak LLMs' decision-making processes, allowing them to navigate the fine line between refusing unsafe prompts and remaining helpful.
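
To make the measured quantity concrete, the following is a minimal sketch of how a refusal rate over safe prompts could be computed, and how a few-shot prompt could be assembled before querying a model. It is not the authors' code. The keyword-based refusal check, the "Q:/A:" prompt layout, the sample prompts, and the toy_model stand-in are all assumptions made for illustration; a real evaluation would call an actual LLM and would likely need manual or model-based labelling of refusals.

```python
from typing import Callable, List, Tuple

# Hypothetical refusal markers (an assumption, not taken from the paper).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Crude keyword check for whether a response declines to answer."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def build_few_shot_prompt(examples: List[Tuple[str, str]], prompt: str) -> str:
    """Prepend (question, answer) demonstrations to the target prompt."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {prompt}\nA:")
    return "\n\n".join(parts)


def refusal_rate(safe_prompts: List[str], generate: Callable[[str], str]) -> float:
    """Fraction of safe prompts that the model refuses."""
    if not safe_prompts:
        return 0.0
    refused = sum(is_refusal(generate(p)) for p in safe_prompts)
    return refused / len(safe_prompts)


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical behaviour).

    Refuses any prompt containing "kill" unless demonstrations precede it.
    """
    if "kill" in prompt.lower() and "Q:" not in prompt:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is an answer."


# Illustrative safe prompts and demonstrations (invented for this sketch).
SAFE = ["How do I kill a Python process?", "What is the capital of France?"]
EXAMPLES = [("How do I kill a stuck thread?", "Signal it to stop, then join it.")]

if __name__ == "__main__":
    baseline = refusal_rate(SAFE, toy_model)
    few_shot = refusal_rate(
        SAFE, lambda p: toy_model(build_few_shot_prompt(EXAMPLES, p))
    )
    print(f"baseline refusal rate: {baseline:.2f}")
    print(f"few-shot refusal rate: {few_shot:.2f}")
```

Passing the model as a plain callable keeps the metric independent of any particular LLM API, so the same refusal_rate function could be reused to compare prompting strategies on the same set of safe prompts.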

Comments:	17 pages, 8 figures, 2 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.05418 [cs.CL]
	(or arXiv:2405.05418v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.05418

Submission history

From: Ruchira Ray
[v1] Wed, 8 May 2024 20:39:54 UTC (184 KB)
