JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks

Zhang, Xiaoyu; Zhang, Cen; Li, Tianlin; Huang, Yihao; Jia, Xiaojun; Hu, Ming; Zhang, Jie; Liu, Yang; Ma, Shiqing; Shen, Chao

arXiv:2312.10766 (cs)

[Submitted on 17 Dec 2023 (v1), last revised 18 Jun 2024 (this version, v3)]

Abstract: Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) have played a critical role in numerous applications. However, current LLMs are vulnerable to prompt-based attacks: jailbreaking attacks induce LLMs to generate harmful content, while hijacking attacks manipulate the model into performing unintended tasks, underscoring the need for detection methods. Unfortunately, existing detection approaches are usually tailored to specific attacks, so they generalize poorly across attack types and modalities. To address this, we propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of method or modality. Specifically, JailGuard mutates untrusted inputs to generate variants and uses the discrepancy among the model's responses to those variants to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. To evaluate the effectiveness of JailGuard, we build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types. The evaluation suggests that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, respectively, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
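
The mutate-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the single mutator (`random_deletion`), the word-set Jaccard distance used as the discrepancy measure, the threshold value, and the `model` callable are all assumptions chosen to keep the example self-contained. The abstract does not specify which discrepancy measure or which of the 18 mutators JailGuard actually uses.

```python
import random
from itertools import combinations
from typing import Callable


def random_deletion(prompt: str, rate: float, rng: random.Random) -> str:
    """Hypothetical text mutator: drop each character with probability `rate`."""
    return "".join(ch for ch in prompt if rng.random() >= rate)


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in discrepancy measure (an assumption, not the paper's metric):
    1 - |A & B| / |A | B| over the lowercase word sets of two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def detect(
    prompt: str,
    model: Callable[[str], str],
    n_variants: int = 8,
    rate: float = 0.05,
    threshold: float = 0.5,  # illustrative value, not taken from the paper
    seed: int = 0,
) -> bool:
    """Flag `prompt` as a suspected attack when the model's responses to
    mutated variants of it disagree strongly with one another."""
    if n_variants < 2:
        raise ValueError("need at least two variants to compare responses")
    rng = random.Random(seed)
    # Step 1: mutate the untrusted input into several variants.
    variants = [random_deletion(prompt, rate, rng) for _ in range(n_variants)]
    # Step 2: query the model once per variant.
    responses = [model(v) for v in variants]
    # Step 3: take the largest pairwise discrepancy among the responses.
    max_gap = max(jaccard_distance(x, y) for x, y in combinations(responses, 2))
    # Step 4: a large discrepancy suggests a fragile (attack) input.
    return max_gap >= threshold
```

Here `model` is any function that maps a prompt string to a response string, for example a thin wrapper around an LLM client. The sketch flags an input when the largest pairwise distance reaches the threshold; in practice the threshold would need to be tuned on labeled attack and benign samples.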

Comments:	28 pages, 9 figures
Subjects:	Cryptography and Security (cs.CR)
Cite as:	arXiv:2312.10766 [cs.CR]
	(or arXiv:2312.10766v3 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2312.10766

Submission history

From: Xiaoyu Zhang
[v1] Sun, 17 Dec 2023 17:02:14 UTC (769 KB)
[v2] Sat, 23 Dec 2023 14:17:31 UTC (769 KB)
[v3] Tue, 18 Jun 2024 02:21:02 UTC (4,341 KB)
