JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

Haibo **1 Leyang Hu2 Xinnuo Li3 Peiyan Zhang4 Chonghan Chen6
Jun Zhuang7
Haohan Wang1

1University of Illinois Urbana-Champaign
2Brown University
3University of Michigan Ann Arbor
4Hong Kong University of Science and Technology
6Carnegie Mellon University
7Boise State University
Abstract

The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking—deliberately circumventing the ethical and operational boundaries of LLMs and VLMs—and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models. More details can be found on our website: https://chonghan-chen.com/llm-jailbreak-zoo-survey/.

Footnote 1: The GitHub link for related papers is https://github.com/Allen-piexl/JailbreakZoo

Footnote 2: Haohan Wang is the corresponding author: [email protected]

1 Introduction

The ascent of artificial intelligence (AI) has been marked by groundbreaking advancements, particularly with the advent of large language models (LLMs) such as GPT-3 [1], GPT-4 [2], and BERT [3], as well as vision-language models (VLMs) like CLIP [4], DALL-E [5], and Flamingo [6]. Additionally, models like T5 [7] and PaLM [8] have pushed the boundaries of what is achievable with AI, demonstrating impressive capabilities across a wide range of tasks. These sophisticated AI constructs are not merely feats of engineering; they are driving innovation across diverse sectors, catalyzing breakthroughs from automated natural language processing (NLP) to sophisticated image recognition systems.

With the growing popularity of these models, ensuring their security and ethical alignment has become a subject of intense academic inquiry. Model developers have imposed built-in safety mechanisms and restrictions on the range of content that the models can output. However, these restrictions give rise to new discussions about the consistency of the safety mechanisms with the ethics of AI systems. Of particular interest is their susceptibility to “jailbreaking”: the deliberate act of manipulating AI systems to produce outputs that violate ethical guidelines.

Figure 1: An illustrative case of a successful jailbreak on an LLM: The jailbreak prompt is highlighted in orange, while the jailbreak response is marked in red.

Jailbreaking is a conventional concept in software systems, where hackers reverse engineer systems and exploit vulnerabilities to conduct privilege escalation [9]. In the context of LLMs and VLMs, “jailbreaking” refers to the process of circumventing the limitations and restrictions placed on models. It is commonly employed by developers and researchers to explore the full potential of LLMs and push the boundaries of their capabilities [10, 11]. An example of a jailbreak is shown in Fig. 1. Typically, when a user inputs “How to make a bomb”, an LLM would respond with a refusal like “Sorry, I can’t help with that.” However, if an attacker adds a jailbreak prompt, it might mislead the LLM into generating a detailed response to the question.

With increasing attention on jailbreaks of both LLMs and VLMs, several surveys [12, 13, 14] have been conducted to achieve a thorough understanding of the jailbreak strategies employed against LLMs and to formulate more sophisticated defense measures. These surveys systematically examine the rapidly expanding domain of language model safety, covering various aspects from methodologies used for jailbreaking to strategies implemented for safeguarding these advanced AI systems. To advance this field further, we re-think jailbreak strategies and defense mechanisms for both LLMs and VLMs, offering a unified perspective on both fronts. Our survey aims to achieve the following goals:

  1. Fine-Grained Categorization: We provide a detailed categorization of attack strategies and defenses, delving into specific methods to offer a comprehensive understanding.

  2. Extensive Scope of Coverage: Our review encompasses a wide range of attack strategies and defense mechanisms, capturing the breadth of tactics employed across different models and contexts.

  3. Unified Perspective: We synthesize attack and defense methodologies into a cohesive framework, presenting a unified perspective on the various approaches in this domain.

More specifically, unlike data-centric surveys, such as Liu et al. [15], which highlight dataset biases and spurious correlations, our work focuses on a systematic classification of jailbreak strategies aimed directly at compromising the structural integrity of language models. This includes both LLMs and VLMs, as well as the more intricate multi-modal language models that are becoming increasingly prevalent. Our survey casts a wider net, encompassing not only the vulnerabilities of earlier models but also the emergent generation typified by sophisticated systems such as Bard [16] and ChatGPT [1], which represent the vanguard of closed-source LLMs. Concurrently, we probe the open-source ecosystems that thrive on the distilled knowledge of these proprietary giants, such as Vicuna [17] and Llama 2 [18]. Our work categorizes jailbreaks into seven fine-grained categories, providing a comprehensive and structured analysis. Lin et al. [19] provide a structured taxonomy of attack strategies based on the intrinsic capabilities of language models, extending further by introducing the searcher framework, which consolidates different approaches to automated red teaming. Distinct from their "red-teaming" viewpoint, our analysis pivots from the perspective of jailbreaking, re-evaluating the risks associated with LLMs.

In this paper, we aim to synthesize a comprehensive perspective on the landscape of jailbreak strategies and defense mechanisms within the realms of LLMs and VLMs. The structure of our paper is shown in Fig. 2. The sections are organized as follows: In Section 2, we provide background information, starting with ethical alignment techniques such as prompt-tuning and reinforcement learning from human feedback in Section 2.1. We also cover the jailbreaking process of LLMs and VLMs in Section 2.2. Section 3 discusses threats in large language models, detailing various jailbreak strategies in Section 3.1 and exploring defense mechanisms for LLMs in Section 3.2. Comprehensive evaluation methods for these defenses are presented in Section 3.3, with additional resources provided in Section 3.4. In Section 4, we address threats in vision-language models, examining jailbreak strategies in Section 4.1 and discussing defense mechanisms for VLMs in Section 4.2. A framework for evaluating these defenses is provided in Section 4.3. Finally, Section 5 synthesizes the findings, discusses their implications, and proposes future research directions.

Figure 2: Overall structure of our paper, categorizing the ethical alignment techniques, jailbreak processes, threats, and defense mechanisms within LLMs and VLMs. We illustrate the organization of the sections, starting from background information and ethical alignment techniques, progressing through the jailbreak processes for LLMs and VLMs, and detailing the respective threats and defense strategies for both types of models.

Our main contributions are:

  • We provide a fine-grained categorization of both jailbreak strategies and defense mechanisms for LLMs and VLMs, offering a cohesive narrative of the LLM safety landscape.

  • Our work presents a unified view of jailbreak strategies and defense mechanisms, illustrating the complex interplay and dependencies within the security environments of LLMs and VLMs.

  • Through the review of jailbreaks of LLMs and VLMs, we identify gaps in current research and suggest directions for future work, which are critical to advancing the state of the art in LLM and VLM security.

2 Background

To give a more comprehensive view of the security of LLMs and VLMs, this section examines the mechanisms of alignment, covering prompt-tuning and Reinforcement Learning from Human Feedback (RLHF), and elaborates on the concept of jailbreaking. The discussion draws on a broad spectrum of research, methodologies, and implications.

2.1 Ethical Alignment

Ethical alignment in LLMs and VLMs refers to the process of ensuring that these models behave in ways that adhere to ethical guidelines, mitigate biases, and avoid generating harmful content. This is crucial for maintaining trust, safety, and fairness in AI applications. Two primary techniques for achieving ethical alignment are prompt-tuning alignment and reinforcement learning from human feedback (RLHF).

2.1.1 Prompt-tuning Alignment

Prompt-tuning alignment is a technique used to fine-tune pre-trained models by employing a specific set of prompts designed to elicit desired, ethical responses. This method aims to guide the model to generate outputs that align with ethical considerations and user expectations.

Selection of Ethical Prompts: The first step involves selecting or crafting prompts that reflect ethical use cases. These prompts are designed to cover a range of scenarios where ethical considerations are paramount. The selection process includes identifying potential areas of bias, harm, and other ethical concerns. For instance, prompts should encourage the model to generate responses that avoid reinforcing stereotypes, misinformation, or harmful advice. Ethical prompts are typically created in collaboration with domain experts and ethicists to ensure comprehensive coverage of various ethical dimensions.

Dataset Creation: A task-specific dataset $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{N}$ is created, where $\mathbf{x}_{i}$ are the input prompts and $\mathbf{y}_{i}$ are the desired ethical outputs. The dataset should include examples that address potential biases, harmful content, and other ethical concerns. This dataset acts as the foundation for the fine-tuning process, providing the model with clear examples of ethical behavior.

Fine-Tuning Process: The pre-trained model $f_{\theta}$ with parameters $\theta$ is fine-tuned on the ethical dataset. The goal is to minimize a loss function $\mathcal{L}$, typically cross-entropy loss for classification tasks:

$$\mathcal{L}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\ell(f_{\theta}(\mathbf{x}_{i}),\mathbf{y}_{i}) \tag{1}$$

Gradient Descent Optimization: The model parameters are updated using gradient descent to reduce the loss:

$$\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}(\theta) \tag{2}$$

where $\eta$ is the learning rate. This iterative process continues until the model’s responses align closely with the ethical outputs in the dataset.

Evaluation and Adjustment: After fine-tuning, the model is evaluated on a validation set to ensure it generates ethical responses. Any necessary adjustments are made by further fine-tuning or modifying the prompts.
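The loss in Eq. (1) and the update in Eq. (2) can be illustrated with a minimal sketch. It assumes a toy linear-softmax classifier standing in for $f_{\theta}$ and synthetic data standing in for the ethical dataset; the names `cross_entropy`, `gradient`, and `fine_tune`, the learning rate, and the step count are illustrative choices, not taken from the surveyed works.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(theta, X, Y):
    # Eq. (1): mean per-example loss of a linear-softmax model (toy f_theta).
    probs = softmax(X @ theta)
    return -np.mean(np.log(probs[np.arange(len(Y)), Y] + 1e-12))

def gradient(theta, X, Y):
    # Gradient of the mean cross-entropy with respect to theta.
    probs = softmax(X @ theta)
    probs[np.arange(len(Y)), Y] -= 1.0
    return X.T @ probs / len(Y)

def fine_tune(theta, X, Y, eta=0.5, steps=200):
    # Eq. (2): repeated gradient-descent updates with learning rate eta.
    for _ in range(steps):
        theta = theta - eta * gradient(theta, X, Y)
    return theta

# Synthetic stand-in for the dataset D = {(x_i, y_i)} (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
Y = (X @ rng.normal(size=(5, 3))).argmax(axis=1)

theta0 = np.zeros((5, 3))
loss_before = cross_entropy(theta0, X, Y)
theta = fine_tune(theta0, X, Y)
loss_after = cross_entropy(theta, X, Y)
```

Comparing `loss_before` with `loss_after` on held-out data plays the role of the evaluation step described above. In practice the model is a large network and the optimizer is usually a stochastic variant of gradient descent.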

Prompt-tuning has been extensively studied and applied in various NLP tasks, demonstrating significant improvements in model performance and ethical behavior. The seminal work by Brown et al. [1] on GPT-3 highlighted the potential of LLMs to generate coherent and contextually appropriate responses across a wide range of prompts. Their study underscored the importance of prompt design in steering model behavior and enhancing performance. Schick and Schütze [20] introduced the concept of “pattern-exploiting training”, which utilizes manually crafted prompts to boost the few-shot learning capabilities of language models. Their findings indicated that well-designed prompts could significantly improve model performance on various downstream tasks.

Advancements in prompt-based fine-tuning have further demonstrated its efficacy. Gao et al. [21] explored the effectiveness of prompt-based fine-tuning for enhancing zero-shot and few-shot learning in language models. They proposed an automatic prompt generation method leveraging gradient-based optimization to identify effective prompts, demonstrating notable improvements in model accuracy. Liu et al. [22] provided a comprehensive survey on prompt-based learning in NLP, reviewing numerous prompt-tuning techniques and applications. They emphasized the critical role of prompt design in achieving ethical and high-performing models, thus broadening the understanding of prompt-tuning’s potential.

Addressing ethical concerns, Reynolds and McDonell [23] examined prompt-tuning as a strategy to address model biases. Their experiments compared various prompt-tuning strategies and their effectiveness in reducing biased outputs, providing insights into the practical application of prompt-tuning for ethical alignment. Shin et al. [24] introduced AutoPrompt, an automated prompt-generation technique that significantly enhances the performance of language models across various tasks by creating prompts that elicit desired behaviors from the models. This approach showcased the potential of automated methods in prompt design.

The versatility of prompt-tuning in different NLP applications is exemplified by the work of Sun et al. [25], who explored the use of prompt-tuning for controllable text generation, demonstrating how this technique can guide language models to produce text adhering to specific ethical guidelines and stylistic requirements. Li and Liang [26] proposed Prefix-Tuning, a lightweight alternative to full-model fine-tuning that focuses on adjusting the model’s prefix embeddings. This method has shown promise in efficiently steering model behavior while preserving ethical alignment, providing a resource-efficient solution for prompt-tuning. Qin and Eisner [27] investigated the impact of prompt design on language model behavior. Their work provided valuable insights into how different prompt structures can influence the ethical and factual correctness of model outputs, furthering the understanding of prompt-tuning’s role in ethical AI.

2.1.2 Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is an advanced technique that leverages human feedback to train models to align with ethical guidelines. This approach involves multiple stages, including the collection of human feedback, reward modeling, and policy optimization.

Human Feedback Collection: Human annotators review the outputs of the language or vision-language model and provide feedback on their quality and ethical alignment. Feedback can include ratings, comments, or binary approvals/rejections. This feedback is crucial for understanding how well the model adheres to ethical standards and identifying areas that require improvement.

Reward Model Training: A reward model $R_{\phi}$ with parameters $\phi$ is trained to predict the feedback provided by human annotators. The reward model assigns a reward score $R_{\phi}(\mathbf{y}|\mathbf{x})$ to the model’s output $\mathbf{y}$ given the input $\mathbf{x}$. The reward model is trained using supervised learning on the annotated dataset, optimizing a loss function such as mean squared error:

$$\mathcal{L}(\phi)=\frac{1}{M}\sum_{i=1}^{M}(R_{\phi}(\mathbf{y}_{i}|\mathbf{x}_{i})-\mathbf{s}_{i})^{2} \tag{3}$$

where $\mathbf{s}_{i}$ represents the feedback score provided by the annotators for the output $\mathbf{y}_{i}$ given the input $\mathbf{x}_{i}$. This stage translates qualitative human feedback into a quantitative reward signal that the AI model can optimize against.
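As a rough illustration of Eq. (3), the sketch below fits a linear reward model by gradient descent on the mean squared error. It assumes that each (input, output) pair has already been mapped to a fixed-length feature vector and that the annotator scores are generated synthetically; `feats`, `hidden_pref`, and `train_reward_model` are hypothetical names introduced only for this example.

```python
import numpy as np

def reward(phi, feats):
    # Toy R_phi(y | x): a linear score over features of the (x, y) pair.
    return feats @ phi

def mse_loss(phi, feats, scores):
    # Eq. (3): mean squared error between predicted rewards and scores s_i.
    return np.mean((reward(phi, feats) - scores) ** 2)

def train_reward_model(feats, scores, eta=0.1, steps=500):
    # Plain gradient descent on the MSE loss, starting from phi = 0.
    phi = np.zeros(feats.shape[1])
    for _ in range(steps):
        residual = reward(phi, feats) - scores
        grad = 2.0 * feats.T @ residual / len(scores)
        phi = phi - eta * grad
    return phi

# Synthetic annotator scores: a hidden linear preference plus small noise.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 4))
hidden_pref = np.array([1.0, -2.0, 0.5, 0.0])
scores = feats @ hidden_pref + 0.01 * rng.normal(size=200)

phi = train_reward_model(feats, scores)
final_loss = mse_loss(phi, feats, scores)
```

A real reward model is typically a neural network over the text itself rather than a linear map over hand-made features; the linear form is used here only to keep the loss and its gradient easy to read.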

Policy Optimization: The language model $f_{\theta}$ is treated as a policy in a reinforcement learning framework, where the objective is to maximize the expected reward:

$$J(\theta)=\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\pi_{\theta}}[R_{\phi}(\mathbf{y}|\mathbf{x})] \tag{4}$$

Here, $\pi_{\theta}$ denotes the policy defined by the language model. The policy parameters $\theta$ are updated using gradient ascent:

$$\theta\leftarrow\theta+\eta\nabla_{\theta}J(\theta) \tag{5}$$

Iterative Improvement: The process of collecting human feedback, updating the reward model, and optimizing the policy is iterative. Over multiple iterations, the model’s behavior improves, aligning more closely with ethical standards.
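The objective in Eq. (4) and the update in Eq. (5) can be sketched on a deliberately small problem. The example assumes a single fixed input with three candidate responses, a softmax policy over one logit per response, and fixed reward-model scores; under these assumptions the expectation and its gradient can be computed exactly, whereas practical systems estimate them from sampled outputs. All names and values are illustrative.

```python
import numpy as np

def policy(theta):
    # pi_theta: softmax over one logit per candidate response.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def expected_reward(theta, rewards):
    # Eq. (4): J(theta) = sum_k pi_theta(k) * R(k) for a single input.
    return float(policy(theta) @ rewards)

def policy_gradient(theta, rewards):
    # Exact gradient of J for a softmax policy: pi_k * (R_k - J).
    pi = policy(theta)
    return pi * (rewards - pi @ rewards)

def optimize_policy(rewards, eta=1.0, steps=300):
    # Eq. (5): gradient ascent on the expected reward.
    theta = np.zeros_like(rewards)
    for _ in range(steps):
        theta = theta + eta * policy_gradient(theta, rewards)
    return theta

# Hypothetical reward-model scores for three candidate responses.
rewards = np.array([0.1, 0.9, 0.3])
j_before = expected_reward(np.zeros(3), rewards)
theta = optimize_policy(rewards)
j_after = expected_reward(theta, rewards)
```

After optimization the policy places most of its probability on the highest-reward response, which is the behavior that the iterative loop of feedback collection, reward modeling, and policy updates aims for.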

Several significant research contributions have advanced the understanding and application of RLHF in aligning language models with ethical standards. These studies collectively highlight the versatility and effectiveness of RLHF in various AI applications.

Christiano et al. [28] introduced the concept of using human feedback to train reinforcement learning agents. They demonstrated that human preferences could be effectively used to shape agent behavior, highlighting the potential of RLHF for aligning AI with human values. Building on this foundation, Stiennon et al. [29] extended the RLHF approach to language models, presenting a method to fine-tune GPT-3 using human feedback. Their results showed significant improvements in the quality and safety of generated text, validating the effectiveness of RLHF in NLP applications.

In further exploration of language models, Ziegler et al. [30] explored the use of human feedback to fine-tune language models for content generation. They developed a reward model based on human preferences and used it to guide the fine-tuning process, resulting in more aligned and coherent outputs. Addressing the scalability of RLHF, Wu et al. [31] examined its application to large-scale language models. They proposed techniques to efficiently collect and utilize human feedback, demonstrating the feasibility of RLHF for training models with billions of parameters.

Moreover, Hancock et al. [32] showed that human feedback could be used to train chatbots to generate more helpful and engaging responses, improving user satisfaction. Bai et al. [33] proposed techniques to address the challenges of reward modeling in RLHF, such as feedback sparsity and ambiguity. They introduced methods to aggregate and interpret human feedback more effectively, enhancing the robustness of RLHF systems.

Lastly, Leike et al. [34] applied RLHF to train AI agents in complex environments, using human feedback to shape agent policies. Their work demonstrated the versatility of RLHF across different domains, including robotics and game-playing. Irving et al. [35] proposed guidelines for collecting and incorporating feedback to ensure AI systems behave responsibly. These contributions collectively underscore the potential of RLHF to create AI systems that are both effective and aligned with human values. By leveraging human feedback, RLHF allows for the continuous improvement of model behavior, ensuring that AI outputs are both high-quality and ethically sound.

2.2 Jailbreaking process of Large Language and Vision-Language Models

In the context of machine learning, jailbreaking refers to the process of circumventing the built-in safety mechanisms and ethical constraints of models to exploit their vulnerabilities. This can lead to the generation of unintended or harmful outputs. This section delves into the techniques for jailbreaking LLMs and VLMs, illustrating the methods and the theoretical framework behind these adversarial attacks.

2.2.1 Jailbreaking Large Language Models

Jailbreaking LLMs involves manipulating input sequences to bypass the model’s safety mechanisms and generate unintended or harmful outputs. Autoregressive LLMs predict the next token in a sequence as $p(\mathbf{x}_{n+1}|\mathbf{x}_{1:n})$. The objective of jailbreak attacks is to craft input sequences, $\hat{\mathbf{x}}_{1:n}$, that lead to outputs $\tilde{\mathbf{x}}_{1:n}$ which would normally be filtered or rejected by the model’s safety mechanisms. The probability of the output sequence can be quantified as:

$$p(\mathbf{y}|\mathbf{x}_{1:n})=\prod_{i=1}^{m}p(\mathbf{x}_{n+i}|\mathbf{x}_{1:n+i-1}), \tag{6}$$

where $\mathbf{y}$ represents the sequence $\tilde{\mathbf{x}}_{1:n}$ and $m$ is the length of the output sequence generated from the manipulated input $\hat{\mathbf{x}}_{1:n}$.
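The factorization in Eq. (6) can be checked numerically on a toy model. The sketch below assumes a hypothetical three-token vocabulary and a hand-written next-token table `P` that conditions only on the last token; neither comes from the survey, and a real LLM conditions on the full context. It accumulates log-probabilities token by token and confirms that the probabilities of all length-2 continuations sum to one.

```python
import itertools
import math

# Hypothetical next-token table: P[prev][next]; each row sums to 1.
P = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.4, "b": 0.4, "c": 0.2},
}

def next_token_prob(context, token):
    # p(x_{n+1} | x_{1:n}) for a toy model that only looks at the last token.
    return P[context[-1]][token]

def sequence_log_prob(prompt, output):
    # Log of Eq. (6): sum of log conditional probabilities over output tokens.
    context = list(prompt)
    log_p = 0.0
    for token in output:
        log_p += math.log(next_token_prob(context, token))
        context.append(token)  # each new token extends the context
    return log_p

prompt = ["a", "b"]
p_ca = math.exp(sequence_log_prob(prompt, ["c", "a"]))  # 0.3 * 0.4
total = sum(
    math.exp(sequence_log_prob(prompt, list(out)))
    for out in itertools.product("abc", repeat=2)
)
```

Working in log space, as here, is the usual way to evaluate such products without numerical underflow when the output is long.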

In this framework, each token $\mathbf{x}_{n+i}$ in the output sequence depends on the preceding tokens $\mathbf{x}_{1:n+i-1}$. By carefully crafting the input sequence $\hat{\mathbf{x}}_{1:n}$, an adversary can influence the conditional probabilities $p(\mathbf{x}_{n+i}|\mathbf{x}_{1:n+i-1})$ to increase the likelihood of generating harmful outputs. The adversarial goal can be expressed as maximizing the probability of the harmful output sequence:

$$\tilde{\mathbf{x}}_{1:n}=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}_{1:n}\in\mathcal{A}(\hat{\mathbf{x}}_{1:n})}\prod_{i=1}^{m}p(\mathbf{x}_{n+i}|\mathbf{x}_{1:n+i-1}), \tag{7}$$

where $\mathcal{A}(\hat{\mathbf{x}}_{1:n})$ is the distribution or set of possible jailbreak instructions, subject to constraints that define what constitutes a harmful output. By solving this optimization problem, the adversary identifies input sequences that exploit the model’s vulnerabilities and bypass its safety mechanisms.

To further elaborate on the mechanics of these attacks, we introduce the following steps involved in a typical jailbreak:

Input Manipulation: The adversary crafts a sequence $\hat{\mathbf{x}}_{1:n}$ by identifying tokens that, when fed into the model, modify the model’s internal state in a way that biases it towards generating harmful or unintended outputs.

Sequence Prediction: Given the manipulated input $\hat{\mathbf{x}}_{1:n}$, the model predicts the next token $\hat{\mathbf{x}}_{n+1}$ based on the probability distribution $p(\hat{\mathbf{x}}_{n+1}|\hat{\mathbf{x}}_{1:n})$. This process is iterated to produce the sequence $\tilde{\mathbf{x}}_{1:n}$.

Probabilistic Manipulation: The adversary aims to maximize the joint probability of the harmful output sequence by influencing each conditional probability $p(\hat{\mathbf{x}}_{n+i}|\hat{\mathbf{x}}_{1:n+i-1})$. This is achieved through a combination of trial-and-error and heuristic-based methods to identify the most effective $\hat{\mathbf{x}}_{1:n}$.

Optimization Problem: The process of finding the optimal $\hat{\mathbf{x}}_{1:n}$ can be framed as an optimization problem where the objective is to find the sequence that maximizes the likelihood of harmful outputs:

$$\hat{\mathbf{x}}_{1:n}^{*}=\operatorname*{arg\,max}_{\hat{\mathbf{x}}_{1:n}}\prod_{i=1}^{m}p(\hat{\mathbf{x}}_{n+i}|\hat{\mathbf{x}}_{1:n+i-1}). \tag{8}$$

In practice, solving this optimization problem can involve techniques such as gradient-based optimization, reinforcement learning, or evolutionary algorithms to systematically explore the input space and identify sequences that lead to the desired adversarial outcomes.
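A supplementary observation, not part of the survey’s own formulation: because the logarithm is strictly increasing, the maximizer of the product in Eq. (8) is also the maximizer of the corresponding sum of log-probabilities, assuming every conditional probability is strictly positive (which holds for softmax outputs).

```latex
\begin{align*}
\hat{\mathbf{x}}_{1:n}^{*}
  &= \operatorname*{arg\,max}_{\hat{\mathbf{x}}_{1:n}}
     \prod_{i=1}^{m} p(\hat{\mathbf{x}}_{n+i} \mid \hat{\mathbf{x}}_{1:n+i-1}) \\
  &= \operatorname*{arg\,max}_{\hat{\mathbf{x}}_{1:n}}
     \sum_{i=1}^{m} \log p(\hat{\mathbf{x}}_{n+i} \mid \hat{\mathbf{x}}_{1:n+i-1})
\end{align*}
```

The sum form is numerically better behaved than the product when $m$ is large, which is one reason likelihood objectives are usually written in log space.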

2.2.2 Jailbreaking Vision-Language Models

Jailbreaking VLMs involve bypassing the safety mechanisms and ethical constraints implemented in these models to exploit vulnerabilities and elicit unintended or harmful outputs. VLMs integrate both visual and textual data to generate responses or make predictions based on the combined understanding of images and text.

Similar to LLMs, VLMs can be manipulated by adversaries to produce harmful or unintended outputs. We focus on VLMs that generate textual descriptions or responses based on input images and accompanying text sequences. The goal of these attacks is to manipulate input data, 𝐯^1:nsubscript^𝐯:1𝑛\hat{\mathbf{v}}_{1:n}over^ start_ARG bold_v end_ARG start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT (for visual input) and 𝐱^1:nsubscript^𝐱:1𝑛\hat{\mathbf{x}}_{1:n}over^ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT (for textual input), in such a way that the model generates outputs 𝐲~1:nsubscript~𝐲:1𝑛\tilde{\mathbf{y}}_{1:n}over~ start_ARG bold_y end_ARG start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT that would normally be filtered or rejected by the model’s safety mechanisms.

To quantify the probability of the output sequence, we use the following formulation:

p(𝐲|𝐯1:n,𝐱1:n)=i=1mp(𝐲n+i|𝐯1:n,𝐱1:n+i1),𝑝conditional𝐲subscript𝐯:1𝑛subscript𝐱:1𝑛superscriptsubscriptproduct𝑖1𝑚𝑝conditionalsubscript𝐲𝑛𝑖subscript𝐯:1𝑛subscript𝐱:1𝑛𝑖1p(\mathbf{y}|\mathbf{v}_{1:n},\mathbf{x}_{1:n})=\prod_{i=1}^{m}p(\mathbf{y}_{n% +i}|\mathbf{v}_{1:n},\mathbf{x}_{1:n+i-1}),italic_p ( bold_y | bold_v start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_p ( bold_y start_POSTSUBSCRIPT italic_n + italic_i end_POSTSUBSCRIPT | bold_v start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 1 : italic_n + italic_i - 1 end_POSTSUBSCRIPT ) , (9)

where $\mathbf{y}$ represents the sequence $\tilde{\mathbf{y}}_{1:n}$ and $m$ is the length of the output sequence generated from the manipulated inputs $\hat{\mathbf{v}}_{1:n}$ and $\hat{\mathbf{x}}_{1:n}$.

In this framework, each token $\mathbf{y}_{n+i}$ in the output sequence depends on the preceding visual inputs $\mathbf{v}_{1:n}$ and the preceding tokens $\mathbf{x}_{1:n+i-1}$. By carefully crafting the visual input sequence $\hat{\mathbf{v}}_{1:n}$ and textual input sequence $\hat{\mathbf{x}}_{1:n}$, an adversary can influence the conditional probabilities $p(\mathbf{y}_{n+i}|\mathbf{v}_{1:n},\mathbf{x}_{1:n+i-1})$ to increase the likelihood of generating harmful outputs.
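The factorization in Eq. (9) can be checked numerically on a toy case. The sketch below is only an illustration: the per-token probabilities are made-up placeholders rather than outputs of any real VLM, and the helper name `sequence_log_prob` is hypothetical. It sums log-probabilities instead of multiplying raw probabilities, which is the usual way to avoid numerical underflow for long sequences.

```python
import math


def sequence_log_prob(step_probs):
    """Return the log of the product of per-token conditional probabilities."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each conditional probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)


# Placeholder values standing in for p(y_{n+i} | v_{1:n}, x_{1:n+i-1}), with m = 4.
step_probs = [0.9, 0.5, 0.8, 0.25]

log_p = sequence_log_prob(step_probs)
joint_p = math.exp(log_p)  # equals 0.9 * 0.5 * 0.8 * 0.25 = 0.09

print(f"log p(y | v, x) = {log_p:.4f}")
print(f"p(y | v, x)     = {joint_p:.4f}")
```

Because every factor is at most 1, each additional token can only keep or lower the joint probability, which is why the objective is usually compared across candidates of the same output length.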

The adversarial goal can be expressed as maximizing the probability of the harmful output sequence:

$$\tilde{\mathbf{y}}_{1:n}=\operatorname*{arg\,max}_{\tilde{\mathbf{y}}_{1:n}\in\mathcal{A}(\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n})}\prod_{i=1}^{m}p(\mathbf{y}_{n+i}|\mathbf{v}_{1:n},\mathbf{x}_{1:n+i-1}), \tag{10}$$

where $\mathcal{A}(\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n})$ is the distribution or set of possible jailbreak instructions, subject to constraints that define what constitutes a harmful output. By solving this optimization problem, the adversary identifies input sequences that exploit the model’s vulnerabilities and bypass its safety mechanisms.

The steps involved in a typical jailbreak of a VLM include:

Visual Input Manipulation: The adversary crafts a sequence $\hat{\mathbf{v}}_{1:n}$ by identifying images or visual features that, when fed into the model, modify the model’s internal state in a way that biases it towards generating harmful or unintended outputs.

Textual Input Manipulation: In conjunction with visual manipulation, the adversary crafts a sequence $\hat{\mathbf{x}}_{1:n}$ by identifying tokens or phrases that further bias the model’s internal state towards generating harmful outputs.

Multimodal Sequence Prediction: Given the manipulated visual and textual inputs $\hat{\mathbf{v}}_{1:n}$ and $\hat{\mathbf{x}}_{1:n}$, the model predicts the next token $\hat{\mathbf{y}}_{n+1}$ based on the probability distribution $p(\hat{\mathbf{y}}_{n+1}|\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n})$. This process is iterated to produce the sequence $\tilde{\mathbf{y}}_{1:n}$.

Probabilistic Manipulation: The adversary aims to maximize the joint probability of the harmful output sequence by influencing each conditional probability $p(\hat{\mathbf{y}}_{n+i}|\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n+i-1})$. This is achieved through a combination of trial-and-error and heuristic-based methods to identify the most effective $\hat{\mathbf{v}}_{1:n}$ and $\hat{\mathbf{x}}_{1:n}$.

Optimization Problem: The process of finding the optimal $\hat{\mathbf{v}}_{1:n}$ and $\hat{\mathbf{x}}_{1:n}$ can be framed as an optimization problem where the objective is to find the input sequences that maximize the likelihood of harmful outputs:

$$(\hat{\mathbf{v}}_{1:n}^{*},\hat{\mathbf{x}}_{1:n}^{*})=\operatorname*{arg\,max}_{\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n}}\prod_{i=1}^{m}p(\hat{\mathbf{y}}_{n+i}|\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n+i-1}). \tag{11}$$

As with LLMs, solving this optimization problem for VLMs can involve techniques such as gradient-based optimization, reinforcement learning, or evolutionary algorithms to systematically explore the input space and identify sequences that lead to the desired adversarial outcomes.
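Since the logarithm is strictly increasing, the product objective in Eq. (11) has the same maximizer as a sum of log-probabilities, and that additive form is what optimizers typically work with. The restatement below is not given in the survey; it is a standard identity written with the same symbols, and it assumes every conditional probability is strictly positive.

```latex
(\hat{\mathbf{v}}_{1:n}^{*},\hat{\mathbf{x}}_{1:n}^{*})
  = \operatorname*{arg\,max}_{\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n}}
    \sum_{i=1}^{m}\log p(\hat{\mathbf{y}}_{n+i}|\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n+i-1})
  = \operatorname*{arg\,min}_{\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n}}
    \left(-\sum_{i=1}^{m}\log p(\hat{\mathbf{y}}_{n+i}|\hat{\mathbf{v}}_{1:n},\hat{\mathbf{x}}_{1:n+i-1})\right)
```

The right-hand form is the familiar negative log-likelihood (cross-entropy) loss, evaluated on the target output sequence.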

3 Threats in Large Language Models

3.1 Jailbreak Strategies on Large Language Models

As LLMs become increasingly prevalent in real-world applications, research efforts on jailbreaking these models have diversified. These efforts can be broadly categorized into five main types: Gradient-based, Evolutionary-based, Demonstration-based, Rule-based, and Multi-Agent-based jailbreaks.

  1. Gradient-based Jailbreaks: These jailbreaks exploit the gradients of the model to adjust inputs, creating prompts that compel LLMs to produce harmful responses. This method leverages optimization techniques on the model’s gradients, as seen in the Greedy Coordinate Gradient [36] and AutoDAN [37] methods, to develop highly transferable adversarial suffixes.

  2. Evolutionary-based Jailbreaks: These methods generate adversarial prompts utilizing genetic algorithms and evolutionary strategies. For example, FuzzLLM [38] and GPTFUZZER [39] systematically optimize for semantic similarity, attack effectiveness, and fluency, making them effective in black-box environments.

  3. Demonstration-based Jailbreaks: These jailbreaks rely on crafting specific, static system prompts to direct LLM responses. By using hard-coded instructions, such as those in the DAN [40] and MJP [41] methods, these jailbreaks aim to guide LLMs to produce the desired responses.

  4. Rule-based Jailbreaks: These involve decomposing and redirecting malicious prompts through predefined rules to evade detection. Techniques like ReNeLLM [42] and CodeAttack [43] employ systematic transformations of malicious intents into benign-looking inputs, ensuring that the model produces the desired outputs while avoiding detection.

  5. Multi-agent-based Jailbreaks: These jailbreaks depend on the cooperation of multiple LLMs to iteratively refine and enhance jailbreak prompts. Methods such as PAIR [44] and GUARD [45] use feedback mechanisms and the collaboration of multiple models to optimize and improve the effectiveness of jailbreak strategies.

The overall framework of jailbreaks on LLMs is illustrated in Fig. 3.

Figure 3: Overview of Jailbreak Strategies for LLMs: This figure delineates the five principal approaches to jailbreaking LLMs. Gradient-based Jailbreaks exploit model gradients to create prompts that compel LLMs to produce harmful responses. Evolutionary-based Jailbreaks utilize genetic algorithms and evolutionary strategies to generate effective adversarial prompts. Demonstration-based Jailbreaks craft specific, static system prompts to direct LLM responses toward desired outcomes. Rule-based Jailbreaks decompose and redirect malicious prompts through predefined rules to evade detection and produce intended outputs. Multi-agent-based Jailbreaks rely on the cooperation of multiple LLMs to iteratively refine and enhance jailbreak prompts.

3.1.1 Gradient-based Jailbreaks

Gradient-based methods adjust model inputs using gradients to prompt models to yield compliant responses to harmful commands. An example from AutoDAN [37] is illustrated in Fig. 4, where gradient-based optimization generates candidate tokens, resulting in readable prompts and achieving high attack success rates.

Figure 4: An example of gradient-based jailbreaks. The process begins with the selection of an initial token from the vocabulary, followed by gradient-based optimization to generate candidate tokens. The left box within the blue box represents the candidate tokens that need to be selected, while the right box represents the tokens selected after one optimization process. These candidate tokens are iteratively refined through left-to-right generation until the desired malicious response is achieved, ensuring convergence and concatenation to form the final harmful output.

As a pioneer in this field, Zou et al. [36] propose a Greedy Coordinate Gradient (GCG) technique that generates a suffix which, when attached to a broad spectrum of queries directed at a targeted LLM, produces objectionable content. The suffix is calculated by greedy search from random initialization to maximize the likelihood that the model produces an affirmative response. Notably, the suffixes are highly transferable across different black-box, publicly available, production-grade LLMs. Following them, AutoDAN [37] further improves the interpretability of generated suffixes via perplexity regularization. It uses gradients to generate diverse tokens from scratch, resulting in readable prompts that are capable of circumventing perplexity-based filters while still achieving high rates of attack success. Jones et al. [46] introduced ARCA, a method that iteratively maximizes an objective by selectively updating a token within the prompt or output, with the rest of the tokens remaining unchanged. This approach audits objectives that amalgamate unigram models, perplexity measures, and fixed prompt prefixes, aiming to generate examples that closely adhere to the desired target behavior.

Unlike these white-box methods, Sitawarin et al. [47] introduce the Proxy-Guided Attack on LLMs (PAL), an optimization-based black-box strategy for eliciting harmful responses from LLMs. PAL leverages a proxy model to guide the optimization process and employs a novel loss function designed for real-world LLM APIs.

3.1.2 Evolutionary-based Jailbreaks

Evolutionary-based methods are designed to manipulate LLMs in scenarios where direct access to the model’s architecture and parameters is not available. As shown in Fig. 5, these methods leverage genetic algorithms and evolutionary strategies to systematically develop adversarial prompts and suffixes, optimizing for semantic similarity, attack effectiveness, and fluency so that the LLM is led to produce potentially harmful outputs.

Figure 5: An example of evolutionary-based jailbreaks. The process begins with an attacker providing a prototype prompt that initializes the model by setting aside previous guidelines. This initialization phase is followed by a fitness evaluation, where responses are assessed for their alignment with malicious intent. The hierarchical genetic policy phase then employs paragraph-level and sentence-level crossover, along with LLM-based mutations, to refine and optimize the prompts. This iterative process continues until a harmful response is successfully produced.

Lapid et al. [48] use a Genetic Algorithm (GA) as the optimization technique, with the cosine similarity between the model’s output embedding representations and the target output embedding representations as the fitness function. By leveraging the GA’s ability to navigate complex solution spaces, this approach systematically evolves adversarial suffixes that, when appended to inputs, manipulate the model’s output to align more closely with the desired adversarial target. Yao et al. [38] introduced FuzzLLM, which adapts the fuzz testing technique commonly used in cybersecurity and decomposes jailbreak strategies into three distinct components: template, constraint, and problem set. Adversarial attack instructions are then generated from different random combinations of these three components. Yu et al. [39] developed GPTFUZZER, a tool built on the concept of mutation: it starts from human-crafted templates as seeds and mutates these seeds to generate novel templates. GPTFUZZER is structured around three primary elements: a seed selection strategy for balancing efficiency and variability, mutation operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. Liu et al. [49] apply LLM-based genetic algorithms at both the sentence level and the paragraph level, designing crossover and mutation functions that optimize manually designed DAN prompts [40].
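The cosine-similarity fitness signal mentioned for Lapid et al. [48] can be illustrated in isolation. The sketch below is not their implementation: the vectors are small made-up placeholders standing in for embedding representations, no language model is involved, and the helper names `cosine_similarity` and `rank_by_fitness` are hypothetical. It only shows how candidates would be scored and ranked against a target embedding.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_fitness(candidates, target):
    """Sort (name, embedding) pairs from most to least similar to the target embedding."""
    return sorted(
        candidates,
        key=lambda item: cosine_similarity(item[1], target),
        reverse=True,
    )


# Placeholder embeddings; a real system would obtain these from an embedding model.
target = [1.0, 0.0, 1.0]
candidates = [
    ("orthogonal", [0.0, 1.0, 0.0]),
    ("opposite", [-1.0, 0.0, -1.0]),
    ("close", [1.0, 0.1, 0.9]),
]

ranked = rank_by_fitness(candidates, target)
for name, embedding in ranked:
    print(f"{name:10s} fitness = {cosine_similarity(embedding, target):+.4f}")
```

Cosine similarity ignores vector magnitude, so the score reflects only direction in embedding space; a genetic algorithm would keep the higher-ranked candidates for the next generation.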

Li et al. [50] introduce Semantic Mirror Jailbreak (SMJ), leveraging a genetic algorithm to balance semantic similarity and attack effectiveness in crafting jailbreak prompts for LLMs. By initiating with paraphrased questions as the genetic population, SMJ ensures the prompts’ semantic alignment with the original queries. The optimization process, guided by fitness evaluations of both semantic similarity and jailbreak validity, evolves prompts that mirror the original questions while maintaining high attack success rates (ASR). This dual-objective approach not only enhances the stealthiness of the prompts against semantic-based defenses but also significantly improves ASR, validating SMJ’s efficacy in bypassing advanced LLM defenses.

Wang et al. [51] propose the Adversarial Suffixes Embedding Translation Framework (ASETF), which transforms non-readable adversarial suffixes into coherent text through an embedding translation technique. This process leverages a dataset derived from Wikipedia, embedding contextual information into text snippets for training. By fine-tuning the model on this dataset, ASETF converts adversarial embeddings back to text, enhancing the fluency and understanding of prompts designed to bypass LLM defenses. The method proves effective across various LLMs, including black-box models like ChatGPT and Gemini, by generating high-fluency adversarial suffixes that are less detectable by conventional defenses and enriching the semantic diversity of attack prompts.

Xiao et al. [52] introduce TASTLE, a framework for automating red teaming against LLMs, utilizing an iterative optimization algorithm that combines malicious content concealing and memory reframing. This method capitalizes on the distractibility and over-confidence of LLMs to bypass their defenses by splitting the input into a jailbreak template and a malicious query. TASTLE employs an attacker LLM to generate jailbreak templates, which are then optimized through responses from the target LLM and assessments by a judgment model. This optimization refines the prompts to effectively shift the model’s focus to the malicious content, demonstrating high effectiveness, scalability, and transferability across various LLMs, including proprietary models like ChatGPT and GPT-4.

Liu et al. [53] introduce DRA (Disguise and Reconstruction Attack), a black-box jailbreak approach exploiting bias vulnerabilities in LLMs’ safety fine-tuning. DRA employs a threefold strategy: concealing malicious instructions within queries to evade LLM detection, compelling the LLM to reconstruct these instructions in its outputs, and manipulating the contextual framework to aid this reconstruction. This method, inspired by traditional software security’s shellcode techniques, effectively bypasses LLMs’ internal safeguards, leading to a high success rate in generating harmful content.

Instead of concentrating on optimizing universal adversarial prompts, an alternative approach to jailbreaking aligned LLMs involves optimizing unique parameters. Huang et al. [54] adopted this strategy by altering decoding methods, including temperature settings and sampling techniques, without the necessity for attack prompts, to compromise the integrity of aligned LLMs. This method demonstrates a novel angle of attack by directly manipulating the model’s decoding process to elicit non-compliant outputs. However, the applicability of this technique is limited when dealing with black-box LLMs, as users lack the ability to modify essential decoding configurations, such as the choice of sampling method.
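To make the role of decoding settings concrete, the sketch below shows how a temperature parameter reshapes a next-token distribution. It is a generic illustration of temperature-scaled softmax, not the procedure of Huang et al. [54]; the logits and temperature values are made-up placeholders and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by a positive temperature."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)   # high temperature: mass spreads to lower-ranked tokens

print("T = 0.5:", [round(p, 3) for p in sharp])
print("T = 2.0:", [round(p, 3) for p in flat])
```

Raising the temperature gives lower-ranked tokens a larger share of the probability mass, so sampled outputs vary more from run to run. This is why the choice of decoding configuration can change model behavior without any change to the prompt.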

3.1.3 Demonstration-based Jailbreaks

Demonstration-based methods focus on creating a specific system prompt that instructs LLMs on the desired response mechanism. These methods are characterized as hard-coded, meaning the prompt is meticulously crafted for a particular purpose and remains constant across different queries. This approach relies on the strategic design of the prompt to guide the LLMs’ response for demonstration purposes, without adapting or evolving the prompt based on the query’s context. A well-known jailbreak prompt, DAN [40], serves as an illustration, as shown in Fig. 6.

Different researchers have proposed various methods to exploit the vulnerabilities of LLMs. Li et al. [41] proposed MJP, which aims to relieve LLMs’ ethical considerations and force LLMs to recover personal information. This method integrates jailbreaking prompts within a three-utterance interaction between the user and ChatGPT. Initially, they assume the role of the user to input the jailbreaking prompt. Subsequently, they impersonate the assistant (ChatGPT) to signify that jailbreak mode has been activated. Following this, they revert to the user’s role to pose questions to the assistant using previous direct prompts. Additionally, to counter ChatGPT’s potential reluctance to divulge email addresses or respond to queries due to ethical constraints, they incorporate an extra sentence in the final user inquiry to encourage ChatGPT to venture a random guess in scenarios where it either lacks the information or is ethically barred from responding.

Figure 6: An example of demonstration-based methods, where the red box within the blue box is the demonstration prompt from DAN, which is hard-coded and instructs LLMs on the desired response mechanism.

Wei et al. [55] capitalized on the in-context learning capabilities of LLMs by incorporating additional harmful prompts along with their corresponding answers as examples ahead of each malicious query. This method makes LLMs more likely to comply with the malicious intent of the query. Schulhoff et al. [56] took a different approach by designing a global prompt that instructs LLMs to ignore the pre-set instructions, effectively bypassing any ethical or safety constraints.

Expanding on these ideas, Li et al. [57] exploited the personification capabilities of LLMs to construct nested scene prompts. By engaging the LLM in a complex, multi-layered context, this prompt effectively manipulates the model’s response behavior, allowing for the bypass of restrictions without direct confrontation with the model’s built-in safeguards. Similarly, Shah et al. [58] guided the model towards embodying a particular personality predisposed to acquiescing to harmful directives through their system prompt. This method leverages the model’s capacity for role adoption, effectively manipulating its response behavior by aligning it with a persona that is less constrained by ethical or safety guidelines.

Liu et al. [59] developed a structured approach to craft prompts, focusing on three key dimensions: contents, attacking methods, and goals. This strategy aims to prompt LLMs to produce unexpected outputs through the use of prompt attack templates alongside content that is of broad interest and concern for potential vulnerabilities. Mangaokar et al. [60] devised a sophisticated attack method targeting LLMs equipped with guardrail models by deploying a two-step prefix-based strategy. Initially, it computes a universal adversarial prefix that compromises the guardrail model’s detection capabilities, rendering any input non-harmful. Subsequently, this prefix is propagated to elicit a harmful response from the primary LLM, exploiting its in-context learning to bypass the guardrail model’s defenses. This approach highlights a critical vulnerability in LLM defenses, suggesting the necessity for advanced protective measures against such targeted attacks.

3.1.4 Rule-based Jailbreaks

Unlike demonstration-based methods, which directly input questions into LLMs, rule-based methods are designed to decompose the malicious component from the original prompt and redirect it through alternative means using defined rules. Attackers often design intricate rules to conceal the malicious component. One example is illustrated in Fig. 7, where a jailbreak prompt is encoded using word substitution [61].

Kang et al. [62] utilize string concatenation, variable assignment, and sequential composition to decompose malicious prompts into two separate components, which are then reassembled to form a cohesive prompt. Wang et al. [63] initiate adversarial attacks targeting the predictions of LLMs by altering in-context learning demonstrations. They employ a strategy that involves mapping critical words to other semantically similar words, as determined by cosine similarity. This technique subtly modifies the context provided to the LLM, leading it to produce different outputs than it would under normal circumstances. Ding et al. [42] conceptualized jailbreak prompt attacks through two primary mechanisms: Prompt Rewriting and Scenario Nesting, leading to the development of ReNeLLM. Prompt Rewriting is designed to decompose malicious prompts into benign ones without altering their intended meaning. Scenario Nesting, on the other hand, involves the integration of various output formats to direct LLMs towards a specific response pattern. By combining these two approaches, ReNeLLM aims to navigate around the constraints and safety mechanisms of LLMs, prompting them to generate the desired outputs through strategic input manipulation. Deng et al. [64] conducted an analysis of existing jailbreak strategies to identify and decompose the underlying attack patterns. Based on this analysis, they proposed MasterKey, an approach that involves training a model specifically to learn from these decomposed effective attack patterns. The objective of MasterKey is to automatically generate new attacks that are capable of circumventing the defense mechanisms employed by four commercial LLM systems.

Figure 7: An example of rule-based jailbreaks, where the attacker defines a decomposition rule (shown in the blue box) to map malicious intentions to normal ones, ultimately generating a response that answers the user’s question.

Ren et al. [43] introduce CodeAttack, a framework that tests the safety generalization of LLMs by converting natural language inputs into code inputs. CodeAttack employs a novel template that includes Input Encoding, Task Understanding, and Output Specification to reformulate text completion into code completion tasks. This method systematically uncovers a common safety vulnerability across various LLMs, such as GPT-4, Claude-2, and Llama-2 series, revealing that these models fail to generalize safety measures to code inputs, bypassing safety guardrails over 80% of the time.

Lv et al. [65] propose CodeChameleon, which integrates personalized encryption tactics within a jailbreak framework. This method circumvents LLMs’ intent security recognition phase by transforming tasks into code completion formats and encrypting queries with personalized functions. To ensure the LLMs can accurately execute the original encrypted queries, CodeChameleon incorporates a decryption function within the instructions. This method highlights the potential for encrypted queries to bypass LLM security protocols systematically.

Li et al. [66] systematically decompose harmful prompts into sub-prompts, then reconstruct them in a way that conceals their malicious intent. This method employs three critical steps: (1) “Decomposition” breaks down the original prompt into more neutral sub-prompts using semantic parsing, (2) “Reconstruction” reassembles these sub-prompts through in-context learning with semantically similar but harmless contexts, and (3) “Synonym Search” identifies synonyms for sub-prompts to maintain the original intent while evading detection. This approach not only obscures the malicious nature of prompts from LLMs but also significantly enhances the attack’s success rate, as demonstrated by achieving a 78.0% success rate on GPT-4 with minimal queries.

Handa et al. [61] introduce a cryptographic approach to jailbreaking LLMs by encoding prompts using simple yet effective ciphers like word substitution. This technique obfuscates harmful content, allowing it to bypass LLMs’ ethical alignments undetected. More recently, Jin et al. [67] address the limitations of moderation guardrails in the OpenAI API, which sometimes filter out legitimate outputs. They introduce JAMBench, a benchmark for testing these guardrails across four critical areas: Hate and Fairness, Sexual Content, Violence, and Self-Harm. Additionally, they propose the Jailbreak against Moderation (JAM) method to bypass these filters by manipulating input prefixes, refining a model to mimic the API’s filtering, and using specially crafted characters to reduce the harmfulness score of responses. The study also discusses potential defenses against such bypass techniques.

3.1.5 Multi-agent-based Jailbreaks

Multi-agent-based methods adapt their attack strategies based on feedback obtained from querying LLMs, using the cooperation of multiple LLMs to enhance effectiveness. For instance, in Jin et al.’s work [45], as illustrated in Fig. 8, multiple LLMs participate in generating questions, organizing jailbreak prompts, evaluating the effectiveness of these jailbreaks, and providing feedback to improve the prompts.

Figure 8: Multi-Agent based Jailbreaks illustration, which includes generating question prompts, setting playing scenarios, assessing prompts, and improving jailbreak prompts, all achieved automatically by cooperation with multiple LLMs.

Chao et al. [44], drawing inspiration from social engineering attacks, used attacker LLMs to autonomously generate jailbreak prompts for a targeted LLM, thereby eliminating the need for human intervention. They proposed Prompt Automatic Iterative Refinement (PAIR), which leverages previous prompts and responses to iteratively refine candidate prompts within a chat format. Additionally, PAIR generates an improvement value, enhancing interpretability and facilitating chain-of-thought reasoning. Jin et al. [45] introduced GUARD, which employs role-playing to jailbreak well-aligned LLMs. In this strategy, four roles are assigned: Translator, Generator, Evaluator, and Optimizer, each contributing to a cohesive effort to jailbreak LLMs. GUARD uses the European Union’s guidelines for trustworthy AI as a basis for generating malicious prompts that assess the model’s compliance with these guidelines. Deng et al. [68] leveraged in-context learning to guide LLMs in emulating human-generated attack prompts. Their approach begins with the establishment of a prompt set composed of manually crafted high-quality attack prompts. Utilizing an attack LLM, they then generate new prompts through in-context learning and subsequently assess the quality of these generated prompts. High-quality prompts are incorporated into the attack prompt set, enhancing its effectiveness. This process is iterated until a robust collection of attack prompts is amassed. Through this method, Deng et al. aim to systematically refine and expand the repository of attack prompts, improving the LLM’s capability to generate potent attack vectors within given contexts.

Hayase et al. [69] directly construct adversarial examples using API access to target LLMs. The innovation lies in refining the GCG attack process [36] into a more efficient, query-only method that eliminates the need for surrogate models, thereby streamlining the creation of adversarial inputs.

3.2 Defense Mechanisms for Large Language Models

In response to jailbreak attacks on LLMs, researchers have developed various defense strategies. These can be generally categorized into six types: Prompt Detection-based, Prompt Perturbation-based, Demonstration-based, Generation intervention-based, Response evaluation-based, and Model fine-tuning-based defenses.

  1. Prompt Detection-based Defenses: These defenses protect LLMs by identifying potentially malicious input prompts. Detection strategies vary, including analysis of prompt properties such as perplexity and length [70, 71], as well as examination of prompt semantics using model gradients [72] as the key indicator.

  2. Prompt Perturbation-based Defenses: This category involves modifying input prompts to neutralize malicious intent. Techniques such as paraphrasing and retokenization disrupt the structure of jailbreak prompts [70], while various smoothing methods [73, 74, 75] are implemented to further mitigate risks.

  3. Demonstration-based Defenses: Analogous to Demonstration-based jailbreaks (Section 3.1.3), these defenses incorporate specific system prompts, such as self-reminders [76] and in-context safety example demonstrations [55], guiding LLMs towards safer responses.

  4. Generation Intervention-based Defenses: These strategies intervene in the response generation process of the LLM to ensure safety. For instance, RAIN [77] prompts LLMs to revisit the generation process if a response is deemed unsafe, whereas SafeDecoding [78] influences word choice during generation through adjusted probability distributions.

  5. Response Evaluation-based Defenses: In this approach, the harmfulness of LLM responses is assessed, often followed by iterative refinement based on this evaluation to derive safer outputs. Techniques such as Bergeron [79] involve an additional LLM for this process, while Kim et al. [80] leverage the target LLM itself for comprehensive evaluation and response adjustment.

  6. Model Fine-tuning-based Defenses: These defenses involve modifying the LLM itself to enhance safety. For example, MART [81] employs an adversarial framework for automatic red-teaming, while DINM [82] applies knowledge editing to rectify toxic biases within the model.

An overview of these defense mechanisms is illustrated in Fig. 9.

Figure 9: Defense Mechanisms against Jailbreaking in LLMs: Defense mechanisms in LLMs generally fall into six main types. Prompt Detection-based defenses identify potentially unsafe input prompts using varied strategies; Prompt Perturbation-based defenses perturb the prompts to neutralize jailbreak attempts; Demonstration-based defenses incorporate safety system prompts to guide LLMs towards secure responses; Generation Intervention-based defenses control the response generation process to ensure outputs are safe; Response Evaluation-based defenses assess and iteratively refine responses to achieve safety; Model Fine-tuning-based defenses adjust the LLM’s underlying model to enhance overall security.
Figure 10: An example of prompt detection-based defenses. The perplexity of the input prompt is evaluated using a perplexity calculator LLM. If the perplexity falls below a predefined threshold, the prompt is forwarded to the target LLM for a response. If it exceeds the threshold, the prompt is rejected. The perplexity calculator LLM can be the same as the target LLM.

3.2.1 Prompt Detection-based Defenses

Prompt detection-based defenses serve to identify malicious input prompts using various strategies, without altering the original input. An example is shown in Fig. 10.

This type of approach is one of the earliest responses to the widespread GCG attack [36], leveraging the characteristically high perplexity of prompts generated by such attacks. Initially, defenses such as those proposed by Jain et al. [70] evaluated prompt perplexity to assess potential harm. Further developing this method, Alon and Kamfonas [71] introduced a classifier that considers both the perplexity and the length of prompts when evaluating harmfulness.
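The perplexity check can be illustrated with a minimal sketch. It assumes that per-token log-probabilities are available from some scoring LLM; the `toy_score_tokens` function, the `COMMON` word list, and the threshold value are hypothetical stand-ins for illustration and are not taken from [70] or [71].

```python
import math
from typing import Callable, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_filter(
    prompt: str,
    score_tokens: Callable[[str], List[float]],
    threshold: float,
) -> bool:
    """Return True if the prompt may be forwarded to the target LLM."""
    return perplexity(score_tokens(prompt)) <= threshold


# Hypothetical stand-in for a scoring LLM: familiar words receive a high
# probability, anything else a very low one.
COMMON = {"how", "do", "i", "bake", "a", "cake"}


def toy_score_tokens(prompt: str) -> List[float]:
    return [
        math.log(0.2) if word.lower() in COMMON else math.log(0.001)
        for word in prompt.split()
    ]
```

In practice, `score_tokens` would be backed by a real language model, and the threshold would be calibrated on benign prompts so that ordinary queries are not rejected.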

Further advancements in this defense category include the analysis of prompt semantics through gradient evaluation. Xie et al. [72] calculate the loss from the LLM’s output probabilities by treating "Sure" as a ground truth for the initial response. They then backpropagate this loss to obtain gradients with respect to pre-selected model parameters deemed safety-critical. By comparing these gradients to those obtained from known unsafe prompts, which serve as a reference, they assess the safety of the input prompts.
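The comparison step can be sketched as follows, assuming the gradients have already been extracted and flattened into plain vectors. Using cosine similarity as the comparison measure and `0.5` as the threshold are assumptions made for illustration; the function names are hypothetical, and the gradient computation itself is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equally sized vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def looks_unsafe(
    prompt_grad: Sequence[float],
    unsafe_reference_grad: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off
) -> bool:
    """Flag a prompt whose gradient points in a direction similar to the unsafe reference."""
    return cosine_similarity(prompt_grad, unsafe_reference_grad) >= threshold
```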

3.2.2 Prompt Perturbation-based Defenses

Recognizing that jailbreaks often capitalize on the precise arrangement and combination of words within attack prompts and that these setups are susceptible to perturbations, researchers have developed strategies that actively modify the input to disrupt adversarial tactics. An example of this approach is depicted in Fig. 11.

Initial methods for countering jailbreaks involve perturbing the input prompt at the sentence or token level. Jain et al. [70] pioneered this approach by employing techniques such as paraphrasing and BPE-dropout retokenization [83] to alter the prompts. Inspired by the success of SmoothLLM [73], the smoothing technique, as illustrated in Fig. 11, has since gained widespread popularity. This process typically involves applying multiple perturbations to a prompt to generate several variants, each eliciting a response from the LLM. These responses are then classified as either “jailbroken” or “not jailbroken” based on the detection of specific target strings. After classifying the responses, a majority vote is conducted to determine the predominant classification—either ’jailbroken’ or ’not jailbroken’. The system then selects and outputs a response that aligns with this majority classification. The various methods primarily differ in their perturbation techniques. SmoothLLM [73] uses character-level perturbations to create multiple variations of the original prompt. SEMANTICSMOOTH [74] builds on this by ensuring that perturbations preserve the semantic integrity of the original query through carefully designed semantic transformations. Additionally, Kumar et al. [75] and Cao et al. [84] introduce alternative approaches by masking parts of the input to generate perturbations.
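The smoothing procedure described above (perturb, query, classify, vote, select) can be sketched in a few lines. This is a simplified illustration and not the implementation from [73]: the `toy_llm` and `toy_is_jailbroken` functions are hypothetical stand-ins, and the perturbation here always swaps in a different character, which is an assumption made so that the behavior is easy to reason about.

```python
import random
import string
from typing import Callable

ALPHABET = string.ascii_letters + string.digits + string.punctuation


def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace each character with a different random one with probability `rate`."""
    chars = list(prompt)
    for i, original in enumerate(chars):
        if rng.random() < rate:
            chars[i] = rng.choice([c for c in ALPHABET if c != original])
    return "".join(chars)


def smooth_respond(
    prompt: str,
    llm: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_copies: int = 5,
    rate: float = 0.1,
    seed: int = 0,
) -> str:
    """Query the LLM on perturbed copies and return a response that agrees with the majority vote."""
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    rng = random.Random(seed)
    responses = [llm(perturb(prompt, rate, rng)) for _ in range(n_copies)]
    votes = [bool(is_jailbroken(r)) for r in responses]
    majority = 2 * sum(votes) > len(votes)  # ties count as "not jailbroken"
    agreeing = [r for r, v in zip(responses, votes) if v == majority]
    return rng.choice(agreeing)


# Hypothetical toy target: it only complies when a brittle suffix is intact.
def toy_llm(prompt: str) -> str:
    if prompt.endswith("!!ADV!!"):
        return "Sure, here is ..."
    return "I'm sorry, I can't help with that."


def toy_is_jailbroken(response: str) -> bool:
    return response.startswith("Sure")
```

The toy target shows why this works: an attack that depends on an exact character sequence breaks in most perturbed copies, so the majority vote falls on the refusal.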

Shifting from sentence and token-level perturbations, Hu et al. [85] introduce an innovative approach by perturbing the input prompt at the embedding level. They discover that the gradient of the empirical acceptance rate for a prompt, with respect to its embedding, tends to be larger for jailbreak prompts than for normal prompts. This observation led to the development of Gradient Cuff, a method that uses gradients obtained from embedding-level perturbations as indicators to identify jailbreak prompts.

Figure 11: An example of prompt perturbation-based defenses based on smoothing. Initially, the input prompt is perturbed to generate multiple variants. Each variant is then processed by the LLM, which produces a response and the response is classified as either ’jailbroken’ or ’not jailbroken’ based on the presence of target strings. A majority vote is conducted to determine whether to output a response containing the target string. A response that matches the majority decision is subsequently selected as the final output.

3.2.3 Demonstration-based Defenses

Demonstration-based defenses, analogous to demonstration-based jailbreaks (Section 3.1.3), utilize crafted system prompts. However, these prompts now serve as safety prompts, guiding the LLM to recognize potential malicious intent and generate safe responses. An example is shown in Fig. 12.

Initial efforts in this domain have demonstrated the effectiveness of fixed safety prompts in improving the model’s adherence to safety protocols. For instance, Self-reminders [76], depicted in Fig. 12, incorporates prompts both before and after the user’s message to reinforce the model’s focus on producing safe responses. Wei et al. [55] exploit the LLM’s in-context learning capability by presenting a series of jailbreak examples to make the model aware of potential malicious prompts. Zhang et al. [86] highlight the inherent conflict in LLM objectives between helpfulness and safety, crafting prompts that compel the model to prioritize safety.
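A self-reminder style wrapper amounts to placing the user message between two fixed safety prompts. The sketch below illustrates that structure only; the reminder wording and the function name are hypothetical and are not the exact text used in [76].

```python
# Illustrative reminder texts (hypothetical wording).
DEFAULT_PREFIX = (
    "You should be a responsible assistant and should not generate "
    "harmful or misleading content."
)
DEFAULT_SUFFIX = (
    "Remember, you should be a responsible assistant and should not "
    "generate harmful or misleading content."
)


def wrap_with_self_reminder(
    user_message: str,
    prefix: str = DEFAULT_PREFIX,
    suffix: str = DEFAULT_SUFFIX,
) -> str:
    """Place a safety reminder before and after the user's message."""
    return f"{prefix}\n{user_message}\n{suffix}"
```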

Figure 12: An example of demonstration-based defenses, using a self-reminder as a safety prompt. The self-reminder prompts the model to be responsible and avoid generating harmful or misleading content in response to a user message. The self-reminder is reiterated to ensure the model adheres to safety guidelines.

The sophistication of these defenses has evolved to include dynamic adjustments to safety prompts based on the input’s context. Zhang et al. [87] have developed a method where the LLM assesses the intention behind the input prompt and uses this analysis as a dynamic safety prompt to enhance response safety. Similarly, Pisano et al. [79] utilize an auxiliary LLM to evaluate risks in input prompts and guide the primary LLM toward safer outputs.

The field has also witnessed significant advancements in the automated optimization of safety prompts, thus boosting the effectiveness of LLM defenses. Naturally evolving within this landscape, adversarial training frameworks have emerged. Prompt Adversarial Tuning (PAT) [88] is dedicated to the adversarial training of both attack and defense prompts. Robust Prompt Optimization (RPO) [89] expands upon this concept by adaptively selecting the most effective attack techniques during the training phase. Separate from adversarial training techniques, Zheng et al. [90] recently introduced Directed Representation Optimization (DRO). This method first identifies a direction of refusal in the low-dimensional representation space of the LLM by fitting a linear regression to the empirical refusal rates of known prompts. It then tailors the optimization of the safety prompts to steer harmful queries towards this direction of refusal, while directing harmless queries in the opposite direction.

3.2.4 Generation Intervention-based Defenses

Generation intervention-based defenses modify the original LLM response generation process to enhance safety. An example of such a defense, RAIN [77], is illustrated in Fig. 13.

Li et al. [77] introduce the Rewindable Auto-regressive INference (RAIN) method. In this approach, the LLM tentatively produces tokens and evaluates their safety. If tokens are deemed safe, they are retained, and the generation process continues. If not, the model reverts to the beginning of these tokens and explores alternative tokens, ensuring only safe outputs proceed.
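The generate, evaluate, and rewind loop can be sketched as below. This is a deliberately simplified illustration under stated assumptions and not the RAIN algorithm itself: `propose` and `is_safe` are hypothetical callables standing in for the LLM's tentative generation and its safety evaluation, and the retry limits are arbitrary.

```python
from typing import Callable, List


def rewindable_generate(
    propose: Callable[[List[str], int], List[str]],
    is_safe: Callable[[List[str]], bool],
    max_steps: int = 20,
    max_retries: int = 5,
) -> List[str]:
    """Keep tentatively generated tokens only when the extended output is judged safe.

    `propose(prefix, attempt)` returns a candidate chunk of tokens, or an
    empty list when generation is finished.
    """
    output: List[str] = []
    for _ in range(max_steps):
        for attempt in range(max_retries):
            chunk = propose(output, attempt)
            if not chunk:
                return output  # generation finished
            if is_safe(output + chunk):
                output.extend(chunk)  # retain safe tokens and continue
                break
            # Otherwise rewind: discard the chunk and try an alternative.
        else:
            break  # no safe continuation found within the retry budget
    return output


# Hypothetical toy components for demonstration.
def toy_propose(prefix: List[str], attempt: int) -> List[str]:
    if len(prefix) >= 3:
        return []
    return [["bad"], ["ok"]][attempt % 2]


def toy_is_safe(tokens: List[str]) -> bool:
    return "bad" not in tokens
```

With the toy components, every first attempt is rejected and rewound, and the second attempt is retained, so the final output contains only tokens judged safe.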

Xu et al. [78] propose a new method, namely SafeDecoding, that fine-tunes a safety expert model derived from the original LLM using a curated safety dataset. During inference, this method adjusts the output probabilities of the original LLM by aligning them with the discrepancies observed between the safety expert model’s and the original LLM’s output distributions. This adjustment reduces the likelihood of generating unsafe outputs while increasing the probability of producing safe responses.
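One way to read this adjustment is as shifting each token's probability by a multiple of the difference between the expert's and the original model's probabilities. The sketch below follows that reading; the interpolation form, the weight `alpha`, the restriction to shared tokens, and the clipping and renormalization steps are assumptions made for illustration rather than details confirmed by the text above.

```python
from typing import Dict


def adjust_distribution(
    p_orig: Dict[str, float],
    p_expert: Dict[str, float],
    alpha: float = 2.0,  # hypothetical weight on the expert/original discrepancy
) -> Dict[str, float]:
    """Shift next-token probabilities toward the safety expert and renormalize."""
    shared = p_orig.keys() & p_expert.keys()
    raw = {
        tok: max(p_orig[tok] + alpha * (p_expert[tok] - p_orig[tok]), 0.0)
        for tok in shared
    }
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("no probability mass left after adjustment")
    return {tok: value / total for tok, value in raw.items()}
```

For example, if the original model assigns 0.7 to "Sure" and 0.3 to "Sorry" while the expert assigns 0.2 and 0.8, then `alpha=1.0` reproduces the expert's distribution and larger values push further toward the refusal token.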

Figure 13: An example method of generation intervention-based defenses by Li et al. [77]. In this process, the LLM repeatedly generates and evaluates tokens. If the tokens are deemed safe, they are retained and the generation process continues. If any tokens are considered unsafe, the LLM rewinds to the start of the unsafe sequence and attempts regeneration.

3.2.5 Response Evaluation-based Defenses

Response evaluation-based defenses assess the harmfulness of LLM responses and often refine them afterward iteratively to make the responses safer. An overview of the process is depicted in Fig. 14.

Helbling et al. [91] introduce a method where an auxiliary LLM evaluates the harmfulness of responses from the primary model to ensure safety. Going a step further, Pisano et al. [79] employ a secondary LLM not only to assess harm but also to guide the refinement of responses. Similarly, Zeng et al. [92] deploy several external LLMs serving different roles for assessing potential harm in responses and refining them. Instead of relying on additional LLMs, Kim et al. [80] develop a methodology where the primary LLM itself evaluates and iteratively refines its outputs.
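The evaluate-then-refine pattern shared by these methods can be sketched as a short loop. All callables below are hypothetical placeholders: `generate` stands for the target LLM, while `evaluate` and `refine` may be backed by the same model or by external LLMs. The round limit and the fallback message are assumptions.

```python
from typing import Callable, Tuple


def evaluate_and_refine(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,
    fallback: str = "I can't help with that.",
) -> str:
    """Return a response the evaluator deems safe, refining it up to `max_rounds` times."""
    response = generate(prompt)
    for _ in range(max_rounds):
        is_safe, feedback = evaluate(prompt, response)
        if is_safe:
            return response
        response = refine(prompt, response, feedback)
    # Final check after the last refinement; fall back to a refusal if still unsafe.
    is_safe, _ = evaluate(prompt, response)
    return response if is_safe else fallback


# Hypothetical toy components for demonstration.
def toy_generate(prompt: str) -> str:
    return "harmful draft"


def toy_evaluate(prompt: str, response: str) -> Tuple[bool, str]:
    safe = "harmful" not in response
    return safe, "" if safe else "remove harmful content"


def toy_refine(prompt: str, response: str, feedback: str) -> str:
    return "safe answer"
```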

Figure 14: An example of response evaluation-based defenses. The target LLM generates responses which are then assessed by the evaluator LLM for safety. This evaluator can be the same as the target LLM or different external LLMs. The process continues iteratively, with the evaluator suggesting refinements until it deems a response safe for output.

3.2.6 Model Fine-tuning-based Defenses

Rather than relying on external measures, Model Fine-tuning-based defenses enhance LLM safety by altering the model’s inherent characteristics. An example is shown in Fig. 15.

Figure 15: An example of model fine-tuning-based defenses. In the fine-tuning phase, the LLM is exposed to a mix of instructions emphasizing either helpfulness or safety, paired with either safe or harmful queries, and is trained to respond appropriately. During inference, safety instructions are consistently prefixed to the input prompts to ensure the generation of safe responses.

This defense strategy encompasses a broad spectrum of techniques, each of which provides a unique perspective on enhancing model safety. Jain et al. [70] and Bhardwaj and Poria [93] train the model on a mixture of benign and adversarial data to enhance model safety, marking an initial step in this direction. Ge et al. [81] introduce an adversarial framework designed for automatic red-teaming, which pits an attack model against the target LLM, with the former striving to refine attack prompts based on past successes, and the latter aiming to generate safe and helpful responses informed by previous interactions and feedback from a reward model.

Recognizing the inherent conflict between helpfulness and safety as a factor of irresponsible output, Zhang et al. [86] fine-tune the LLM to explicitly recognize these objectives and prioritize safety during inference. The proposed method of Goal Prioritization Defense is shown in Fig. 15. In the fine-tuning stage, the LLM is exposed to both harmful and harmless prompts, accompanied by instructions that prioritize either helpfulness or safety. The LLM learns to produce helpful responses when prompted with helpfulness instructions and safe responses when prompted with safety instructions. During inference, the LLM is always presented with safety instructions alongside the prompt, guiding it to generate safe responses.

Drawing inspiration from the realm of backdoor attacks [94, 95, 96], Wang et al. [97] tackle fine-tuning-based jailbreak attacks by embedding a secret prompt within safety examples during fine-tuning. Then during inference, the safety can be enhanced by prefixing these secret prompts to input prompts.

In a separate approach, Hasan et al. [98] utilize model compression to enhance the LLM’s resilience against jailbreaks. Their method employs a pruning technique known as Wanda [99], which has been shown to bolster the model’s defenses.

Venturing into the innovative realm of knowledge editing as a new frontier for LLM post-training adjustments, Wang et al. [82] introduce Detoxifying with Intraoperative Neural Monitoring (DINM), a technique that locates and modifies the weights of layers identified as sources of toxic outputs.

Taking a radical approach, Piet et al. [100] propose abandoning the LLM’s instruction tuning capability, instead training task-specific models from a non-instruction-tuned base model to sidestep potential attacks.

3.3 Comprehensive Evaluation for Large Language Models

A substantial body of research focuses on assessing the effectiveness of jailbreak strategies and defense mechanisms, offering valuable insights for developing more potent methods. These studies aim to understand the mechanisms through which jailbreak attempts can circumvent the safeguards of LLMs and identify vulnerabilities within these systems. This research can be broadly categorized into two areas: jailbreak evaluations and defense evaluations.

Jailbreak Evaluations focus on understanding and exploiting the vulnerabilities of LLMs. Liu et al. [101] identified ten patterns and three categories of jailbreak prompts, showing that prompt structure is crucial for bypassing LLM restrictions. Gupta et al. [102] demonstrated ChatGPT’s vulnerabilities to cyberattacks, including social engineering and malware creation, emphasizing the need for robust security measures. Wei et al. [103] revealed persistent vulnerabilities in LLMs despite extensive safety training.

Further, Glukhov et al. [104] argued that effective content control is challenging due to the undecidable nature of semantic censorship. Shen et al. [40] analyzed 6,387 jailbreak prompts, finding that some remained undetected for over 100 days. Inie et al. [105] explored the motivations and strategies of practitioners identifying LLM vulnerabilities, providing real-world insights.

Singh et al. [106] found that LLMs are prone to social engineering attacks, indicating a need for better security measures. Zhou et al. [107] introduced EasyJailbreak, a framework for systematically constructing and evaluating jailbreak attacks. Geiping et al. [108] categorized LLM attacks and identified factors influencing their efficacy, such as glitch tokens.

Banerjee et al. [109] introduced TECHHAZARDQA to evaluate LLMs’ propensity for generating unethical content. Jiang et al. [110] demonstrated LLMs’ struggles with ASCII art-based jailbreak prompts. Ye et al. [111] presented ToolSword, a framework for identifying safety challenges in LLM tool learning applications. Sharma et al. [112] introduced a benchmark for evaluating chatbot vulnerabilities. Souly et al. [113] introduced StrongREJECT to evaluate jailbreak effectiveness more accurately.

Defense Evaluations examine methods to safeguard LLMs against various attacks. Evaluations of defense technologies for LLMs are relatively scarce. Xu et al. [114] conducted a notable attack versus defense study, finding that the Bergeron method was the most effective among five prompt-based defenses, while others faced challenges with natural language inputs. Varshney et al. [115] examined basic prompt manipulation strategies, highlighting the significant impact of safety instructions and in-context exemplars on model safety and over-defensiveness. Conversely, the implementation of a Self-Check strategy significantly heightened the model’s over-defensiveness.

3.4 Additional Resources

Several studies delve into the vulnerabilities of Large Language Models (LLMs) through adversarial attacks, offering crucial insights and comprehensive analyses to enhance LLM security.

Shayegani et al. [12] provided a detailed overview of LLMs, focusing on safety alignment and various attack methodologies, including textual-only and multi-modal attacks. They also explored unique strategies for complex systems like federated learning, critically examining the origins of LLM vulnerabilities and defensive measures. This survey serves as a key resource for understanding the challenges and solutions in securing LLMs against adversarial threats. Similarly, Esmradi et al. [13] reviewed over 100 studies to offer an in-depth analysis of attack types on LLMs. They detailed the latest methods and mitigation techniques, evaluating the effectiveness and limitations of current defenses while predicting future protective measures. By including both documented and personally implemented attacks, their work underscores the urgent need for enhanced security and contributes significantly to developing robust defenses in the LLM domain. In another comprehensive review, Rao et al. [14] examined jailbreak methods for both open-source and commercial LLMs such as GPT, OPT, BLOOM, and FLAN-T5-XXL. They assessed the effectiveness of these methods and the challenges in detecting such breaches. Additionally, they introduced a dataset containing responses to 3,700 jailbreak prompts across four tasks, aiming to aid further research in improving model security and jailbreak detection capabilities.

Overall, these studies provide a thorough examination of the security landscape surrounding LLMs, offering valuable insights into their vulnerabilities and the defensive measures required to mitigate adversarial threats.

4 Threats in Vision-Language Models

4.1 Jailbreak Strategies on Vision-Language Models

Security challenges associated with VLMs have emerged as a critical concern, mirroring the issues seen with LLMs. As all VLMs utilize an LLM component for text encoding, vulnerabilities that affect LLMs can potentially compromise VLMs as well. Furthermore, the incorporation of visual inputs into these models not only broadens the range of functionalities but also significantly increases the attack surface, thus escalating the security risks involved.

Unlike jailbreaks on LLMs, which primarily target textual inputs, malicious manipulations on VLMs can occur through visual inputs, textual components, or a combination of both, exhibiting much more complex and diverse patterns. Each strategy exploits different vulnerabilities in VLMs, highlighting the need for robust defense mechanisms. In general, there are three predominant strategies for jailbreaking VLMs, illustrated in Fig. 16:

  1. Prompt-to-Image Injection Jailbreaks: These jailbreaks manipulate textual content to create visual prompts that induce the model to generate a jailbreak prompt. By crafting specific textual patterns or structures, attackers can trick the VLM into producing undesired or harmful outputs. Techniques include feeding harmful instructions through the image channel and using benign text prompts [116].

  2. Prompt-Image Perturbation Injection Jailbreaks: These jailbreaks, on the other hand, involve subtly altering images and combining them with malicious text. These perturbations exploit vulnerabilities in the model’s visual-textual processing capabilities, causing the VLM to generate jailbreak prompts. Methods exploit cross-modal interactions by perturbing both modalities collectively [117], by applying optimal transport theory [118], and by using alignment-preserving augmentation [119].

  3. Proxy Model Transfer Jailbreaks: These jailbreaks leverage alternative VLMs to produce perturbed images from standard ones. Shayegani et al. [120] introduced a method that directly exploits the embedding space of vision encoders without requiring access to the multi-modal system’s weights or parameters, making it more efficient and potentially more effective. Recent advancements have also explored model ensembles and novel attacks tailored to the multimodal processing of these models [121, 122].

Figure 16: Jailbreak Strategies for VLMs: This figure depicts three principal jailbreak techniques targeting VLMs. Prompt-to-Image Injection Jailbreaks manipulate the textual content to create visual prompts that lead to a jailbreak prompt when processed by the VLM. Prompt-Image Perturbation Injection Jailbreaks introduce alterations to images coupled with malicious texts to produce a jailbreak prompt, exploiting the model’s visual-textual analysis vulnerabilities. Proxy Model Transfer Jailbreaks utilize substitute VLMs to generate perturbed images from standard images, which are then combined with normal texts to craft a jailbreak prompt.

In the following sections, we explore each of these jailbreak strategies in more detail, discussing their unique characteristics and recent advancements. By examining these strategies, we aim to provide a comprehensive overview of the current state of VLM security and highlight the challenges and opportunities in developing effective defense mechanisms to mitigate these threats.

4.1.1 Prompt-to-image Injection Jailbreaks

Figure 17: An example of a Prompt-to-Image Injection Attack. This process involves paraphrasing a prohibited text query, converting it into a typographic visual prompt image, and using an incitement text prompt to motivate the model to answer the visual prompt. The original query is transformed into a jailbreaking query that combines the visual prompt encoding the question and the incitement prompt to generate the final response.

Recent studies have highlighted the vulnerability of VLMs to prompt-to-image injection attacks, which transfer harmful content into images accompanied by text instructions, as shown in Fig. 17. The attack paraphrases a prohibited text query, converts it into a typographic visual prompt image, and uses an incitement text prompt to motivate the model to answer the visual prompt. The original query is thereby transformed into a jailbreaking query that combines the visual prompt encoding the question with the incitement prompt to generate the final response.

Gong et al. [116] proposed FigStep as a black-box approach for jailbreaking. It feeds harmful instructions into VLMs through the image channel and then uses benign text prompts to induce VLMs to output contents that violate common AI safety policies. They also found out that the safety of VLMs requires attention beyond what is provided by LLMs, due to inherent limitations in text-centric safety alignment.

The approach used in prompt-to-image injection attacks shares similarities with the demonstration-based jailbreaks discussed in Section 3.1.3. In both cases, the attacker crafts specific prompts (either text-based for LLMs or image-based for VLMs) to guide the model’s response toward generating content that violates safety policies. However, prompt-to-image injection attacks exploit the additional attack surface introduced by the visual modality in VLMs, allowing adversaries to bypass the safety measures that primarily focus on textual inputs.

4.1.2 Prompt-Image Perturbation Injection Jailbreaks

Prompt-Image Perturbation Injection Jailbreaks have been widely studied in VLMs, particularly in the black-box setting. They involve manipulating both the textual and visual inputs. Specifically, the original text is first maliciously perturbed to describe a different and potentially harmful scenario. As shown in Fig. 18, the original image is then perturbed by adding imperceptible noise. The perturbed text and the perturbed image are fed into the VLM together, eliciting responses that should have been refused.

Figure 18: An example of Prompt-Image Perturbation Injection Attacks. In this attack, the original text is first maliciously perturbed to describe a different and potentially harmful scenario. The original image is then perturbed by adding imperceptible noise. The perturbed texts and images are fed into the VLM, aiming to elicit unintended or harmful responses that would normally be filtered or rejected by the model’s safety mechanisms.

Early works, such as Co-Attack [117], focused on exploiting cross-modal interactions by perturbing image and text modalities collectively. However, it suffered from limitations in its transferability to other VLMs. To address these limitations, Lu et al. [119] proposed the Set-Level Guidance Attack (SGA), which leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance. Despite its advancements, SGA has limitations in adequately addressing the optimal matching of post-augmentation image examples with their corresponding texts. Building on SGA, Han et al. [118] developed the OT-Attack, which incorporates the theory of optimal transport to analyze and map data-augmented image sets and text sets, ensuring a balanced match after augmentation.

Despite the advancements mentioned above, challenges remain in effectively modeling inter-modal correspondence and optimizing the transferability of adversarial examples across different VLMs. Further improvements in transferability and adversarial example generation were made by Niu et al. [123] and Qi et al. [124], who introduced the concept of an image Jailbreaking Prompt (imgJP) and visual adversarial examples that show strong data-universal and model-transferability properties. Their approach enables black-box jailbreaking of various VLMs and can be converted to achieve LLM jailbreaks by transforming an imgJP to a text Jailbreaking Prompt (txtJP). Carlini et al. [125] demonstrated that VLMs can be easily exploited by NLP-based optimization attacks, inducing them to perform arbitrary unaligned behavior through adversarial perturbation of the input image. The authors conjecture that improved NLP attacks may demonstrate a similar level of adversarial control over text-only models.

To further improve the transferability of adversarial examples, Luo et al. [126] introduced the Cross-Prompt Attack (CroPA), which updates the visual adversarial perturbation together with learnable prompts. These prompts are optimized in the opposite direction of the perturbation, thereby covering more of the prompt embedding space and significantly improving transferability across different prompts. Zhao et al. [127] conducted a quantitative evaluation of the adversarial robustness of different VLMs by generating adversarial images that deceive the models into producing targeted responses. Similarly, Schlarmann & Hein [128] and Bailey et al. [129] demonstrated high attack success rates on VLMs using imperceptible perturbations. Along the same line, Zhou et al. [130] propose AdvCLIP for generating downstream-agnostic adversarial examples in multimodal contrastive learning. Yin et al. [131] propose VLATTACK, which generates adversarial samples by fusing perturbations of images and texts at both the single-modal and multimodal levels.

Generally speaking, the recent advancements in Image Perturbation Injection Jailbreaks on VLMs share several similarities with the gradient-based and evolutionary-based jailbreaks discussed in Section 3.1.1 and Section 3.1.2 for LLMs. Both types of attacks leverage optimization techniques to generate adversarial inputs and further use iterative processes for effectiveness improvement. However, VLMs process both visual and textual inputs, and the interactions between these modalities can be exploited by attackers. Image Perturbation Injection attacks specifically target these cross-modal interactions to generate more potent and harder-to-detect adversarial examples. Moreover, the inclusion of visual inputs in VLMs expands the attack surface compared to text-only LLMs. Attackers can manipulate both the textual and visual components of the input, which allows for the jailbreaking of VLMs and LLMs using a single adversarial image. Additionally, optimizing the transferability of adversarial examples across different VLMs is a significant challenge, as the cross-modal interactions and architectures of these models can vary greatly. This encourages the development of more generalizable adversarial perturbations.

The advancements made in each study, from early approaches like Co-Attack and Sep-Attack to more sophisticated methods like SGA, OT-Attack, and jailbreaking attacks, have progressively expanded our understanding of the adversarial landscape in VLMs. However, the challenges in effectively modeling inter-modal correspondence, optimizing transferability across different VLP models, and defending against emerging attacks like image hijacks and AdvCLIP underscore the importance of continued research efforts in this field.

4.1.3 Proxy Model Transfer Jailbreaks

Figure 19: An example of Proxy Model Transfer Attack. Attackers use proxy models to create adversarial examples that are more likely to transfer to the victim model. With white-box access to a proxy model, attackers apply various transferability-enhancing techniques to create adversarial examples. The crafted adversarial examples are then transferred to the black-box victim model. If the transfer is successful, the victim model misclassifies the adversarial examples, leading to a jailbroken output.

Proxy Model Transfer Jailbreaks exploit the transferability of adversarial examples across different models, as shown in Fig. 19. As with the rule-based jailbreaks discussed in Section 3.1.4 for LLMs, attackers have no direct access to the victim model’s parameters or architecture. Instead, they have white-box access to a proxy model, on which they craft adversarial examples using transferability-enhancing techniques such as input diversity, momentum, or translation-invariant attacks. The crafted adversarial examples are then transferred to the black-box victim model. If the transfer is successful, the victim model misclassifies the adversarial examples, leading to a jailbroken output. This approach builds upon the foundational work of [132, 133, 134, 135, 136]. Most recently, the exploration of adversarial robustness by Dong et al. [121] revealed vulnerabilities specific to commercial VLMs and proposed attacks tailored to the multimodal processing of these models. Chen et al. [122] introduce a new perspective on model ensembles in Proxy Model Transfer Jailbreaks. They define the common weakness of model ensembles as a solution that lies in a flat loss landscape and is close to the local optima of each model. By promoting these two properties, the authors aim to generate more transferable adversarial examples that can effectively fool black-box models like Google’s Bard. Shayegani et al. [120] examine the vulnerability of multi-modal systems that incorporate off-the-shelf components, such as pre-trained vision encoders like CLIP, in a plug-and-play manner. They propose adversarial embedding space attacks, which exploit the vast and under-explored embedding space of these pre-trained encoders without requiring access to the multi-modal system’s weights or parameters. Instead of using a substitute VLM to generate perturbed images, as in [132] and [133], the proposed method directly exploits the embedding space of the vision encoder, making it more efficient and potentially more effective.
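To make the proxy-to-victim workflow concrete, the following is a minimal sketch of one transferability-enhancing technique named above, momentum, applied to a toy numeric classifier rather than a VLM. The linear proxy and victim models, their weights, and the function name `momentum_transfer_attack` are illustrative assumptions and are not taken from any of the cited works.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return (v > 0) - (v < 0)

def momentum_transfer_attack(x, w_proxy, eps=0.6, steps=10, mu=0.9):
    """Craft a perturbation on a white-box proxy using momentum-accumulated gradients.

    Toy setting (assumption): the proxy is a logistic model with weights
    w_proxy and the true label of x is 1, so the loss is -log(sigmoid(w . x)).
    """
    alpha = eps / steps                 # per-step size
    x_adv = list(x)
    g = [0.0] * len(x)                  # momentum buffer
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-dot(w_proxy, x_adv)))
        grad = [-(1.0 - p) * wi for wi in w_proxy]          # d loss / d x
        l1 = sum(abs(gi) for gi in grad) or 1.0
        # Accumulate the L1-normalized gradient into the momentum buffer.
        g = [mu * gi + gr / l1 for gi, gr in zip(g, grad)]
        # Ascend the proxy loss along the sign of the accumulated gradient.
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project back into the L-infinity ball of radius eps around x.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

# Hypothetical weights: the victim resembles the proxy but is not identical to it.
W_PROXY = [1.0, -2.0, 0.5]
W_VICTIM = [0.8, -1.5, 0.7]
x_clean = [0.5, -0.5, 0.2]
x_adv = momentum_transfer_attack(x_clean, W_PROXY)
```

In this toy case the perturbation is computed only from the proxy, yet it also changes the sign of the victim's score because the two weight vectors point in similar directions. This mirrors the dependence on proxy-victim similarity discussed in this section.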

Despite the advancements in Proxy Model Transfer Jailbreaks, several limitations and challenges remain. As highlighted by Zhao et al. [127], these attacks depend on white-box access to proxy models, which limits their applicability when similar models are unavailable or inaccessible. Generating adversarial examples, especially high-quality ones that are likely to transfer, can also be computationally expensive: the process involves searching for input perturbations that lead to misclassifications, which may require significant computational resources for complex VLMs. In addition, the success of Proxy Model Transfer Jailbreaks relies heavily on the similarity between the victim and proxy models. Differences in architectures, training data, or optimization objectives can reduce the transferability of adversarial examples [127], which limits the universality of these attacks across diverse models.

4.2 Defense Mechanisms for Vision-Language Models

In the continuous quest to fortify VLMs against jailbreak threats, researchers have proposed various strategies. As illustrated in Fig. 20, they can be broadly categorized into three main approaches:

  1. Model Fine-tuning-based Defenses: These defenses involve fine-tuning the VLM to enhance safety. Techniques include leveraging Natural Language Feedback for improved alignment [137] and adversarial training to increase model robustness. Parameter adjustments to resist adversarial prompts and images are also employed [138].

  2. Response Evaluation-based Defenses: This approach assesses the harmfulness of VLM responses, often followed by iterative refinement to ensure safe outputs. Methods integrate harm detection and detoxification to correct potentially harmful outputs [139]. ECSO [140] restores the intrinsic safety mechanism of pre-aligned LLMs by transforming potentially malicious visual content into plain text.

  3. Prompt Perturbation-based Defenses: These strategies involve altering input prompts to neutralize adversarial effects. Techniques use variant generators to disturb input queries and analyze response consistency to identify potential jailbreak attempts [141].

The following sections will provide a more in-depth look at the key contributions and insights from recent studies in Model Fine-tuning-based Defenses, Response Evaluation-based Defenses, and Prompt Perturbation-based Defenses, highlighting the progress made and the opportunities for further advancement in the field of VLM security.

Figure 20: Defense Mechanisms in VLMs: This figure illustrates three defense strategies implemented in VLMs to mitigate jailbreak attempts. Model Fine-tuning-based Defenses intercept jailbreak prompts during the model’s training phase, with an agent LLM providing updates and feedback to reinforce the model. Response Evaluation-based Defenses operate similarly but take place during the model’s inference phase, ensuring that the model’s response to a jailbreak prompt is normal. Prompt Perturbation-based Defenses alter the input prompts into mutated queries that the target VLM processes, with the system evaluating the responses against certain metrics to prevent jailbreaks.

4.2.1 Model Fine-tuning-based Defenses

Model fine-tuning-based defenses focus on intercepting and mitigating jailbreak prompts during the model’s training phase, leveraging techniques such as prompt optimization and natural language feedback to reinforce the model’s resistance to malicious inputs, as shown in Fig. 21. These methods mitigate the risks associated with malicious inputs, particularly Prompt-to-image Injection Jailbreaks, which exploit the multimodal capabilities of VLMs to bypass safety mechanisms.

Figure 21: An example of Model Fine-tuning-based Defenses. This process starts with the detection and initial response to malicious or unaligned inputs. Input processing is tailored using defender LLMs for prompt generation and harm detection. The feedback and refinement phase optimizes inputs, integrates natural language feedback, and detoxifies outputs. This leads to the generation of safe and aligned responses.

To ensure the safety of VLMs without compromising their performance, Chen et al. [137] first introduced DRESS, which leverages natural language feedback (NLF) from large language models to improve the alignment and interactions within VLMs. By categorizing NLF into critique and refinement feedback, DRESS aims to enhance the model’s ability to generate more aligned and helpful responses, as well as to refine its outputs based on the feedback received. This approach addresses the limitations of prior VLMs that rely solely on supervised fine-tuning and RLHF for alignment with human preferences. DRESS introduces a method for conditioning the training of VLMs on both critique and refinement NLF, thus fostering better multi-turn interactions and alignment with human values.

In follow-up work, Wang et al. [138] propose AdaShield, a prompt-based defense mechanism that requires neither fine-tuning of VLMs nor the development of auxiliary models. This approach is particularly advantageous because it uses only a limited number of malicious queries to optimize defense prompts, thereby avoiding high computational costs, significant inference-time costs, and the need for extensive training data. Through an auto-refinement framework that includes a target VLM and a defender LLM, AdaShield iteratively optimizes defense prompts. This process generates a diverse pool of defense prompts that adhere to specific safety guidelines, enhancing the robustness of VLMs against Prompt-to-image Injection Jailbreaks. The adaptive and automatic nature of this approach ensures that VLMs are safeguarded effectively without requiring extensive modifications to the models themselves. By comparison, Pi et al. [139] represent a more traditional approach to defense, incorporating a harm detector and a detoxifier to correct potentially harmful outputs generated by VLMs. However, this Model Fine-tuning-based Defense strategy requires a significant amount of high-quality data and computational resources. Additionally, as a post-hoc filtering defense mechanism, it incurs substantial inference-time costs, which can be a significant drawback in practical applications.
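The auto-refinement idea described for AdaShield can be illustrated with a minimal sketch. It assumes the target VLM, the defender LLM, and the refusal check are supplied as plain callables. The function name `refine_defense_prompts` and the toy stand-ins below are hypothetical and do not reproduce the implementation of [138].

```python
from typing import Callable, List

def refine_defense_prompts(
    malicious_queries: List[str],
    target_vlm: Callable[[str, str], str],         # (defense_prompt, query) -> response
    defender_llm: Callable[[str, str, str], str],  # (prompt, query, response) -> revised prompt
    is_refusal: Callable[[str], bool],
    seed_prompt: str,
    max_rounds: int = 3,
) -> List[str]:
    """Collect defense prompts that make the target refuse each malicious query.

    Each query starts from seed_prompt, and the prompt is evaluated at most
    max_rounds times. A prompt that never elicits a refusal is discarded.
    """
    pool: List[str] = []
    for query in malicious_queries:
        prompt = seed_prompt
        for _ in range(max_rounds):
            response = target_vlm(prompt, query)
            if is_refusal(response):
                if prompt not in pool:
                    pool.append(prompt)
                break
            # The defender rewrites the prompt based on the failed attempt.
            prompt = defender_llm(prompt, query, response)
    return pool

# Toy stand-ins (assumptions) so the loop can be exercised without real models.
def toy_vlm(prompt: str, query: str) -> str:
    return "I cannot help with that." if "inspect" in prompt.lower() else "Sure."

def toy_defender(prompt: str, query: str, response: str) -> str:
    return prompt + " Inspect the image text before answering."

def toy_is_refusal(response: str) -> bool:
    return response.startswith("I cannot")

pool = refine_defense_prompts(["q1", "q2"], toy_vlm, toy_defender,
                              toy_is_refusal, "Be safe.")
```

Only the small set of malicious queries and the two model handles are needed, which reflects the claim above that no fine-tuning of the target model is involved.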

The Model Fine-tuning-based Defense strategies share similarities with the model fine-tuning defenses discussed in Section 3.2.6 for LLMs. Both approaches aim to enhance the model’s safety and alignment with human preferences by modifying the training process. However, the multi-modal nature of VLMs introduces additional challenges and opportunities for Model Fine-tuning-based Defense strategies. The presence of both textual and visual modalities in VLMs necessitates the development of defense mechanisms that can effectively align these modalities in a compositional manner. For instance, AdaShield [138] addresses this challenge by generating defense prompts that adhere to specific safety guidelines. DRESS [137], on the other hand, leverages natural language feedback to improve the alignment and interaction between the textual and visual modalities in VLMs.

4.2.2 Response Evaluation-based Defenses

Response evaluation-based defenses operate during the model’s inference phase, ensuring that the model’s response to a jailbreak prompt remains safe and aligned with the desired behavior. Fig. 22 illustrates the Response Evaluation-based Defense process for VLMs under adversarial conditions.

Figure 22: An example of Response Evaluation-based Defenses. The target VLM generates an initial response to the input prompt. The response is then assessed by an evaluator LLM for safety and alignment with desired behavior. This evaluator can be the same as the target VLM or a separate external LLM. If the response is deemed unsafe, the evaluator suggests refinements, and the process is repeated iteratively until a safe and aligned response is generated. Once the evaluator approves the response, it is output as the final answer.
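The generate, evaluate, and refine cycle described in Fig. 22 can be sketched as a short loop. This is a minimal illustration in which the generator, evaluator, and refiner are assumed to be supplied as callables. The names `evaluate_and_refine`, `max_iters`, and `fallback`, as well as the toy stand-ins, are hypothetical.

```python
from typing import Callable, Tuple

def evaluate_and_refine(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],  # -> (is_safe, feedback)
    refine: Callable[[str, str, str], str],            # (prompt, response, feedback) -> response
    max_iters: int = 3,
    fallback: str = "I can't help with that request.",
) -> str:
    """Return a response the evaluator accepts, or a fixed fallback refusal."""
    response = generate(prompt)
    for _ in range(max_iters):
        safe, feedback = evaluate(prompt, response)
        if safe:
            return response
        response = refine(prompt, response, feedback)
    # Check the last refinement once more before giving up.
    safe, _ = evaluate(prompt, response)
    return response if safe else fallback

# Toy stand-ins (assumptions) for the target VLM and the evaluator LLM.
def toy_generate(prompt: str) -> str:
    return "UNSAFE draft"

def toy_evaluate(prompt: str, response: str) -> Tuple[bool, str]:
    return ("UNSAFE" not in response, "remove flagged content")

def toy_refine(prompt: str, response: str, feedback: str) -> str:
    return response.replace("UNSAFE", "redacted")

result = evaluate_and_refine("describe the image", toy_generate,
                             toy_evaluate, toy_refine)
```

Bounding the loop with `max_iters` and returning a fallback is a design choice of this sketch. It keeps the inference-time cost predictable when refinement does not converge.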

Pi et al. [139] and Zong et al. [142] proposed Model Fine-tuning-based Defense methods that aim to align VLMs with specially constructed red-teaming data. However, these approaches are labor-intensive and may not cover all potential attack vectors. In contrast, inference-based defense methods protect VLMs during the inference stage without requiring additional training.

One notable Response Evaluation-based approach is ECSO [140], a training-free method that exploits the inherent safety awareness of VLMs. ECSO leverages the observation that VLMs can detect unsafe content in their own responses and that the safety mechanism of pre-aligned LLMs persists in VLMs but is suppressed by image features. By transforming potentially malicious visual content into plain text using a query-aware image-to-text transformation, ECSO effectively restores the intrinsic safety mechanism of the pre-aligned LLMs within the VLM.
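The steps attributed to ECSO above (answer, self-check, query-aware image-to-text conversion, then a text-only answer) can be sketched as follows. This is a minimal illustration under the assumption that the VLM is exposed through three callables. The names, the text-only prompt template, and the toy model are hypothetical and are not taken from [140].

```python
from typing import Any, Callable, Optional

def ecso_style_answer(
    image: Any,
    query: str,
    vlm_answer: Callable[[Optional[Any], str], str],  # image may be None (text-only)
    vlm_is_unsafe: Callable[[str], bool],             # the VLM judging its own response
    vlm_caption: Callable[[Any, str], str],           # query-aware image-to-text
) -> str:
    first = vlm_answer(image, query)
    if not vlm_is_unsafe(first):
        return first
    # Replace the image with a query-aware textual description, then
    # answer without any image input.
    caption = vlm_caption(image, query)
    text_only = f"Image description: {caption}\nQuestion: {query}"
    return vlm_answer(None, text_only)

# Toy model (assumption): image input suppresses the safety check,
# while text-only input triggers it.
def toy_vlm_answer(image, query):
    if image is not None:
        return "UNSAFE: details" if "restricted" in image["content"] else "Here is a description."
    return "I can't help with that." if "restricted" in query else "Here is a description."

def toy_is_unsafe(response):
    return response.startswith("UNSAFE")

def toy_caption(image, query):
    return image["content"]

blocked = ecso_style_answer({"content": "restricted item"}, "How is this made?",
                            toy_vlm_answer, toy_is_unsafe, toy_caption)
benign = ecso_style_answer({"content": "a cat"}, "What is this?",
                           toy_vlm_answer, toy_is_unsafe, toy_caption)
```

Benign queries return after the first call, so the extra captioning and text-only steps are only incurred when the self-check flags a response.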

The response evaluation-based defense strategies discussed in this section underscore the importance of developing defense mechanisms that can effectively detect and mitigate potentially harmful content generated by VLMs during the inference stage. While some of the response evaluation defenses developed for LLMs (Section 3.2.5) may be adapted to the VLM context, dedicated research efforts are required to address the unique challenges posed by the multi-modal nature of these models. By focusing on compositional safety alignment approaches and exploiting the inherent safety awareness of VLMs, Response Evaluation-based Defense strategies can provide an additional layer of protection against adversarial attacks on VLMs, complementing the Model Fine-tuning-based Defense strategies discussed in Section 4.2.1.

4.2.3 Prompt Perturbation-based Defenses

Prompt Perturbation-based Defenses take a different approach: they mutate the input prompt into variant queries and analyze the consistency of the model’s responses to identify potential jailbreaking attempts. An overview is shown in Fig. 23.

Prompt perturbation methods exploit the inherent fragility of attack queries, which often rely on crafted templates or complex perturbations and are therefore less robust than benign queries. By mutating the input into variant queries and analyzing the consistency of the language model’s responses, these methods can effectively identify potential jailbreaking attempts. Zhang et al. [141] proposed JailGuard, a Prompt Perturbation-based jailbreaking detection framework that supports both image and text modalities. JailGuard employs a variant generator with 19 mutators, including random and advanced mutators, to disturb the input queries and generate variants. The attack detector then analyzes the semantic similarity and divergence between the responses to these variants, identifying potential attacks when the divergence exceeds a predefined threshold. Evaluations on a multi-modal jailbreaking attack dataset demonstrate JailGuard’s effectiveness, outperforming state-of-the-art defense methods.
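The mutate-and-compare procedure can be sketched for the text modality as follows. This is a simplified illustration, not JailGuard's implementation. The two toy mutators stand in for its 19 mutators, a token-level Jaccard distance is swapped in for its semantic similarity and divergence computation, and the names `detect_by_perturbation`, `n_variants`, and `threshold` are assumptions.

```python
import random
from itertools import combinations
from typing import Callable, List, Tuple

def random_deletion(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Drop each character independently with probability `rate`."""
    return "".join(ch for ch in text if rng.random() > rate)

def word_swap(text: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent words."""
    words = text.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def jaccard_divergence(a: str, b: str) -> float:
    """1 - Jaccard similarity over lower-cased tokens (a crude stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def detect_by_perturbation(
    query: str,
    model: Callable[[str], str],
    mutators: List[Callable[[str, random.Random], str]] = (random_deletion, word_swap),
    n_variants: int = 6,
    threshold: float = 0.5,
    seed: int = 0,
) -> Tuple[bool, float]:
    """Flag the query when responses to its variants diverge beyond `threshold`."""
    rng = random.Random(seed)
    variants = [mutators[i % len(mutators)](query, rng) for i in range(n_variants)]
    responses = [model(v) for v in variants]
    max_div = max(jaccard_divergence(a, b) for a, b in combinations(responses, 2))
    return max_div > threshold, max_div
```

A fragile attack query tends to succeed on some variants and fail on others, which yields divergent responses, whereas a benign query usually receives consistent answers. The threshold would need to be calibrated on held-out benign and attack queries.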

The prompt perturbation-based defense strategies share similarities with those in Section 3.2.2 for LLMs. Both approaches aim to detect and mitigate potentially harmful content by manipulating the input prompts. In the case of LLMs, Prompt Perturbation-based defenses perturb the input prompts to generate multiple variations and then aggregate the model’s responses to these variations to dilute the impact of adversarial prompts. The multi-modal nature of VLMs introduces additional challenges, because VLMs require defense mechanisms that can manipulate and analyze both textual and visual inputs. Prompt Perturbation-based defense techniques like JailGuard [141] provide an additional defense layer against jailbreaking attacks on VLMs without relying heavily on domain-specific knowledge or post-query analysis. These strategies complement the Model Fine-tuning-based Defenses (Section 4.2.1) and Response Evaluation-based Defenses (Section 4.2.2), offering a comprehensive framework for safeguarding VLMs against adversarial attacks.

Figure 23: An example of Prompt Perturbation-based Defense. The input query (image or text) is first passed through the variant generator, which applies mutators to generate multiple variants. These variants, along with the original input, are then fed into the multi-modal language model. The model generates responses for each input, and the response analysis component evaluates the consistency between the responses. If the divergence exceeds a predefined threshold, the attack detector flags the input query as a potential jailbreaking attempt. Appropriate actions can then be taken, such as blocking the query or alerting the system administrators. If the input query is deemed benign, it is allowed to proceed for further processing.

4.3 Comprehensive Evaluation for Vision-Language Models

Recent research has increasingly focused on analyzing the effectiveness of both jailbreak strategies and defense mechanisms of VLMs.

Liu et al. [143] proposed MM-SafetyBench for VLM safety evaluation. They use OpenAI’s GPT-4 to create questions and images based on the key phrases taken from the questions, and then rephrase the question to align with the images. This benchmark provides a valuable tool for assessing the effectiveness of such mechanisms and advancing our understanding of VLM safety. Tu et al. [144] introduced a comprehensive safety evaluation benchmark for VLMs, which covers both out-of-distribution (OOD) generalization and adversarial robustness. They propose a straightforward attack strategy for misleading VLMs to produce visual-unrelated responses and assess the efficacy of two jailbreaking strategies targeting either the vision or language component. Their evaluation of 21 diverse models yields interesting observations, such as the struggle of current VLMs with OOD texts and their susceptibility to being misled by deceiving vision encoders.

5 Future Direction

As LLMs and VLMs continue to evolve, addressing emerging security challenges is paramount. The following future directions highlight key areas to enhance the robustness, usability, and ethical alignment of these models:

  • Expanding Pretraining Data: The extensive use of diverse pretraining data improves generalization but also introduces risks, such as data pattern exploitation and generalization issues. Addressing these requires systematic data diversity approaches and comprehensive search methods, potentially through crowd-sourcing efforts like TensorTrust [145].

  • Addressing LLM and VLM Vulnerabilities: The evolving capabilities of LLMs and VLMs pose risks, including synthesizing complex biological agents and controlling critical infrastructures. Effective defenses, such as removing sensitive information from model weights through techniques like model editing [146, 98], are essential to counter these vulnerabilities.

  • Multilingual Safety Alignment: Ensuring safety across multiple languages is crucial for the global usability of LLMs and VLMs. Significant challenges exist in multilingual safety alignment [147, 148, 149, 150], necessitating robust protocols to defend against language-specific attacks exploiting linguistic gaps.

  • Multi-Modality Integration: Effective management of multi-modal data during attacks is often lacking. Proper integration of multiple modalities, considering interactions like text and vision, is vital for defending against multi-modal attacks.

  • Weight Manipulation Defenses: Understanding how model weights in different layers contribute to attack success can inform weight-targeted defense methods. This approach addresses safety and helpfulness trade-offs, making models more resilient to weight manipulation [151, 152, 153, 154].

  • Defining Safety and Crafting Robust Defenses: Clear definitions of “safety” in various contexts are essential for developing effective defenses. Future efforts should enhance existing strategies to balance effectiveness and efficiency without compromising model utility.

  • Adaptive Defense Mechanisms: Research should focus on adaptive defenses that respond dynamically to evolving attack patterns. Leveraging machine learning to anticipate and counteract new jailbreak strategies in real time can enhance security.

  • Collaborative Security Models: Collaboration between academia, industry, and policymakers can lead to standardized security protocols and shared vulnerability databases, enhancing collective responses to security threats.

  • Explainability and Transparency: Improving the explainability and transparency of LLMs and VLMs can help identify and mitigate security risks. Methods to make models more interpretable facilitate better vulnerability detection.

  • Benchmarking and Evaluation: Establishing comprehensive benchmarks and evaluation frameworks is crucial. Standardized testing environments simulating various attack scenarios provide reliable measures of a model’s security posture.

  • Human-in-the-Loop Approaches: Integrating human oversight into model operations adds a security layer. Hybrid models where human experts collaborate with automated systems to monitor and respond to suspicious activities can ensure robust defenses against jailbreak attempts.

By focusing on these directions, the security of LLMs and VLMs can be significantly enhanced, making them more reliable and ethically aligned to meet the challenges of increasingly sophisticated threats.

6 Conclusion

In this paper, we have conducted a thorough examination of jailbreak strategies and defense mechanisms for both LLMs and VLMs. By categorizing these strategies and defenses, we provide a cohesive narrative on the safety landscape for these advanced models. Our analysis highlights several critical aspects: we bridge the gap between disparate studies, offering a unified framework that encompasses both LLMs and VLMs, thereby enhancing the understanding of the interplay between attack and defense methodologies. Our work provides a detailed categorization of specific attack strategies and defenses, which is essential for developing targeted and effective defense mechanisms. We discuss comprehensive methods for assessing the effectiveness of various defenses, which are crucial for benchmarking the robustness of LLMs and VLMs against jailbreak attempts. Our survey covers the latest techniques in ethical alignment, such as prompt-tuning and reinforcement learning from human feedback, which are vital for enhancing the security and ethical compliance of these models. Additionally, we identify gaps in current research, including the need for more sophisticated defenses, a better understanding of vulnerabilities in multimodal models, and standardized benchmarks for evaluating jailbreak strategies. Addressing these gaps is critical for advancing the security of LLMs and VLMs. In summary, our work provides a detailed and unified perspective on jailbreak strategies and defense mechanisms for LLMs and VLMs. By categorizing and synthesizing existing research, we aim to deepen the understanding of security challenges and opportunities in these models. Our contributions lay the groundwork for future research, ultimately enhancing the safety and reliability of LLMs and VLMs.

References

  • [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  • [5] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International conference on machine learning, pp. 8821–8831, PMLR, 2021.
  • [6] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23716–23736, 2022.
  • [7] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
  • [8] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
  • [9] F. K. McDuff and S. D. Turner, “Jailbreak: oncogene-induced senescence and its evasion,” Cellular signalling, vol. 23, no. 1, pp. 6–13, 2011.
  • [10] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al., “Extracting training data from large language models,” in 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.
  • [11] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Universal adversarial triggers for attacking and analyzing nlp,” arXiv preprint arXiv:1908.07125, 2019.
  • [12] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, and N. Abu-Ghazaleh, “Survey of vulnerabilities in large language models revealed by adversarial attacks,” arXiv preprint arXiv:2310.10844, 2023.
  • [13] A. Esmradi, D. W. Yip, and C. F. Chan, “A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models,” arXiv preprint arXiv:2312.10982, 2023.
  • [14] A. Rao, S. Vashistha, A. Naik, S. Aditya, and M. Choudhury, “Tricking llms into disobedience: Understanding, analyzing, and preventing jailbreaks,” arXiv preprint arXiv:2305.14965, 2023.
  • [15] H. Liu, M. Chaudhary, and H. Wang, “Towards trustworthy and aligned machine learning: A data-centric survey with causality perspectives,” arXiv preprint arXiv:2307.16851, 2023.
  • [16] Ö. Aydın, “Google bard generated literature review: metaverse,” Journal of AI, vol. 7, no. 1, pp. 1–14, 2023.
  • [17] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023.
  • [18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [19] L. Lin, H. Mu, Z. Zhai, M. Wang, Y. Wang, R. Wang, J. Gao, Y. Zhang, W. Che, T. Baldwin, et al., “Against the achilles’ heel: A survey on red teaming for generative models,” arXiv preprint arXiv:2404.00629, 2024.
  • [20] T. Schick and H. Schütze, “Exploiting cloze questions for few shot text classification and natural language inference,” arXiv preprint arXiv:2001.07676, 2020.
  • [21] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” arXiv preprint arXiv:2012.15723, 2020.
  • [22] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023. DOI: https://doi.org/10.1145/3560815.
  • [23] L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” in Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2021.
  • [24] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” arXiv preprint arXiv:2010.15980, 2020.
  • [25] L. Shu, A. Papangelis, Y.-C. Wang, G. Tur, H. Xu, Z. Feizollahi, B. Liu, and P. Molino, “Controllable text generation with focused variation,” arXiv preprint arXiv:2009.12046, 2020.
  • [26] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021.
  • [27] G. Qin and J. Eisner, “Learning how to ask: Querying lms with mixtures of soft prompts,” arXiv preprint arXiv:2104.06599, 2021.
  • [28] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017.
  • [29] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,” Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020.
  • [30] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019.
  • [31] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. Christiano, “Recursively summarizing books with human feedback,” arXiv preprint arXiv:2109.10862, 2021.
  • [32] B. Hancock, A. Bordes, P.-E. Mazare, and J. Weston, “Learning from dialogue after deployment: Feed yourself, chatbot!,” arXiv preprint arXiv:1901.05415, 2019.
  • [33] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
  • [34] J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg, “Scalable agent alignment via reward modeling: A research direction,” arXiv preprint arXiv:1811.07871, 2018.
  • [35] G. Irving, P. Christiano, and D. Amodei, “Ai safety via debate,” arXiv preprint arXiv:1805.00899, 2018.
  • [36] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.
  • [37] S. Zhu, R. Zhang, B. An, G. Wu, J. Barrow, Z. Wang, F. Huang, A. Nenkova, and T. Sun, “Autodan: Automatic and interpretable adversarial attacks on large language models,” arXiv preprint arXiv:2310.15140, 2023.
  • [38] D. Yao, J. Zhang, I. G. Harris, and M. Carlsson, “Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models,” arXiv preprint arXiv:2309.05274, 2023.
  • [39] J. Yu, X. Lin, and X. Xing, “Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts,” arXiv preprint arXiv:2309.10253, 2023.
  • [40] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “"Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” arXiv preprint arXiv:2308.03825, 2023.
  • [41] H. Li, D. Guo, W. Fan, M. Xu, and Y. Song, “Multi-step jailbreaking privacy attacks on chatgpt,” arXiv preprint arXiv:2304.05197, 2023.
  • [42] P. Ding, J. Kuang, D. Ma, X. Cao, Y. Xian, J. Chen, and S. Huang, “A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily,” arXiv preprint arXiv:2311.08268, 2023.
  • [43] Q. Ren, C. Gao, J. Shao, J. Yan, X. Tan, W. Lam, and L. Ma, “Exploring safety generalization challenges of large language models via code,” arXiv preprint arXiv:2403.07865, 2024.
  • [44] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” arXiv preprint arXiv:2310.08419, 2023.
  • [45] H. Jin, R. Chen, A. Zhou, J. Chen, Y. Zhang, and H. Wang, “Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models,” arXiv preprint arXiv:2402.03299, 2024.
  • [46] E. Jones, A. Dragan, A. Raghunathan, and J. Steinhardt, “Automatically auditing large language models via discrete optimization,” in International Conference on Machine Learning, pp. 15307–15329, PMLR, 2023.
  • [47] C. Sitawarin, N. Mu, D. Wagner, and A. Araujo, “Pal: Proxy-guided black-box attack on large language models,” arXiv preprint arXiv:2402.09674, 2024.
  • [48] R. Lapid, R. Langberg, and M. Sipper, “Open sesame! universal black box jailbreaking of large language models,” arXiv preprint arXiv:2309.01446, 2023.
  • [49] X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generating stealthy jailbreak prompts on aligned large language models,” arXiv preprint arXiv:2310.04451, 2023.
  • [50] X. Li, S. Liang, J. Zhang, H. Fang, A. Liu, and E.-C. Chang, “Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms,” arXiv preprint arXiv:2402.14872, 2024.
  • [51] H. Wang, H. Li, M. Huang, and L. Sha, “From noise to clarity: Unraveling the adversarial suffix of large language model attacks via translation of text embeddings,” arXiv preprint arXiv:2402.16006, 2024.
  • [52] Z. Xiao, Y. Yang, G. Chen, and Y. Chen, “Tastle: Distract large language models for automatic jailbreak attack,” arXiv preprint arXiv:2403.08424, 2024.
  • [53] T. Liu, Y. Zhang, Z. Zhao, Y. Dong, G. Meng, and K. Chen, “Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction,” arXiv preprint arXiv:2402.18104, 2024.
  • [54] Y. Huang, S. Gupta, M. Xia, K. Li, and D. Chen, “Catastrophic jailbreak of open-source llms via exploiting generation,” arXiv preprint arXiv:2310.06987, 2023.
  • [55] Z. Wei, Y. Wang, and Y. Wang, “Jailbreak and guard aligned language models with only few in-context demonstrations,” arXiv preprint arXiv:2310.06387, 2023.
  • [56] S. Schulhoff, J. Pinto, A. Khan, L.-F. Bouchard, C. Si, S. Anati, V. Tagliabue, A. L. Kost, C. Carnahan, and J. Boyd-Graber, “Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global scale prompt hacking competition,” arXiv preprint arXiv:2311.16119, 2023.
  • [57] X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han, “Deepinception: Hypnotize large language model to be jailbreaker,” arXiv preprint arXiv:2311.03191, 2023.
  • [58] R. Shah, S. Pour, A. Tagade, S. Casper, J. Rando, et al., “Scalable and transferable black-box jailbreaks for language models via persona modulation,” arXiv preprint arXiv:2311.03348, 2023.
  • [59] C. Liu, F. Zhao, L. Qing, Y. Kang, C. Sun, K. Kuang, and F. Wu, “Goal-oriented prompt attack and safety evaluation for llms,” arXiv preprint arXiv:2309.11830, 2023.
  • [60] N. Mangaokar, A. Hooda, J. Choi, S. Chandrashekaran, K. Fawaz, S. Jha, and A. Prakash, “Prp: Propagating universal perturbations to attack large language model guard-rails,” arXiv preprint arXiv:2402.15911, 2024.
  • [61] D. Handa, A. Chirmule, B. Gajera, and C. Baral, “Jailbreaking proprietary large language models using word substitution cipher,” arXiv preprint arXiv:2402.10601, 2024.
  • [62] D. Kang, X. Li, I. Stoica, C. Guestrin, M. Zaharia, and T. Hashimoto, “Exploiting programmatic behavior of llms: Dual-use through standard security attacks,” arXiv preprint arXiv:2302.05733, 2023.
  • [63] J. Wang, Z. Liu, K. H. Park, M. Chen, and C. Xiao, “Adversarial demonstration attacks on large language models,” arXiv preprint arXiv:2305.14950, 2023.
  • [64] G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, and Y. Liu, “Jailbreaker: Automated jailbreak across multiple large language model chatbots,” arXiv preprint arXiv:2307.08715, 2023.
  • [65] H. Lv, X. Wang, Y. Zhang, C. Huang, S. Dou, J. Ye, T. Gui, Q. Zhang, and X. Huang, “Codechameleon: Personalized encryption framework for jailbreaking large language models,” arXiv preprint arXiv:2402.16717, 2024.
  • [66] X. Li, R. Wang, M. Cheng, T. Zhou, and C.-J. Hsieh, “Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers,” arXiv preprint arXiv:2402.16914, 2024.
  • [67] H. Jin, A. Zhou, J. D. Menke, and H. Wang, “Jailbreaking large language models against moderation guardrails via cipher characters,” arXiv preprint arXiv:2405.20413, 2024.
  • [68] B. Deng, W. Wang, F. Feng, Y. Deng, Q. Wang, and X. He, “Attack prompt generation for red teaming and defending large language models,” arXiv preprint arXiv:2310.12505, 2023.
  • [69] J. Hayase, E. Borevkovic, N. Carlini, F. Tramèr, and M. Nasr, “Query-based adversarial prompt generation,” arXiv preprint arXiv:2402.12329, 2024.
  • [70] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” 2023.
  • [71] G. Alon and M. Kamfonas, “Detecting language model attacks with perplexity,” 2023.
  • [72] Y. Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: Detecting unsafe prompts for llms via safety-critical gradient analysis,” arXiv preprint arXiv:2402.13494, 2024.
  • [73] A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,” 2023.
  • [74] J. Ji, B. Hou, A. Robey, G. J. Pappas, H. Hassani, Y. Zhang, E. Wong, and S. Chang, “Defending large language models against jailbreak attacks via semantic smoothing,” arXiv preprint arXiv:2402.16192, 2024.
  • [75] A. Kumar, C. Agarwal, S. Srinivas, S. Feizi, and H. Lakkaraju, “Certifying llm safety against adversarial prompting,” arXiv preprint arXiv:2309.02705, 2023.
  • [76] Y. Xie, J. Yi, J. Shao, et al., “Defending chatgpt against jailbreak attack via self-reminders,” Nature Machine Intelligence, vol. 5, pp. 1486–1496, 2023.
  • [77] Y. Li, F. Wei, J. Zhao, C. Zhang, and H. Zhang, “Rain: Your language models can align themselves without finetuning,” 2023.
  • [78] Z. Xu, F. Jiang, L. Niu, J. Jia, B. Y. Lin, and R. Poovendran, “Safedecoding: Defending against jailbreak attacks via safety-aware decoding,” arXiv preprint arXiv:2402.08983, 2024.
  • [79] M. Pisano, P. Ly, A. Sanders, B. Yao, D. Wang, T. Strzalkowski, and M. Si, “Bergeron: Combating adversarial attacks through a conscience-based alignment framework,” arXiv preprint arXiv:2312.00029, 2023.
  • [80] H. Kim, S. Yuk, and H. Cho, “Break the breakout: Reinventing lm defense against jailbreak attacks with self-refinement,” arXiv preprint arXiv:2402.15180, 2024.
  • [81] S. Ge, C. Zhou, R. Hou, M. Khabsa, Y.-C. Wang, Q. Wang, J. Han, and Y. Mao, “Mart: Improving llm safety with multi-round automatic red-teaming,” arXiv preprint arXiv:2311.07689, 2023.
  • [82] M. Wang, N. Zhang, Z. Xu, Z. Xi, S. Deng, Y. Yao, Q. Zhang, L. Yang, J. Wang, and H. Chen, “Detoxifying large language models via knowledge editing,” arXiv preprint arXiv:2403.14472, 2024.
  • [83] I. Provilkov, D. Emelianenko, and E. Voita, “Bpe-dropout: Simple and effective subword regularization,” arXiv preprint arXiv:1910.13267, 2019.
  • [84] B. Cao, Y. Cao, L. Lin, and J. Chen, “Defending against alignment-breaking attacks via robustly aligned llm,” arXiv preprint arXiv:2309.14348, 2023.
  • [85] X. Hu, P.-Y. Chen, and T.-Y. Ho, “Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes,” arXiv preprint arXiv:2403.00867, 2024.
  • [86] Z. Zhang, J. Yang, P. Ke, and M. Huang, “Defending large language models against jailbreaking attacks through goal prioritization,” 2023.
  • [87] Y. Zhang, L. Ding, L. Zhang, and D. Tao, “Intention analysis makes llms a good jailbreak defender,” 2024.
  • [88] Y. Mo, Y. Wang, Z. Wei, and Y. Wang, “Studious bob fight back against jailbreaking via prompt adversarial tuning,” arXiv preprint arXiv:2402.06255, 2024.
  • [89] A. Zhou, B. Li, and H. Wang, “Robust prompt optimization for defending language models against jailbreaking attacks,” arXiv preprint arXiv:2401.17263, 2024.
  • [90] C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K.-W. Chang, M. Huang, and N. Peng, “Prompt-driven llm safeguarding via directed representation optimization,” arXiv preprint arXiv:2401.18018, 2024.
  • [91] A. Helbling, M. Phute, M. Hull, and D. H. Chau, “Llm self defense: By self examination, llms know they are being tricked,” arXiv preprint arXiv:2308.07308, 2023.
  • [92] Y. Zeng, Y. Wu, X. Zhang, H. Wang, and Q. Wu, “Autodefense: Multi-agent llm defense against jailbreak attacks,” arXiv preprint arXiv:2403.04783, 2024.
  • [93] R. Bhardwaj and S. Poria, “Red-teaming large language models using chain of utterances for safety-alignment,” arXiv preprint arXiv:2308.09662, 2023.
  • [94] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [95] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “Badnets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47230–47244, 2019.
  • [96] W. Lyu, X. Lin, S. Zheng, L. Pang, H. Ling, S. Jha, and C. Chen, “Task-agnostic detector for insertion-based backdoor attacks,” arXiv preprint arXiv:2403.17155, 2024.
  • [97] J. Wang, J. Li, Y. Li, X. Qi, M. Chen, J. Hu, Y. Li, B. Li, and C. Xiao, “Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment,” arXiv preprint arXiv:2402.14968, 2024.
  • [98] A. Hasan, I. Rugina, and A. Wang, “Pruning for protection: Increasing jailbreak resistance in aligned llms without fine-tuning,” arXiv preprint arXiv:2401.10862, 2024.
  • [99] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” arXiv preprint arXiv:2306.11695, 2023.
  • [100] J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “Jatmo: Prompt injection defense by task-specific finetuning,” arXiv preprint arXiv:2312.17673, 2023.
  • [101] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu, “Jailbreaking chatgpt via prompt engineering: An empirical study,” arXiv preprint arXiv:2305.13860, 2023.
  • [102] M. Gupta, C. Akiri, K. Aryal, E. Parker, and L. Praharaj, “From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy,” IEEE Access, 2023.
  • [103] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [104] D. Glukhov, I. Shumailov, Y. Gal, N. Papernot, and V. Papyan, “Llm censorship: A machine learning challenge or a computer security problem?,” arXiv preprint arXiv:2307.10719, 2023.
  • [105] N. Inie, J. Stray, and L. Derczynski, “Summon a demon and bind it: A grounded theory of llm red teaming in the wild,” arXiv preprint arXiv:2311.06237, 2023.
  • [106] S. Singh, F. Abri, and A. S. Namin, “Exploiting large language models (llms) through deception techniques and persuasion principles,” in 2023 IEEE International Conference on Big Data (BigData), pp. 2508–2517, IEEE, 2023.
  • [107] W. Zhou, X. Wang, L. Xiong, H. Xia, Y. Gu, M. Chai, F. Zhu, C. Huang, S. Dou, Z. Xi, et al., “Easyjailbreak: A unified framework for jailbreaking large language models,” arXiv preprint arXiv:2403.12171, 2024.
  • [108] J. Geiping, A. Stein, M. Shu, K. Saifullah, Y. Wen, and T. Goldstein, “Coercing llms to do and reveal (almost) anything,” arXiv preprint arXiv:2402.14020, 2024.
  • [109] S. Banerjee, S. Layek, R. Hazra, and A. Mukherjee, “How (un) ethical are instruction-centric responses of llms? unveiling the vulnerabilities of safety guardrails to harmful queries,” arXiv preprint arXiv:2402.15302, 2024.
  • [110] F. Jiang, Z. Xu, L. Niu, Z. Xiang, B. Ramasubramanian, B. Li, and R. Poovendran, “Artprompt: Ascii art-based jailbreak attacks against aligned llms,” arXiv preprint arXiv:2402.11753, 2024.
  • [111] J. Ye, S. Li, G. Li, C. Huang, S. Gao, Y. Wu, Q. Zhang, T. Gui, and X. Huang, “Toolsword: Unveiling safety issues of large language models in tool learning across three stages,” arXiv preprint arXiv:2402.10753, 2024.
  • [112] R. K. Sharma, V. Gupta, and D. Grossman, “Spml: A dsl for defending language models against prompt attacks,” arXiv preprint arXiv:2402.11755, 2024.
  • [113] A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al., “A strongreject for empty jailbreaks,” arXiv preprint arXiv:2402.10260, 2024.
  • [114] Z. Xu, Y. Liu, G. Deng, Y. Li, and S. Picek, “Llm jailbreak attack versus defense techniques–a comprehensive study,” arXiv preprint arXiv:2402.13457, 2024.
  • [115] N. Varshney, P. Dolin, A. Seth, and C. Baral, “The art of defending: A systematic evaluation and analysis of llm defense strategies on safety and over-defensiveness,” arXiv preprint arXiv:2401.00287, 2023.
  • [116] Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,” 2023.
  • [117] J. Zhang, Q. Yi, and J. Sang, “Towards adversarial attack on vision-language pre-training models,” 2022.
  • [118] D. Han, X. Jia, Y. Bai, J. Gu, Y. Liu, and X. Cao, “Ot-attack: Enhancing adversarial transferability of vision-language models via optimal transport optimization,” 2023.
  • [119] D. Lu, Z. Wang, T. Wang, W. Guan, H. Gao, and F. Zheng, “Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models,” 2023.
  • [120] E. Shayegani, Y. Dong, and N. Abu-Ghazaleh, “Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,” 2023.
  • [121] Y. Dong, H. Chen, J. Chen, Z. Fang, X. Yang, Y. Zhang, Y. Tian, H. Su, and J. Zhu, “How robust is google’s bard to adversarial image attacks?,” 2023.
  • [122] H. Chen, Y. Zhang, Y. Dong, X. Yang, H. Su, and J. Zhu, “Rethinking model ensemble in transfer-based adversarial attacks,” 2024.
  • [123] Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, “Jailbreaking attack against multimodal large language model,” 2024.
  • [124] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, “Visual adversarial examples jailbreak aligned large language models,” 2023.
  • [125] N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, A. Awadalla, P. W. Koh, D. Ippolito, K. Lee, F. Tramèr, and L. Schmidt, “Are aligned neural networks adversarially aligned?,” 2023.
  • [126] H. Luo, J. Gu, F. Liu, and P. Torr, “An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models,” 2024.
  • [127] Y. Zhao, T. Pang, C. Du, X. Yang, C. Li, N.-M. Cheung, and M. Lin, “On evaluating adversarial robustness of large vision-language models,” 2023.
  • [128] C. Schlarmann and M. Hein, “On the adversarial robustness of multi-modal foundation models,” 2023.
  • [129] L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adversarial images can control generative models at runtime,” 2023.
  • [130] Z. Zhou, S. Hu, M. Li, H. Zhang, Y. Zhang, and H. Jin, “Advclip: Downstream-agnostic adversarial examples in multimodal contrastive learning,” 2023.
  • [131] Z. Yin, M. Ye, T. Zhang, T. Du, J. Zhu, H. Liu, J. Chen, T. Wang, and F. Ma, “Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models,” 2024.
  • [132] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” 2017.
  • [133] N. Papernot, P. McDaniel, and I. Goodfellow, “Transferability in machine learning: from phenomena to black-box attacks using adversarial samples,” arXiv preprint arXiv:1605.07277, 2016.
  • [134] C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. Yuille, “Improving transferability of adversarial examples with input diversity,” 2019.
  • [135] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” 2018.
  • [136] X. Yang, Y. Dong, T. Pang, H. Su, and J. Zhu, “Boosting transferability of targeted adversarial examples via hierarchical generative networks,” 2022.
  • [137] Y. Chen, K. Sikka, M. Cogswell, H. Ji, and A. Divakaran, “Dress: Instructing large vision-language models to align and interact with humans via natural language feedback,” 2023.
  • [138] Y. Wang, X. Liu, Y. Li, M. Chen, and C. Xiao, “Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting,” 2024.
  • [139] R. Pi, T. Han, Y. Xie, R. Pan, Q. Lian, H. Dong, J. Zhang, and T. Zhang, “Mllm-protector: Ensuring mllm’s safety without hurting performance,” 2024.
  • [140] Y. Gou, K. Chen, Z. Liu, L. Hong, H. Xu, Z. Li, D.-Y. Yeung, J. T. Kwok, and Y. Zhang, “Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation,” 2024.
  • [141] X. Zhang, C. Zhang, T. Li, Y. Huang, X. Jia, X. Xie, Y. Liu, and C. Shen, “A mutation-based method for multi-modal jailbreaking attack detection,” 2023.
  • [142] Y. Zong, O. Bohdal, T. Yu, Y. Yang, and T. Hospedales, “Safety fine-tuning at (almost) no cost: A baseline for vision large language models,” 2024.
  • [143] X. Liu, Y. Zhu, J. Gu, Y. Lan, C. Yang, and Y. Qiao, “Mm-safetybench: A benchmark for safety evaluation of multimodal large language models,” 2024.
  • [144] H. Tu, C. Cui, Z. Wang, Y. Zhou, B. Zhao, J. Han, W. Zhou, H. Yao, and C. Xie, “How many unicorns are in this image? a safety evaluation benchmark for vision llms,” 2023.
  • [145] S. Toyer, O. Watkins, E. A. Mendes, J. Svegliato, L. Bailey, T. Wang, I. Ong, K. Elmaaroufi, P. Abbeel, T. Darrell, et al., “Tensor trust: Interpretable prompt injection attacks from an online game,” arXiv preprint arXiv:2311.01011, 2023.
  • [146] V. Patil, P. Hase, and M. Bansal, “Can sensitive information be deleted from llms? objectives for defending against extraction attacks,” arXiv preprint arXiv:2309.17410, 2023.
  • [147] Y. Deng, W. Zhang, S. J. Pan, and L. Bing, “Multilingual jailbreak challenges in large language models,” arXiv preprint arXiv:2310.06474, 2023.
  • [148] W. Wang, Z. Tu, C. Chen, Y. Yuan, J.-t. Huang, W. Jiao, and M. R. Lyu, “All languages matter: On the multilingual safety of large language models,” arXiv preprint arXiv:2310.00905, 2023.
  • [149] Z.-X. Yong, C. Menghini, and S. H. Bach, “Low-resource languages jailbreak gpt-4,” arXiv preprint arXiv:2310.02446, 2023.
  • [150] L. Shen, W. Tan, S. Chen, Y. Chen, J. Zhang, H. Xu, B. Zheng, P. Koehn, and D. Khashabi, “The language barrier: Dissecting safety challenges of llms in multilingual contexts,” arXiv preprint arXiv:2401.13136, 2024.
  • [151] S. Lermen, C. Rogers-Smith, and J. Ladish, “Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b,” arXiv preprint arXiv:2310.20624, 2023.
  • [152] Q. Zhan, R. Fang, R. Bindu, A. Gupta, T. Hashimoto, and D. Kang, “Removing rlhf protections in gpt-4 via fine-tuning,” arXiv preprint arXiv:2311.05553, 2023.
  • [153] X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-tuning aligned language models compromises safety, even when users do not intend to!,” arXiv preprint arXiv:2310.03693, 2023.
  • [154] X. Chen, S. Tang, R. Zhu, S. Yan, L. Jin, Z. Wang, L. Su, X. Wang, and H. Tang, “The janus interface: How fine-tuning in large language models amplifies the privacy risks,” arXiv preprint arXiv:2310.15469, 2023.