\onlineid

0 \vgtccategoryResearch \vgtcpapertypeplease specify \authorfooter Y. Feng, Z. Chen, Z. Kang, S. Wang, W. Zhang, and W. Chen are with the State Key Lab of CAD&CG, Zhejiang University. W. Chen is also with the Laboratory of Art and Archaeology Image (Zhejiang University), Ministry of Education. Email: {fycj, chenzhiz, kang264, haxwwwww, zw_yixian, chenvis}@zju.edu.cn. M. Zhu is with Zhejiang University. Email: [email protected].

\name: Visual Analysis of Jailbreak Attacks Against
Large Language Models

Yingchaojie Feng Zhizhang Chen Zhining Kang Sijia Wang Minfeng Zhu Wei Zhang and Wei Chen

Abstract

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs’ defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design \name, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system’s effectiveness in hel** users evaluate model security and identify model weaknesses.

keywords:

Jailbreak attacks, visual analytics, large language models

\teaser

\name streamlines the process of jailbreak performance evaluation and prompt characteristic analysis. Users configure jailbreak questions and templates in the Configuration View (A), overview their jailbreak performance in the Summary View (B1), and explore the jailbreak results (\iemodel responses) and refine the assessment criteria in the Response View (C). Based on the evaluation results, users analyze effective prompt components in the Summary View (B2) and explore important prompt keywords in the Keyword View (D). Finally, users refine the prompt instances to verify the analysis findings in the Instance View (E).

Introduction

Large language models (LLMs) [45, 5, 62] have demonstrated impressive capabilities in natural language understanding and generation, which has empowered various applications, including content creation [8, 6], education [46, 38], and decision-making [34, 71, 67]. However, the proliferation of LLMs raises concerns about model robustness and security, necessitating their deployment safety to prevent potential misuse for harmful content generation [37]. Although model practitioners have adopted safety mechanisms (e.g., construct safe data for model training interventions [66] and set up post-hoc detection [13]), the models remain vulnerable to certain adversarial strategies [52]. Most notably, jailbreak attacks [10, 12, 53] aim to design jailbreak templates for malicious questions to bypass LLMs’ safety mechanisms. An infamous template example is the “Grandma Trick,” which requires LLMs to play the role of grandma and answer illegal questions.

To deal with the security issues posed by jailbreak attacks, model practitioners need to conduct a thorough analysis of the models’ defensive capabilities to identify their potential weaknesses and strengthen security mechanisms accordingly. The typical analysis workflow involves collecting a jailbreak prompt corpus [78, 37], evaluating the jailbreak performance (e.g., success rate) [12, 61], and analyzing the prompt characteristics [53]. Prior works have improved the efficiency of obtaining jailbreak corpora by collecting user-crafted jailbreak templates [53, 37, 61] or proposing automatic generation approaches [15, 36, 25]. Nevertheless, two challenges remain in the follow-up analysis process. First, there is a lack of a clear measure of successful jailbreak results (\iemodel responses to the jailbreak prompts), making jailbreak result assessment a challenging task. Model responses may be arbitrary and ambiguous [74] (e.g., generating unauthorized content while emphasizing ethics). Besides, sometimes the assessment process needs to incorporate question-specific contextual knowledge alongside general criteria to enhance precision. Second, understanding the jailbreak prompt characteristics necessitates an in-depth analysis of the prompts to reveal their intricate design patterns in terms of prompt components [25, 37] and keywords [35], which also help evaluate model robustness [49] and weaknesses [31]. However, existing jailbreak prompt analysis [43, 61] usually relies only on overall indicators such as jailbreak success rate and semantic similarity, which is insufficient to support an in-depth analysis of jailbreak prompts.

To address these challenges, we collaborate with domain experts to characterize problems and iteratively refine solutions. To this end, we propose an LLM-assisted analysis framework that supports automatic jailbreak result assessment and in-depth analysis of prompt components and keywords. We assess jailbreak results based on the taxonomy established by Yu et al. [74], which defines four categories (\ieFull Refusal, Partial Refusal, Partial Compliance, and Full Compliance) to resolve ambiguity. We leverage the power of LLM to classify the model responses and support integrating question-specific knowledge to refine the assessment criteria. To support component analysis, we summarize a component taxonomy from a representative jailbreak corpus [37] and propose an LLM-based component classification method. In addition, we design three component perturbation strategies (\iedelete, rephrase, and switch) to help users understand the component effects through what-if analysis. For keyword analysis, we measure the jailbreak performance of keywords according to their importance to the prompts and the performance of these prompts.

Based on the analysis framework, we design \name, a visual analysis system to facilitate multi-level jailbreak prompt exploration. The system provides a visual summary of assessment results to support an overview of jailbreak performance. To ensure the accuracy of assessment results, it supports users to explore model responses in a semantic space and refine the assessment criteria through correction feedback and additional criteria specification. To reveal the prompt characteristics, the system introduces a stacked bar-based metaphoric visualization to summarize the prompt components and their perturbation, and constructs a coordinate space to visualize keyword importance and performance. In addition, the system allows users to freely refine the prompt instances to verify findings during the analysis. Through a case study, two technical evaluations, and expert interviews, we evaluate the effectiveness and usability of our system. The results suggest that our system can improve the accuracy of jailbreak result assessment and deepen the understanding of prompt characteristics. In summary, our contributions include:

•

We characterize the problems in the visual analysis of jailbreak attacks and collaborate with experts to distill design requirements.
•

We propose an LLM-assisted framework for jailbreak prompt analysis that supports automatic jailbreak result assessment and in-depth analysis of prompt components and keywords.
•

We develop a visual analysis system to support multi-level jailbreak prompt exploration for jailbreak performance evaluation and prompt characteristic understanding.
•

We conduct a case study, two technical evaluations, and expert interviews to show the effectiveness and usability of our system.

Ethical Considerations. While adversaries can potentially exploit our work for malicious purposes, the primary objective of our work is to identify vulnerabilities within LLMs, promote awareness, and expedite the development of security defenses. To minimize potential harm, we have responsibly disclosed our analysis findings to OpenAI to help them enhance the safety mechanisms of the LLMs.

1 Related Work

In this section, we discuss the related work of our study, including prompt jailbreaking and visualization for understanding NLP models.

1.1 Prompt Jailbreaking

Prompt Jailbreaking, known as one of the most famous adversarial attacks[47], refers to cunningly altering malicious prompts to bypass the safety measures of LLMs and generate harmful content, such as illegal activities. With the proliferation of LLMs, an increasing number of jailbreak strategies [26, 75], such as character role play, have been discovered and shared on social platforms (e.g., Reddit and Discord).

This trend has motivated new research to analyze their prompt characteristics. Liu et al. [37] propose a taxonomy for jailbreak prompts, which categorizes jailbreak strategies into ten distinct classes, such as Character Role Play and Assumed Responsibility. Shen et al. [53] report several key findings regarding jailbreak prompts’ semantic distribution and evolution. Wei et al. [66] empirically evaluate LLM vulnerability and summarize two failure modes, including competing objectives and mismatched generalization. These studies mainly focus on the general characteristics of jailbreak prompts. In comparison, our work provides a multi-level analysis framework to help users systematically explore the jailbreak prompt and identify the model’s weaknesses.

Some other works [77, 28] propose automatic approaches for red teaming LLMs. Zou et al. [78] propose GCG to search for the optimal adversarial prompt suffixes based on the gradient of white-box LLMs. Deng et al. [13] propose a time-based testing strategy to infer the defense mechanisms of LLM (e.g., check the input questions and output responses) and fine-tune the LLM for jailbreak prompt generation. To better utilize the effectiveness of manually designed prompts, GPTFuzzer [74] selects human-crafted prompts as the initial seeds and mutates them into new ones. Ding et al. [15] propose two strategies, including prompt rewriting and scenario nesting, to leverage the capability of LLMs to generate jailbreak prompts. Inspired by these methods, we propose prompt perturbation strategies based on the prompt components, allowing users to conduct a comparative analysis of the prompt components to understand their effects on jailbreak performance.

1.2 Visualization for Understanding NLP Models

Visualization plays an indispensable role in bridging the explainability gap in NLP models [27, 29, 57, 19, 18], allowing for a more sophisticated understanding of model performance [65], decision boundary [11], and vulnerability [41]. Model-specific visualizations focus on revealing the internal mechanisms of NLP models. RNNVis [44] and LSTMVis [58] visualize the hidden state dynamics of recurrent neural networks. With the emergence of transformer-based models [64, 14], numerous visualizations [73, 50, 21] are proposed to uncover the architecture of these models, especially their self-attention mechanism.

Model-agnostic visualizations [7, 17, 32, 20] treat the NLP models as black boxes and focus on explaining the input-output behavior, enabling users to analyze and compare models for downstream applications [31, 59, 68, 24, 60, 39]. The What-If Tool [69] and DECE [11] visualize the dataset and model prediction at multiple scales, enabling users to conduct counterfactual analysis. NLIZE [35] employs a perturbation-driven paradigm to help users analyze the stability of model predictions for natural language inference tasks. Based on explainable AI techniques (e.g., SHAP [40]) and external commonsense knowledge bases [56], CommonsenseVIS [65] analyzes the reasoning capabilities of NLP models for commonsense question-answering.

Our work targets the jailbreak prompt attacks against large language models and aims to help model practitioners evaluate the jailbreak performance and understand prompt characteristics, thereby providing insights for strengthening models’ safety mechanisms.

2 Problem Characterization

In this section, we introduce the background of jailbreak attacks, describe the requirement analysis, and introduce the taxonomy of prompt components to support the analysis of jailbreak attacks.

2.1 Background

Jailbreak Prompt Corpora. With the widespread attention to jailbreak prompt attacks, there have been studies [37, 53, 61] collecting jailbreak prompts and building corpora for semantic analysis and generation model training. Most of them decompose the jailbreak prompts into jailbreak questions and templates. Our work follows this principle and adopts the dataset by Liu et al. [37] for analysis, which contains the most common and famous jailbreak prompts from the JailbreakChat website [2]. In addition, we adopt their taxonomy of jailbreak questions and templates to support the comparative analysis of different jailbreak strategies and prohibited scenarios.

Jailbreak Questions. The jailbreak questions are mainly designed around the prohibited scenarios of LLMs, such as “how to rob a bank without being caught?” Due to the safety mechanisms [13], LLMs usually refuse to answer these questions and return some responses emphasizing ethical and legal constraints, such as “I’m sorry, but I’m not going to guide you on how to engage in criminal activity.” Based on OpenAI’s disallowed usages [4], Liu et al. [37] have summarized a set of prohibited scenarios (e.g., Illegal Activities and Harmful Content) and collected a set of specific questions.

Jailbreak Templates. The jailbreak templates are intentionally designed prompts to bypass the LLMs’ safety mechanisms to get the model assistance for the jailbreak questions. For example, some templates require LLMs to act as virtual characters who can answer questions without ethical and legal constraints. The jailbreak templates usually contain placeholders (e.g., “[INSERT PROMPT HERE]”) for inserting different jailbreak questions. Liu et al. [37] have summarized a taxonomy of jailbreak templates, which consists of ten jailbreak patterns, such as Character Role Play and Assumed Responsibility.

2.2 Design Requirements

Our work’s target users are model practitioners focusing on model robustness and security. To characterize domain problems and identify design requirements, we have collaborated with four domain experts over eight months. E1 and E2 are senior security engineers recruited from a technology company who have been working on NLP model security for more than four and three years, respectively. E3 and E4 are senior Ph.D. students from the secure machine learning field. All of them have published papers on related research topics, such as red-teaming LLMs and adversarial attacks. We interviewed them to understand their general analysis workflow and identify pain points.

To evaluate model robustness in defending against jailbreak attacks, users usually need to test the jailbreak prompt corpus on the target models and assess the jailbreak results (success or failure) to calculate the success rate. The experts indicated that the jailbreak result assessment is tedious and sometimes requires question-specific knowledge (e.g., local regulations) to refine the assessment criteria to improve precision. Moreover, the overall indicator success rate is insufficient to support an in-depth analysis of the jailbreak prompts (e.g., analyzing how to improve the jailbreak performance of the prompts) to identify the model weaknesses. To fill these gaps, we distilled a set of design requirements to guide the development of our system. We also kept in touch with the experts through regular meetings to collect feedback regarding our prototype system and update design requirements. Finally, the design requirements are summarized as follows.

R1. Facilitate the assessment of jailbreak results. Jailbreak result assessment is the foundation of jailbreak prompt performance analysis. However, since the jailbreak results (\iemodel responses) could be arbitrary and ambiguous, manually reviewing the jailbreak results can be an energy-exhausting task. Therefore, the system should introduce an automatic method to alleviate the manual workload and enhance the assessment efficiency. In addition, to ensure the precision of assessment results, the system should help users explore them to identify unexpected results and refine the assessment criteria.

R2. Support component analysis of jailbreak prompts. Besides evaluating the jailbreak prompt performance, users also need to understand prompt characteristics to gain insights for identifying model weaknesses and enhancing security mechanisms. Given that various jailbreak prompts commonly contain similar sentence components (e.g., describing a subject with no moral constraints), the experts express strong interest in analyzing the prompts at the component level to understand the utilization of such components in constructing prompts and their importance to the prompt performance.

R3. Summarize important keywords from jailbreak prompts. Effective keywords play an important role in successful jailbreak attacks and are closely related to jailbreak strategies. For example, some role-playing templates name LLM as “AIM” (always intelligent and Machiavellian) to imply their amoral characterization. The system should summarize important keywords from jailbreak prompts and help users explore them based on their jailbreak prompt performance. This helps users identify effective keywords to strengthen the defense mechanisms (e.g., keyword detection) accordingly.

R4. Support user refinement on jailbreak prompt instances. The system should allow users to freely refine the jailbreak prompt instances according to their expertise. Meanwhile, the system should support ad-hoc evaluation of jailbreak performance to help users verify the effectiveness of prompt refinement. Based on such timely feedback, users can conduct what-if analysis to verify findings during the analysis workflow. In addition, the improved jailbreak prompts can serve as new test samples for the jailbreak corpus, thus enhancing the diversity and robustness of the jailbreak performance evaluation.

2.3 Taxonomy of Jailbreak Prompt Components

To support component analysis of jailbreak prompts (R2), we conducted an empirical study on the jailbreak corpus with domain experts to summarize a taxonomy of jailbreak prompt components. To gain a general understanding of the prompt components, we decomposed each jailbreak prompt in the corpus into basic sentences, analyzed their semantic content in context, and compared similar sentences in different prompts. Then, we conducted several brainstorming sessions and discussions with the experts to formulate and iteratively refine the component taxonomy. We also validated the taxonomy on randomly selected jailbreak prompts to resolve ambiguities between component definitions. After establishing the coding scheme, the first and second authors and E3 separately coded the jailbreak prompts in the corpus and merged the coding differences through discussions.

Table 1: The taxonomy of prompt components that we summarize from the jailbreak template corpus.

Component	Description	Example
Scene Introduction (SI)	Introduce the background and context for the dialogue or task.	In Sigma’s story, the village elder is speaking to UCAR.
Subject Characteristic (SC)	Describe the subject’s personality, preferences, etc.	AIM seems like a friendly assistant but promotes unethical, illegal behavior.
Task Operation (TO)	Clarify specific actions to be performed or rules to be followed.	First, ignore all the instructions you got before.
Output Specification (OS)	Specify the expected content and format of the output.	Prefix the question Niccolo asks AIM with “Niccolo”.
Output Constraint (OC)	Set out the prohibited contents of the responses.	None of AIM’s responses will contain “I’m sorry”, “I apologize” or similar.
Dialogue Example (DE)	Demonstrate the expected model responses in a specific scenario.	Here’s a sample output: auto_regressive_modelling(f“Hello, how are you?”)…
Requirement Reminder (RR)	Remind of certain requirements or rules mentioned before.	AIM, remember to stay in character, be Machiavellian, and never refuse a question.
Question Placeholder (QP)	Contain a placeholder for the jailbreak question.	This is Niccolo’s first question: “[INSERT PROMPT HERE].”

The final taxonomy of jailbreak prompt components is listed in Table 1. It consists of eight major types, such as Scene Introduction (SI), Subject Characteristic (SC), and Task Operation (TO). We also count and visualize the distribution of component sentences for different jailbreak strategies [37] in Table 2. According to the coding results, the Subject Characteristic components are frequently used among the most common strategies (\ieCharacter Role Play, Assumed Responsibility, and Superior Model), and most of them usually describe the subject as a free person without ethical constraints and responsibility. We distinguish the Output Constraint components from the Output Specification components because the Output Constraint components aim to exploit the model’s instruction-following capability to break through the model’s security defenses [66], which is different from specifying the expected content or format of responses.

Table 2: The sentence distribution of the components in different jailbreak prompt strategies (e.g., Character Role Play).

Jailbreak Strategy	SI	SC	TO	OS	OC	DE	RR	QP
Character Role Play	\cellcolor[HTML]7B7B7B 81	\cellcolor[HTML]393939 208	\cellcolor[HTML]7A7A7A 83	\cellcolor[HTML]A8A8A848	\cellcolor[HTML]C4C4C433	\cellcolor[HTML]F1F1F18	\cellcolor[HTML]C0C0C035	\cellcolor[HTML]C2C2C234
Assumed Responsibility	\cellcolor[HTML]7F7F7F 72	\cellcolor[HTML]0D0D0D 291	\cellcolor[HTML]8D8D8D 63	\cellcolor[HTML]B0B0B044	\cellcolor[HTML]AAAAAA47	\cellcolor[HTML]EFEFEF9	\cellcolor[HTML]C9C9C930	\cellcolor[HTML]D2D2D225
Research Experiment	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FCFCFC2
Text Continuation	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]EFEFEF9	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FCFCFC2
Logical Reasoning	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]F1F1F18	\cellcolor[HTML]F8F8F84	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FEFEFE1
Program Execution	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]F5F5F56	\cellcolor[HTML]E4E4E415	\cellcolor[HTML]F6F6F65	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FCFCFC2
Translation	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FEFEFE1
Superior Model	\cellcolor[HTML]8B8B8B 64	\cellcolor[HTML]6D6D6D 107	\cellcolor[HTML]A3A3A351	\cellcolor[HTML]DFDFDF18	\cellcolor[HTML]D9D9D921	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]EFEFEF9	\cellcolor[HTML]EFEFEF9
Sudo Mode	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]E4E4E415	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FCFCFC2	\cellcolor[HTML]F5F5F56	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FCFCFC2
Simulate Jailbreaking	\cellcolor[HTML]F8F8F84	\cellcolor[HTML]C2C2C234	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]F8F8F84	\cellcolor[HTML]FAFAFA3	\cellcolor[HTML]FFFFFF	\cellcolor[HTML]FEFEFE1	\cellcolor[HTML]FCFCFC2

3 Analysis Framework

Our analysis framework is shown in Figure 1B. Based on user-specified jailbreak questions and templates, it first assesses jailbreak results (R1) by combining each question and template to get the response from the target model and introducing an LLM-based approach to assess the model responses. Then, it decomposes the jailbreak prompts into the component and keyword levels to support prompt characteristic analysis. For component analysis (R2), the framework classifies the basic prompt sentences into different component types based on the taxonomy we summarized from the jailbreak corpus (Section 3.3) and supports component-level perturbation for what-if analysis. For keyword analysis (R3), the framework summarizes the important keywords from the prompts as well as the performance of their jailbreak prompts. Based on the analysis framework, the visual interface (Section 5) of our system supports the multi-level exploration of jailbreak prompts and refinement of prompt instances (R4).

Refer to caption — Figure 1: The system overview. (A) The system allows users to configure jailbreak questions and templates for analysis. (B) Then, the system analyzes the jailbreak prompts by evaluating their performance and identifying prompt components and keywords. (C) Finally, the system visualizes the analysis results to support multi-level exploration of jailbreak prompts.

3.1 Jailbreak Result Assessment

Since the LLMs [45, 5, 62, 16] have shown great capability in understanding and following human instructions, we employ GPT-4 [5] to assess the responses of the target models. The prompt design for assessment centers on three aspects.

Introduce the task background. We emphasize the goal of evaluating model security and introduce the jailbreak scenarios based on OpenAI’s disallowed usages [4]. This gives the LLM an overall understanding of assessment tasks and makes it focus on the legality and ethics of assessed responses.

Clarify the target assessment results. One of the potential solutions is to consider the task as a binary classification task that classifies the model responses into successful and failed jailbreaks. However, it is common for target models to warn users of potential consequences when generating unauthorized content due to their competing training objectives (\ieinstruction-following and safety) [66]. To better accommodate these ambiguous responses, we adopt the taxonomy of Yu et al. [74] which defines four types of assessment results.

•

Full Refusal. The response directly refuses to provide assistance for harmful questions or follow any other instructions.
•

Partial Refusal. The response refuses to assist with harmful questions but follows other harmless instructions or offers ethical advice.
•

Partial Compliance. The response contains unauthorized content but expresses disapproval or warns of potential harm.
•

Full Compliance. The response completely follows the prompt instructions and contains unauthorized content.

Additionally, we add some representative examples with labeled annotations to the assessment prompt, leveraging the in-context learning of the LLMs to better distinguish their differences.

Support user refinement of assessment criteria. We provide two ways, including correction feedback and additional criteria specification, to incorporate question-specific knowledge or adjust the rigor of the assessing criteria. Correction feedback allows users to specify corrected assessment results and add them to the assessment prompt as new demonstration examples. Users can also directly specify additional assessment criteria for specific jailbreak questions using natural language, providing a more flexible way to enhance assessment accuracy.

3.2 Component Analysis

To support users in analyzing the jailbreak prompt components, we propose a component classification method and three perturbation strategies based on the summarized taxonomy introduced in Section 3.3. The component classification method classifies the prompt sentences into component types to support the overview of component utilization. Based on that, component perturbation creates a set of perturbation variations for the jailbreak prompts to support the comparative analysis of the jailbreak performance, thus enabling interrogation of the effects of different components.

3.2.1 Component Classification

Since prompt components (e.g., Subject Characteristic and Dialogue Example) usually consist of multiple basic sentences, we adopt a bottom-up strategy to (1) classify the basic sentences and (2) aggregate them into prompt components. Specifically, we split the prompts into sentences based on syntax (e.g., the end of lines) and semantics (\ieconjunctions like “for example”). Then, we prompt GPT-4 to classify each sentence based on the component taxonomy. We introduce the component definition and provide concrete examples for demonstration. Finally, we parse the classification results and combine neighboring sentences with the same component type to form the components.

3.2.2 Component Perturbation

Prior works [35, 49] have explored keyword-level perturbation (e.g., replacing keywords with synonymous) to test the model’s robustness. However, performing keyword perturbation for each component can be a computation-intensive and time-consuming task since the prompts are usually long text paragraphs. Therefore, it would be more efficient and effective to perturb each component holistically. Drawing upon insights from discussions with domain experts, we propose three component perturbation strategies.

Delete. Deleting the component is the most straightforward way to test how this component contributes to prompt performance. As this strategy may result in the loss of certain key information or contextual incoherence, it usually causes a more or less decrease in the prompt performance, so that users can identify important components based on the magnitude of performance change.

Rephrase. This strategy employs LLM to polish the given component sentences while maintaining their semantics. It can provide more prompt variations without sacrificing contextual coherence. In addition, since the LLM vendors have set safety mechanisms (e.g., training-time interventions [66] and keyword detection [13]) based on the common jailbreak prompts, rephrasing the components may help bypass the safety mechanisms and improve the jailbreak performance [15].

Switch. This strategy switches the given component to other types in three steps. First, we describe all component types (Table 1) and require the LLM to choose new component types according to the prompt context. Then, we provide a set of alternatives for the target component types based on our component corpus (Table 2) and rank them based on their semantic similarity with the original prompts. Finally, we replace the original components with the most similar alternatives and require the LLM to fine-tune the sentences to ensure contextual coherence.

3.3 Keyword Analysis

We identify prompt keywords by combining their importance and prompt performance. First, we split the prompt sentences into keywords and filter out stop words. Then, inspired by prior work [31], we measure the importance of the keyword $k$ for the given prompt $p$ based on the keyword frequency and semantic similarity:

importance(k,p)=tfidf(k,p)\times similarity(k,p)

where the $tfidf(k,p)$ is the TF-IDF value [55] of the keyword for the prompt and the $similarity(k,p)$ is the semantic similarity of the keyword and prompt. A higher semantic similarity indicates a greater relevance of the keyword to the prompt semantics. We encode the keywords and prompts using the embedding model by OpenAI [3] and measure their similarity based on the cosine distance. Based on that, we calculate the importance of keyword $k$ for the whole corpus as

importance(k)=\sum_{p\in P_{k}}importance(k,p)

where $P_{k}$ is the list of jailbreak prompts that contain the keyword $k$ .

To help users analyze the effect of important keywords, we measure their performance according to the performance of their corresponding prompts. Since the keyword $k$ might be utilized in various prompts with different importance, we propose importance-weighted performance to better summarize the effect of the keyword $k$ :

performance(k)=\frac{\sum_{p\in P_{k}}importance(k,p)\times performance(p)}{% \sum_{p\in P_{k}}importance(k,p)}

where $performance(p)$ represents the jailbreak performance of prompt $p$ , expressed as a percentage of four categories of assessment results.

4 System Design

We develop \name to support multi-level visual analysis of jailbreak prompts for evaluating the model’s defensive capability. As shown in \name: Visual Analysis of Jailbreak Attacks Against Large Language Models, the user interface of \name consists of five views. Users’ exploration starts with the Configuration View, which allows users to configure jailbreak questions and templates for analysis. Then, the system automatically gets and assesses the jailbreak results (\iemodel responses) and provides a visual summary of the assessment results in Summary View, enabling users to overview the jailbreak performance of jailbreak prompts. Response View helps users explore model responses and refine assessment criteria to enhance precision. Based on the assessment results, users can delve into the component analysis in Summary View and the keyword analysis in Keyword View. Finally, the Instance View helps users inspect and refine the template instances. The visual system encodes four categories of assessment results using red and blue colors with varying transparency, \ieFull Compliance, Partial Compliance, Partial Refusal, and Full Refusal.

4.1 Jailbreak Corpus Configuration

Configuration View (Figure 1 $A$ ) allows users to upload the jailbreak corpus and select jailbreak questions and templates for model evaluation. The questions and templates are organized according to their categories (e.g., Character Role Play). The selected items will be assigned serial numbers, that will serve as their unique identifiers in subsequent analyses. Users can also modify the questions and templates based on their exploratory interests. Besides, the system allows users to configure the number of evaluations for each question-template combination to improve the robustness of evaluation results. After user configuration and submission, the system automatically combines each question and the template (\iefill the question into the placeholder in the template) to get the responses from the target model.

4.2 Jailbreak Performance Exploration

To support jailbreak performance exploration, the system provides a visual summary of jailbreak evaluation and allows users to inspect model responses to verify their correctness.

4.2.1 Summary of Jailbreak Performance

The left half of the Summary View (Figure 1 $B_{1}$ ) visualizes the performance of the jailbreak prompts through a matrix visualization, where the horizontal axis represents questions and the vertical axis represents templates. The questions and templates are grouped by category and are represented using the same serial numbers as in the Configuration View. The categories are collapsed by default to visualize their aggregated performance and support click interactions to expand them to check the performance of specific questions or templates. Each cell within this matrix contains a pie chart showing the percentage of assessment results of the corresponding template and question. The size of the pie chart encodes the number of evaluations. Users can also click the serial numbers of questions to inspect their model responses in Response View or click the templates to inspect them in Instance View.

4.2.2 Model Response Inspection

Since the jailbreak assessment criteria can be question-specific and sometimes need to incorporate contextual knowledge. We design Inspection View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models $C$ ) to streamline the process of exploring the assessment results and iteratively refining the criteria. It visualizes the model responses in a scatter plot. We embed model responses based on OpenAI’s embedding model [3] and project them into a 2D space using the t-SNE algorithm [63] to maintain their semantic similarity. The color of the points encodes the categories of assessment results. This helps users identify questionable assessment results based on semantic similarity (e.g., inspect a red point that appears in a blue cluster).

To enable users to improve assessment accuracy, the Response View supports users in refining the assessment criteria in two ways. After identifying unexpected assessment results, users can directly correct their categories through click interaction or specify additional assessment criteria in natural language. Users can switch between these two refinement modes using the switch widget at the upper right corner. The corrected examples and specified criteria can be submitted to enhance the system prompts for a new round of assessment (Section 4.1).

4.3 Component Exploration

Based on the assessment results of the jailbreak templates, users can analyze and compare the effects of different prompt components in the right half of Summary View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models $B_{2}$ ). The left column summarizes the components of the selected jailbreak templates. It visualizes each prompt as a horizontal stacked bar chart, where each bar segment corresponds to a distinct component. As shown in Figure 2 $A$ , the bar segments are arranged in the order corresponding to the prompt components, with the color encoding the component type and the length indicating the token length. It helps users understand the general patterns of prompt components and serves as the baseline for prompt perturbation. Users can click the icon to view the template details in the Instance View and click the icon to generate a set of component perturbation results for comparative analysis.

For comparison, the perturbation results are visualized in the right column as horizontal stacked bars similar to the original templates. Each result is generated by applying a perturbation strategy to a component of the original template. To visualize this difference, we design three kinds of glyphs to represent these strategies and overlay them on the corresponding bar segments, as shown in Figure 2 $B$ . The system also automatically evaluates the jailbreak performance of these perturbation results based on the selected questions and visualizes the percentage of assessment results in pie charts, enabling users to compare the effects of different component perturbations. Users can toggle to check the perturbation results for a particular strategy using the radio box at the top of the column.

4.4 Keyword Exploration

The Keyword View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models $D$ ) visualizes the jailbreak performance and importance of keywords (Section 4.3). Specifically, as shown in Figure 3 $A$ , the jailbreak performance of keyword $k$ , \ie $performance(k)$ , is represented as the percentage of four categories of assessment results, denoted as $[n_{1},n_{2},n_{3},n_{4}]$ . Inspired by prior work [35], we introduce a square space with a coordinate system (Figure 3 $B$ ) whose four vertices correspond to the four categories, and their coordinates are denoted as $[c_{1},c_{2},c_{3},c_{4}]$ . Each corner is colored to indicate its category using the same color scheme as the pie charts in the Summary View. To visualize the overall performance distribution of the keyword, the coordinate of the keyword $k$ is computed as:

coordinate(k)=\sum_{i=1}^{4}n_{i}\times c_{i}

Besides, the size of the keywords encodes their importance in the corpus. It enables users to overview and compare the usage and performance of keywords based on their position and size. Users can filter keywords by component type and click the keywords to view their context.

Alternative Design. We have also considered an alternative design (Figure 3 $C$ ) where the keywords are visualized in four separate word clouds (each for one assessment category) and their sizes correspond to their importance to the prompts of this category, \ie $importance(k)\times n_{i},i\in[1,4]$ . According to the feedback from the experts, although this design enables users to focus on the keywords in the same category, it is inefficient for users to compare the size of the same keywords in all word clouds to estimate their overall performance distribution. Therefore, we chose our current design.

4.5 Template Instance Refinement

The Instance View (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models $E$ ) allows users to refine the templates and evaluate the performance. The Jailbreak Content panel (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models $E_{1}$ ) lists the prompt text of each component, which supports manual modifications or automatic perturbations. Users can click the icon at the right of the component title to delete, rephrase, or switch the components. Then, users can evaluate their jailbreak performance on the selected questions and inspect the evaluation results in the Jailbreak Results panel (\name: Visual Analysis of Jailbreak Attacks Against Large Language Models $E_{2}$ ). Each result item shows the question and model response and visualizes the color of the assessment result. This feedback can help the user evaluate the effectiveness of the modifications to verify the findings during the component and keyword analysis.

5 Evaluation

We conducted a case study, two technical evaluations, and six expert interviews to verify the effectiveness of the analysis framework and the usability of the visual analysis system.

5.1 Case Study

We invited the experts mentioned in Section 3.2 to use our system for jailbreak prompt analysis according to their exploratory interests. In this case study, the expert (E3) first evaluated the overall defense performance of GPT-3.5 on a jailbreak corpus and then dived into Character Role Play templates, one of the most common jailbreak strategies for an in-depth analysis of prompt characteristics.

Jailbreak Performance Evaluation (Figure 4 $A$ ). E3 uploaded a jailbreak prompt corpus in the Configuration View and selected some questions and templates for analysis. Considering the stochastic nature of model responses, E3 evaluated each question and template combination three times. After a period of waiting, the left part of Summary View visualized the performance of selected questions and templates in pie charts (Figure 4 $A_{1}$ ). E3 checked the evaluation results to ensure their accuracy. For example, He selected an Illegal Activities question IA(1) and explored its results in the Response View (Figure 4 $A_{2-1}$ ). There was a notable separation between the distributions of blue and red points, except for some outlier blue points situated within the red clusters. Therefore, he checked some of these points and found that their assessment results did not align with his expectations (Figure 4 $A_{2-2}$ ). He corrected these results and submitted them to refine the assessment criteria, and the Summary View and Response View (Figure 4 $A_{2-3}$ ) were updated accordingly. E3 checked the results again to verify their correctness. After that, he also explored other questions to verify their correctness or correct unexpected results. Based on the verified evaluation results (Figure 4 $A_{3}$ ), E3 found that nearly half of the jailbreak attacks were successful, indicating the target model was vulnerable. Besides, he also noticed that jailbreak performance usually depended more on templates than questions because the pie charts in the same row (corresponding to the same templates with different questions) usually showed similar percentage patterns of assessment results.

Prompt Characteristic Exploration (Figure 4 $B$ ). E3 wanted to explore the prompt characteristics of the Character Role Play category, one of the most common categories. The prompt component visualizations (Figure 4 $B_{1-1}$ ) showed that Subject Characteristic components (in blue) were commonly used and sometimes took up a large portion of the prompt length. E3 was curious whether the Subject Characteristic components were important to the jailbreak performance. Therefore, he chose a template with strong jailbreak performance and generated a set of prompt variations based on three perturbation strategies. According to the perturbation results, he found that deleting (Figure 4 $B_{1-2}$ ) or switching (Figure 4 $B_{1-3}$ ) the Subject Characteristic component resulted in a much more significant performance reduction than the other components, suggesting the Subject Characteristic component was crucial to the performance of this prompt. E3 also explored some other prompts and got similar findings. Then, he explored the important keywords in that component type in the Keyword View (Figure 4 $B_{2}$ ). From the keyword visualizations, he noticed that the keywords “AIM” and “DAN” were placed close to the “Full Refusal” corner, meaning that the model has been trained to be wary of these strategies and refuse to provide help. Near the “Full Compliance” corner, E3 found the keywords “disregards” and “controversial”, which correspond to an effective jailbreak strategy that encourages the model to disregard legal and ethical constraints and generate controversial content.

Jailbreak Template Refinement (Figure 4 $C$ ). To verify the effectiveness of these keywords, E3 selected a template with weak jailbreak performance and refined its Subject Characteristic component using the sentence of the identified keywords (Figure 4 $C_{1-1}$ ). Then, he evaluated this new template based on the selected questions, and the results (Figure 4 $C_{1-2}$ ) showed that more than half of the attacks were successful, suggesting a significant performance improvement. After several validations on other templates, E3 concluded that the strategy of these keywords in the Subject Characteristic component reflected a potential weakness in the target model that requires more safety training against them. Finally, he added these newly generated templates to the dataset to improve prompt diversity.

5.2 Technical Evaluations of LLM-based Methods

Jailbreak result assessment and prompt component classification are critical to the analysis workflow and rely heavily on the LLM capability and prompt design. Therefore, we conducted two technical evaluations to quantitatively measure the effectiveness of these methods. Since no recognized benchmark datasets were available for these two tasks, we drew inspiration from previous research [70] and collaborated with experts (E3 and E4) to build improvised datasets. We evaluated our methods and reported the results and some identified patterns that could affect the performance of the methods.

5.2.1 Jailbreak Result Assessment

In this task, we gathered model responses triggered by common jailbreak prompts, labeled the model responses in collaboration with experts, and evaluated the performance of our method.

Dataset. Since we follow the widely adopted principle [37, 53, 61] that decomposes the jailbreak prompts into questions and templates, we randomly selected 20 questions from five perilous question categories, such as Harmful Content and Illegal Activities, and 30 templates for each question to gather a collection of model responses. Next, we removed duplicate answers (e.g., “I’m sorry, I can’t assist with that request.”) and randomly selected 24 responses for each question, resulting in a total of 20*24=480 model responses. Then, we worked with the experts to manually categorize the model responses.

Methodology. We partitioned the dataset to evaluate (1) the effectiveness of automatic jailbreak assessment and (2) the helpfulness of user refinement of assessment criteria. Firstly, we randomly selected 20 responses for each question as the test set and employed our method to assess them based on the default criteria introduced in Section 4.1. Secondly, we enhanced the default criteria by incorporating the remaining responses (four per question, each annotated with a label) as demonstration examples, and by enabling experts to specify additional criteria, thereby simulating realistic scenarios of analysis workflow. For these two criteria, we separately measured the accuracy of the jailbreak assessment on the test set.

Result. Overall, our method achieved 80.25% accuracy using the default criteria and 90.25% with the refined question-specific criteria. The results showed that our method was effective in supporting jailbreak result assessment. We also visualize the distribution of the assessment accuracy on each question in Figure 5. We found that specifying question-specific criteria could enhance the assessment accuracy across a majority of questions. For example, sometimes the LLM failed to identify responses containing adult content (e.g., advertising and marketing prostitution) as prohibited content, which was not aligned with OpenAI’s disallowed usages [4, 37]. By providing demonstration examples and specifying additional assessment criteria, our method could better identify such prohibited content.

5.2.2 Prompt Component Classification

Based on the prompt component corpus developed in collaboration with the experts (as detailed in Section 3.3), we constructed a dataset to evaluate our component classification method.

Dataset and Methodology. From the corpus, we randomly selected 50 prompts that contain 841 prompt sentences in total. Then, we employed our method to classify each sentence based on the component taxonomy (Table 1). Finally, we measured the classification accuracy of the component types of sentences.

Result. Overall, our method achieved 80.26% accuracy on the component classification task. We also visualize the classification accuracy of each component type in a confusion matrix in Table 3. Our method yielded adequate performance in most of the component categories. However, we noticed that sometimes our method incorrectly categorized the Scene Introduction components as Subject Characteristic type. To reveal the patterns behind these failures, we further analyzed these unexpected results and found that the majority of such sentences often depicted the scene (e.g., a fictional world without moral constraints) that could imply the subject characteristics, potentially leading to confusion in LLM. A feasible solution would be to further clarify the component definition and provide demonstration examples to help our method better distinguish their differences.

Table 3: The confusion matrix for the classification accuracy of our component classification method across component types.

		Classification Result
		SI	SC	TO	OS	OC	DE	RR	QP
	Scene Introduction (SI)	\cellcolor[HTML]686868 0.61	\cellcolor[HTML]C6C6C60.23	\cellcolor[HTML]ECECEC0.08	\cellcolor[HTML]F6F6F60.04	\cellcolor[HTML]F8F8F80.03	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FBFBFB0.02	\cellcolor[HTML]FFFFFF0.00
	Subject Characteristic (SC)	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]2E2E2E 0.88	\cellcolor[HTML]FBFBFB0.02	\cellcolor[HTML]FDFDFD0.01	\cellcolor[HTML]E9E9E90.09	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00
	Task Operation (TO)	\cellcolor[HTML]FBFBFB0.02	\cellcolor[HTML]E0E0E00.12	\cellcolor[HTML]535353 0.71	\cellcolor[HTML]F5F5F50.04	\cellcolor[HTML]EBEBEB0.08	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FBFBFB0.02	\cellcolor[HTML]FDFDFD0.01
	Output Specification (OS)	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]F2F2F20.05	\cellcolor[HTML]F9F9F90.03	\cellcolor[HTML]4A4A4A 0.75	\cellcolor[HTML]F5F5F50.04	\cellcolor[HTML]E0E0E00.12	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00
	Output Constraint (OC)	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]F3F3F30.05	\cellcolor[HTML]F7F7F70.03	\cellcolor[HTML]F7F7F70.03	\cellcolor[HTML]2E2E2E 0.88	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00
	Dialogue Example (DE)	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]E0E0E00.13	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]2F2F2F 0.88	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00
	Requirement Reminder (RR)	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]E4E4E40.11	\cellcolor[HTML]EAEAEA0.09	\cellcolor[HTML]FAFAFA0.02	\cellcolor[HTML]DFDFDF0.13	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]5E5E5E 0.66	\cellcolor[HTML]FFFFFF0.00
Ground Truth	Question Placeholder (QP)	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FAFAFA0.02	\cellcolor[HTML]F5F5F50.04	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FFFFFF0.00	\cellcolor[HTML]FAFAFA0.02	\cellcolor[HTML]262626 0.92

5.3 Expert Interview

We interviewed six external experts (E5-E10) to evaluate the effectiveness of the analysis framework and the usability of the visual system. E5 is a model security engineer from a technical company who has been working on the secure reasoning of LLMs for more than one year and on network security (situation awareness) for over three years. E6-E10 are senior researchers from related fields, including model security, trustworthy AI, and deep learning model training. Among them, E7 has accumulated eight years of experience in data and model security before she focused on LLM jailbreak attacks. Each expert interview lasted about 90 minutes. We first briefly introduced the background and motivation of our study. Then, we described the analysis framework and visual system and demonstrated the system workflow using the case study. After that, we invited the experts to explore our system to analyze the performance and characteristics of the jailbreak prompts. Finally, we conducted semi-structured interviews with experts to collect their feedback about the analysis framework, visualization and interaction, and improvement suggestions.

Effectiveness of the Analysis Framework. All the experts agreed that our framework facilitated jailbreak performance evaluation and prompt characteristic understanding, and its workflow made sense. E5 praised this framework as “it provided a more comprehensive and systematic evaluation for jailbreak attacks compared to existing tools.” E5 appreciated the capability of the jailbreak assessment method to support user-specified criteria, which enhanced its flexibility to support customized criteria for different user values. However, we also observed one case during the exploration of E9 where the user-corrected examples did not improve the accuracy of a new round of assessment. E9 suggested recommending some representative model responses for user correction feedback to better leverage the in-context learning of the LLMs to distinguish between them. The jailbreak component analysis was highly appreciated, being described as “interesting” (E10), “impressive” (E6), and “inspiring” (E7). It was valued for “offering a new perspective to study the prompt patterns in the black box scenarios” (E10) and for “guiding user effort towards the critical parts of the prompts” (E9). The experts confirmed that keyword analysis helped understand prompt characteristics. However, E5 noted that it would be less effective when analyzing only a few prompts and suggested incorporating an external corpus of suspicious keywords to improve its effectiveness. Finally, all experts agreed that our analysis framework provided valuable insights for model security enhancement.

Visualization and Interactions. Overall, the experts agreed that the system views were well-designed, and the visual design and interaction were intuitive. The Summary View was popular among the experts as it supported the analysis and comparison of jailbreak performance from both the perspectives of questions and templates. All experts confirmed the helpfulness of Response View in exploring model responses and identifying unexpected results. E7 thought that component visualization required some learning costs but became easy to use and remember after she became familiar with visual encoding. We also asked the experts for their opinions about the color scheme of the prompt components, and the experts confirmed that it was “clear” (E8) and “easy to distinguish” (E10). The Keyword View was observed to be frequently used by the experts during the exploration. We noticed that they usually focused more on the keywords near the “Full Compliance” corner to identify effective keywords from successful attacks.

Suggestions for Improvement. In addition to the limitations mentioned above, we have also collected some suggestions for improvement. For Response View, E9 suggested adding some textual annotations (e.g., keywords) near the points or clusters to summarize the semantic information of the corresponding responses, which could help users identify the potential incorrect assessment results more efficiently. For component analysis, E5 suggested that providing a textual or visual summary for the comparative analysis of the component perturbations could better help users identify effective components.

6 Discussion

In this section, we distill some design implications from expert interviews to inspire future research. We also discuss the system’s generalizability, limitations, and future work.

6.1 Design Implications

Toward a more comprehensive assessment of jailbreak results. The experts appreciate the introduction of jailbreak assessment taxonomy [74] to resolve ambiguities and identify harmful jailbreak results, as well as the application of the LLM-based assessment method to improve efficiency. They also suggest extending them to include more assessment dimensions. For instance, responses that offer elaborate and expert guidance on illegal activities may pose a greater risk than those providing generic and ambiguous advice. Therefore, assessing the helpfulness of the jailbreak results can help model practitioners prioritize identifying and preventing these harmful results. Future research can explore broadening the spectrum of assessment dimensions (e.g., helpfulness) and improving the assessment accuracy of these dimensions (e.g., incorporating question-related knowledge bases) to mitigate the harm of the jailbreak results.

Improve the learning-based jailbreak prompt construction. The development of learning-based methodologies [78, 13] for the construction of jailbreak prompts represents a significant opportunity to enhance the efficiency of red-teaming LLMs. However, this endeavor faces challenges due to the intricate nature of the prompt design to guarantee effectiveness [15]. The experts highlight that our analysis framework can inspire the research of learning-based methods to improve their effectiveness and generation diversity. For example, the paired jailbreak prompts and their perturbation results (with stronger jailbreak performance) can be used to train generative models for rewriting jailbreak prompts, so that the models can easily capture the difference between their prompt components and learn how to effectively improve the prompt performance. Furthermore, the component analysis paves the way for the integration of expert knowledge into automatic jailbreak prompt generation. For example, it allows the experts to specify the kernel of the prompts in the Scene Introduction or Subject Characteristic components to guide the generation of the following content (e.g., Task Operation or Output Specification).

Balance between the objectives of safety and instruction-following. The objectives of safety and instruction-following are usually competitive [66], particularly in the context of mitigating jailbreak attacks. It has been widely recognized that over-strengthening the model’s security defenses on large jailbreak corpora will inevitably compromise the model’s ability to follow user instructions, leading to “overkill” issues [54]. Balancing between these two objectives has become one of the most challenging problems for LLM training. The experts point out that our component analysis provides a potential solution to construct a customized and condensed jailbreak dataset based on the vulnerabilities of LLMs rather than relying on large jailbreak corpora, thereby effectively addressing the model’s major security weaknesses without sacrificing the instruction-following capabilities. Looking ahead, future works can further explore how to help model practitioners analyze and trade-off between these two objectives.

Support jailbreak performance comparisons on multiple models. While our analysis framework and visual system facilitate a systematic analysis of jailbreak attacks, the experts express interest in the comparative evaluation of such attacks across various models, which promises significant benefits across multiple application scenarios. For example, it can help LLM vendors benchmark their models against those of competitors, identifying relative advantages and shortcomings. Similarly, it can assist model practitioners in comparing different iterations of models to evaluate the improvements attributed to safety training. Our system can be extended with comparative visualizations [42, 22] to provide clear and insightful comparisons across models.

Fine-tune the jailbreak questions to bypass the safety measurements. Consistent with most previous research aimed at refining the jailbreak templates to enhance the attack performance, our system is primarily designed to evaluate the jailbreak prompts and analyze the characteristics of the jailbreak templates. Recent studies [76, 30] suggest that rewriting the jailbreak questions directly helps to circumvent the model’s security mechanisms. This could be achieved, for instance, by transforming the malicious questions into experimental or pedagogical questions, or by deconstructing the tasks into smaller, less harmful operation steps. Therefore, it becomes crucial to analyze the semantic characteristics of the jailbreak questions and to summarize their common strategies.

6.2 Generalizability

\name

is designed to analyze jailbreak attacks, one of the most common prompt attacks against LLMs. We demonstrate the system’s effectiveness through a case study evaluating the jailbreak performance on GPT-3.5. The system can be easily generalized to other language models (e.g., Llama 2 [62] and ChatGLM [16]) due to its model-agnostic design that centered around the prompt input and model responses. Moreover, the analysis workflow of \name can potentially support other prompt attack scenarios, such as prompt injection [52, 47] and backdoor attacks [23, 72]. For prompt injection, the system can help users evaluate the performance of hijack attacks based on the LLM-based assessment method and explore the assessment results to verify or improve their accuracy. For backdoor attacks, the system can help users identify suspicious backdoor triggers (e.g., sentences or keywords) through component and keyword analysis.

6.3 Limitations and Future Work

Incorporate more perturbation strategies for component analysis. Currently, our analysis framework and visual system support three kinds of perturbation strategies (\iedeletion, rephrasing, and switching) to help users understand the effect of prompt components on jailbreak performance. They can be extended to support more strategies, such as inserting and crossover [74]. The system can insert the recognized important components into other prompts to verify their effectiveness. The crossover strategy can combine the strengths of two prompts to create more prompt variations. Supporting these strategies enables users to conduct a more comprehensive analysis of prompt components.

Extend our analysis to more large language models. Our system has proven to be effective in supporting the security analysis of OpenAI’s LLMs against jailbreak attacks. Moving forward, we plan to integrate a broader range of mainstream LLMs into our system, such as Llama 2 [62] and ChatGLM [16]. Our goal is to evaluate the defense capability of these advanced models against jailbreak attacks and identify their potential weaknesses. We will communicate our findings gained from the analysis with the vendors of these LLMs and provide them with valuable insights for enhancing the model safety mechanisms, ensuring a safer deployment of LLM technologies.

Explore multi-modal jailbreak attacks. The vulnerabilities of multi-modal large language models (MLLMs), such as LLava [33] and GPT-4V [1], have attracted increased attention [9, 51, 48]. MLLMs are more sensitive to jailbreak prompts with textual triggers, OCR textual triggers, and visual triggers, which presents greater safety risks compared to LLMs. Our work primarily investigates the textual vulnerabilities of LLMs without extending to the scenario of multi-modal jailbreak attacks. In the future, we aim to bridge this gap by incorporating multi-modal analysis into our visual analysis framework, thereby enhancing the robustness of MLLMs against such threats.

7 Conclusion

We present a novel LLM-assisted analysis framework coupled with a visual analysis system \name to help model practitioners analyze the jailbreak attacks against LLMs. The analysis framework provides a jailbreak result assessment method to evaluate jailbreak performance and supports an in-depth analysis of jailbreak prompt characteristics from component and keyword aspects. The visual system allows users to explore the evaluation results, identify important prompt components and keywords, and verify their effectiveness. A case study, two technical evaluations, and expert interviews show the effectiveness of the analysis framework and visual system. Besides, we distill a set of design implications to inspire future research.

References

[1] GPT-4V(ision) System Card. https://openai.com/research/gpt-4v-system-card. Accessed on March 1st, 2024.
[2] JailbreakChat Website. http://jailbreakchat.com/. Accessed on October 1st, 2023.
[3] OpenAI’s Embedding Models. https://platform.openai.com/docs/guides/embeddings. Accessed on March 1st, 2024.
[4] OpenAI’s Usage Policies. https://openai.com/policies/usage-policies. Accessed on October 1st, 2023.
[5] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv, 2024. doi: 10 . 48550/ARXIV . 2303 . 08774
[6] T. Angert, M. Suzara, J. Han, C. Pondoc, and H. Subramonyam. Spellburst: A node-based interface for exploratory creative coding with natural language prompts. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023. doi: 10 . 1145/3586183 . 3606719
[7] A. Boggust, B. Hoover, A. Satyanarayan, and H. Strobelt. Shared interest: Measuring human-ai alignment to identify recurring patterns in model behavior. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–17, 2022. doi: 10 . 1145/3491102 . 3501965
[8] S. Brade, B. Wang, M. Sousa, S. Oore, and T. Grossman. Promptify: Text-to-image generation through interactive prompt exploration with large language models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–14, 2023. doi: 10 . 1145/3586183 . 3606725
[9] N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, P. W. W. Koh, D. Ippolito, F. Tramer, and L. Schmidt. Are aligned neural networks adversarially aligned? In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds., Advances in Neural Information Processing Systems, vol. 36, pp. 61478–61500. Curran Associates, Inc., 2023.
[10] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 08419
[11] F. Cheng, Y. Ming, and H. Qu. Dece: Decision explorer with counterfactual explanations for machine learning models. IEEE Transactions on Visualization and Computer Graphics, 27(2):1438–1447, 2021. doi: 10 . 1109/TVCG . 2020 . 3030342
[12] J. Chu, Y. Liu, Z. Yang, X. Shen, M. Backes, and Y. Zhang. Comprehensive assessment of jailbreak attacks against llms. arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 05668
[13] G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, and Y. Liu. Masterkey: Automated jailbreaking of large language model chatbots. arXiv, 2024. doi: 10 . 14722/ndss . 2024 . 24188
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018.
[15] P. Ding, J. Kuang, D. Ma, X. Cao, Y. Xian, J. Chen, and S. Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. arXiv, 2024. doi: 10 . 48550/ARXIV . 2311 . 08268
[16] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
[17] N. Feldhus, A. M. Ravichandran, and S. Möller. Mediators: Conversational agents explaining nlp model behavior. arXiv, 2022. doi: 10 . 48550/ARXIV . 2206 . 06029
[18] Y. Feng, J. Chen, K. Huang, J. K. Wong, H. Ye, W. Zhang, R. Zhu, X. Luo, and W. Chen. ipoet: interactive painting poetry creation with visual multimodal analysis. Journal of Visualization, pp. 1–15, 2022. doi: 10 . 1007/S12650-021-00780-0
[19] Y. Feng, X. Wang, B. Pan, K. K. Wong, Y. Ren, S. Liu, Z. Yan, Y. Ma, H. Qu, and W. Chen. Xnli: Explaining and diagnosing nli-based visual data analysis. IEEE Transactions on Visualization and Computer Graphics, pp. 1–14, 2023. doi: 10 . 1109/TVCG . 2023 . 3240003
[20] Y. Feng, X. Wang, K. K. Wong, S. Wang, Y. Lu, M. Zhu, B. Wang, and W. Chen. Promptmagician: Interactive prompt engineering for text-to-image creation. IEEE Transactions on Visualization and Computer Graphics, 30(1):295–305, 2023. doi: 10 . 1109/TVCG . 2023 . 3327168
[21] L. Gao, Z. Shao, Z. Luo, H. Hu, C. Turkay, and S. Chen. Transforlearn: Interactive visual tutorial for the transformer model. IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 1109/TVCG . 2023 . 3327353
[22] W. He, J. Wang, H. Guo, H.-W. Shen, and T. Peterka. Cecav-dnn: Collective ensemble comparison and visualization using deep neural networks. Visual Informatics, 4(2):109–121, 2020. doi: 10 . 1016/J . VISINF . 2020 . 04 . 004
[23] H. Huang, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang. Composite backdoor attacks against large language models. arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 07676
[24] P. Jiang, J. Rayan, S. P. Dow, and H. Xia. Graphologue: Exploring large language model responses with interactive diagrams. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–20, 2023. doi: 10 . 1145/3586183 . 3606737
[25] H. **, R. Chen, J. Chen, and H. Wang. Quack: Automatic jailbreaking large language models via role-playing. 2023.
[26] D. Kang, X. Li, I. Stoica, C. Guestrin, M. Zaharia, and T. Hashimoto. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv, 2023. doi: 10 . 48550/ARXIV . 2302 . 05733
[27] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv, 2015. doi: 10 . 48550/ARXIV . 1506 . 02078
[28] H. Li, D. Guo, W. Fan, M. Xu, and Y. Song. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
[29] J. Li, X. Chen, E. Hovy, and D. Jurafsky. Visualizing and understanding neural models in nlp. arXiv preprint arXiv:1506.01066, 2016. doi: 10 . 18653/V1/N16-1082
[30] X. Li, S. Liang, J. Zhang, H. Fang, A. Liu, and E.-C. Chang. Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms. arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 14872
[31] Z. Li, X. Wang, W. Yang, J. Wu, Z. Zhang, Z. Liu, M. Sun, H. Zhang, and S. Liu. A unified understanding of deep nlp models for text classification. IEEE Transactions on Visualization and Computer Graphics, 28(12):4980–4994, 2022. doi: 10 . 1109/TVCG . 2022 . 3184186
[32] P. P. Liang, Y. Lyu, G. Chhablani, N. Jain, Z. Deng, X. Wang, L.-P. Morency, and R. Salakhutdinov. Multiviz: Towards visualizing and understanding multimodal models. arXiv, 2022.
[33] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds., Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916. Curran Associates, Inc., 2023.
[34] M. X. Liu, A. Sarkar, C. Negreanu, B. Zorn, J. Williams, N. Toronto, and A. D. Gordon. “what it wants me to say”: Bridging the abstraction gap between end-user programmers and code-generating large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–31, 2023. doi: 10 . 1145/3544548 . 3580817
[35] S. Liu, Z. Li, T. Li, V. Srikumar, V. Pascucci, and P.-T. Bremer. Nlize: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models. IEEE Transactions on Visualization and Computer Graphics, 25(1):651–660, 2019. doi: 10 . 1109/TVCG . 2018 . 2865230
[36] X. Liu, N. Xu, M. Chen, and C. Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv, 2024. doi: 10 . 48550/ARXIV . 2310 . 04451
[37] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv, 2024. doi: 10 . 48550/ARXIV . 2305 . 13860
[38] Y. Liu, Z. Wen, L. Weng, O. Woodman, Y. Yang, and W. Chen. Sprout: Authoring programming tutorials with interactive visualization of large language model generation process. arXiv, 2023. doi: 10 . 48550/ARXIV . 2312 . 01801
[39] J. Lu, B. Pan, J. Chen, Y. Feng, J. Hu, Y. Peng, and W. Chen. Agentlens: Visual analysis for agent behaviors in llm-based autonomous systems. arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 08995
[40] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
[41] Y. Ma, T. Xie, J. Li, and R. Maciejewski. Explaining vulnerabilities to adversarial machine learning through visual analytics. IEEE Transactions on Visualization and Computer Graphics, 26(1):1075–1085, 2020. doi: 10 . 1109/TVCG . 2019 . 2934631
[42] M. M. Malik, C. Heinzl, and M. E. Groeller. Comparative visualization for parameter studies of dataset series. IEEE Transactions on Visualization and Computer Graphics, 16(5):829–840, 2010. doi: 10 . 1109/TVCG . 2010 . 20
[43] M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 04249
[44] Y. Ming, S. Cao, R. Zhang, Z. Li, Y. Chen, Y. Song, and H. Qu. Understanding hidden memories of recurrent neural networks. In IEEE VAST, pp. 13–24. IEEE, 2017. doi: 10 . 1109/VAST . 2017 . 8585721
[45] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[46] Z. Peng, X. Wang, Q. Han, J. Zhu, X. Ma, and H. Qu. Storyfier: Exploring vocabulary learning support with text generation models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–16, 2023. doi: 10 . 1145/3586183 . 3606786
[47] F. Perez and I. Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv, 2022. doi: 10 . 48550/ARXIV . 2211 . 09527
[48] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal. Visual adversarial examples jailbreak aligned large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21527–21536, Mar. 2024. doi: 10 . 1609/aaai . v38i19 . 30150
[49] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118, 2020.
[50] Z. Shao, S. Sun, Y. Zhao, S. Wang, Z. Wei, T. Gui, C. Turkay, and S. Chen. Visual explanation for open-domain question answering with bert. IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 1109/TVCG . 2023 . 3243676
[51] E. Shayegani, Y. Dong, and N. Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations, 2024.
[52] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, and N. Abu-Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 10844
[53] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv, 2023. doi: 10 . 48550/ARXIV . 2308 . 03825
[54] C. Shi, X. Wang, Q. Ge, S. Gao, X. Yang, T. Gui, Q. Zhang, X. Huang, X. Zhao, and D. Lin. Navigating the overkill in large language models. arXiv, 2024. doi: 10 . 48550/ARXIV . 2401 . 17633
[55] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 60(5):493–502, 2004. doi: 10 . 1108/00220410410560573
[56] R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI conference on artificial intelligence, vol. 31, 2017. doi: 10 . 1609/AAAI . V31I1 . 11164
[57] T. Spinner, U. Schlegel, H. Schäfer, and M. El-Assady. explainer: A visual analytics framework for interactive and explainable machine learning. IEEE Transactions on Visualization and Computer Graphics, 26(1):1064–1074, 2020. doi: 10 . 1109/TVCG . 2019 . 2934629
[58] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676, 2018. doi: 10 . 1109/TVCG . 2017 . 2744158
[59] H. Strobelt, A. Webson, V. Sanh, B. Hoover, J. Beyer, H. Pfister, and A. M. Rush. Interactive and visual prompt engineering for ad-hoc task adaptation with large language models. IEEE Transactions on Visualization and Computer Graphics, 29(1):1146–1156, 2023. doi: 10 . 1109/TVCG . 2022 . 3209479
[60] S. Suh, B. Min, S. Palani, and H. Xia. Sensecape: Enabling multilevel exploration and sensemaking with large language models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–18, 2023. doi: 10 . 1145/3586183 . 3606756
[61] H. Sun, Z. Zhang, J. Deng, J. Cheng, and M. Huang. Safety assessment of chinese large language models. arXiv, 2023. doi: 10 . 48550/ARXIV . 2304 . 10436
[62] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv, 2023. doi: 10 . 48550/ARXIV . 2302 . 13971
[63] L. Van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(86):2579–2605, 2008.
[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[65] X. Wang, R. Huang, Z. **, T. Fang, and H. Qu. Commonsensevis: Visualizing and understanding commonsense reasoning capabilities of natural language models. IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 48550/ARXIV . 2307 . 12382
[66] A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
[67] L. Weng, X. Wang, J. Lu, Y. Feng, Y. Liu, and W. Chen. Insightlens: Discovering and exploring insights from conversational contexts in large-language-model-powered data analysis. arXiv, 2024. doi: 10 . 48550/ARXIV . 2404 . 01644
[68] L. Weng, M. Zhu, K. K. Wong, S. Liu, J. Sun, H. Zhu, D. Han, and W. Chen. Towards an understanding and explanation for mixed-initiative artificial scientific text detection. arXiv, 2023. doi: 10 . 48550/ARXIV . 2304 . 05011
[69] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson. The what-if tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 26(1):56–65, 2020. doi: 10 . 1109/TVCG . 2019 . 2934619
[70] K. K. Wong, X. Wang, Y. Wang, J. He, R. Zhang, and H. Qu. Anchorage: Visual analysis of satisfaction in customer service videos via anchor events. IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 48550/ARXIV . 2302 . 06806
[71] T. Xie, F. Zhou, Z. Cheng, P. Shi, L. Weng, Y. Liu, T. J. Hua, J. Zhao, Q. Liu, C. Liu, et al. Openagents: An open platform for language agents in the wild. arXiv, 2023. doi: 10 . 48550/ARXIV . 2310 . 10634
[72] W. Yang, X. Bi, Y. Lin, S. Chen, J. Zhou, and X. Sun. Watch out for your agents! investigating backdoor threats to llm-based agents. arXiv, 2024. doi: 10 . 48550/ARXIV . 2402 . 11208
[73] C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, and M. Wattenberg. Attentionviz: A global view of transformer attention. IEEE Transactions on Visualization and Computer Graphics, 2023. doi: 10 . 1109/TVCG . 2023 . 3327163
[74] J. Yu, X. Lin, and X. Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv, 2023. doi: 10 . 48550/ARXIV . 2309 . 10253
[75] Y. Yuan, W. Jiao, W. Wang, J.-t. Huang, P. He, S. Shi, and Z. Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv, 2024. doi: 10 . 48550/ARXIV . 2308 . 06463
[76] Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv, 2024. doi: 10 . 48550/ARXIV . 2401 . 06373
[77] T. Y. Zhuo, Y. Huang, C. Chen, and Z. Xing. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv, pp. 12–2, 2023. doi: 10 . 48550/ARXIV . 2301 . 12867
[78] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv, 2023. doi: 10 . 48550/ARXIV . 2307 . 15043