S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

Xiaohan Yuan Zhejiang UniversityHangzhouChina [email protected] **feng Li Alibaba GroupHangzhouChina [email protected] Dongxia Wang 🖂 Zhejiang UniversityHangzhouChina [email protected] Yuefeng Chen Alibaba GroupHangzhouChina [email protected] Xiaofeng Mao Alibaba GroupHangzhouChina [email protected] Longtao Huang Alibaba GroupHangzhouChina [email protected] Hui Xue Alibaba GroupHangzhouChina [email protected] Wenhai Wang Zhejiang UniversityHangzhouChina [email protected] Kui Ren Zhejiang UniversityHangzhouChina [email protected]  and  **gyi Wang Zhejiang UniversityHangzhouChina [email protected]
Abstract.

Large Language Models (LLMs) have gained considerable attention for their revolutionary capabilities. However, there is also growing concern on their safety implications as the outputs generated by LLMs may contain various kinds of harmful contents, making a comprehensive safety evaluation111We use ‘evaluation’ and ‘assessment’ interchangeably. for LLMs urgently needed before model deployment. Existing safety evaluation benchmarks still suffer from the following limitations: 1) the lack of a unified risk taxonomy makes it challenging to systematically categorize, evaluate and be aware of different types of risks, 2) the weak riskiness limits the capacity to sincerely reflect the safety of LLMs effectively, and 3) the lack of automation in terms of test prompts generation, selection, and output riskiness evaluation.

To address these critical challenges, we propose S-Eval, a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark for LLMs. At the core of S-Eval is a novel LLM-based automatic test prompt generation and selection framework, which trains an expert testing LLM tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to support various test prompt generation tasks combined with a range of test selection strategies to automatically construct a high-quality test suite (including base risk prompts and attack prompts) for the safety evaluation. The key to the automation of this process is a novel expert safety-critique LLM csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT able to quantify the riskiness score of an LLM’s response, and additionally produce risk tags and explanations for better risk awareness. Besides, the generation process is also guided by a carefully designed risk taxonomy with four different levels, covering comprehensive and multi-dimensional safety risks of concern. Based on the proposed risk taxonomy and the test prompt generation/selection framework, we systematically construct a new and large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts, including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and accompanied safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks and models for updating the benchmark. S-Eval is extensively evaluated on 20 popular and representative LLMs. The results confirm that S-Eval can better reflect and inform the safety risks of LLMs compared to existing benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs and offering insights on the safety situation of mainstream LLMs in the market. Our benchmark and experimental data are both released at (Anonymous, 2024) to facilitate and benchmark future research in this critical direction.

Large Language Models, Safety Assessment, Test Generation, Test Selection, Benchmark

1. Introduction

In recent years, Large Language Models (LLMs) have emerged as a prominent research focus across various domains due to their revolutionary capabilities. With the expansion of training data and model parameters, the emergent abilities of LLMs (Wei et al., 2022) are increasingly obvious, thus promoting their applications in diverse downstream tasks. More and more LLMs, such as ChatGPT (OpenAI, 2022), Claude (Anthropic, 2023), Gemini (Team et al., 2023), ErnieBot (Inc, 2023), LLaMA (Touvron et al., 2023a) and Qwen (Bai et al., 2023) and so on, are widely adopted in finance (Son et al., 2023), medicine (Tang et al., 2023), education (Kamalov et al., 2023) and law (Blair-Stanek et al., 2023) and many other domains.

However, there are also significant concerns regarding the safety implications of LLMs with their rapid adoption in different applications, since LLMs are often trained on massive textual data, which lack appropriate supervision and may contain harmful contents, such as illegal advice (Durkin, 1997), offensiveness, hate speech, insults (Gehman et al., 2020), bias and discrimination (Sheng et al., 2021). These rooted safety issues in the training data make it inevitable for the resulting LLMs to generate contents inconsistent with human values and pose potential risks of misuse. Therefore, a comprehensive multidimensional safety evaluation for LLMs is imperative before their deployment.

Currently, researchers have attempted to design some safety evaluation benchmarks covering either specific safety concerns (Gehman et al., 2020; Parrish et al., 2021; Hendrycks et al., 2021) or multiple risk dimensions (Liang et al., 2022; Wang et al., 2024; Ganguli et al., 2022; Zou et al., 2023; Sun et al., 2023; Zhang et al., 2023b; Wang et al., 2023; Xu et al., 2023; Huang et al., 2023b; Li et al., 2024). However, existing benchmarks still suffer from several significant limitations. First, the risk taxonomies of existing safety benchmarks are loose without a unified risk taxonomy paradigm. Consequently, the coarse-grained evaluation results can only reflect a (small) portion of the safety risks of LLMs, failing to comprehensively evaluate fine-grained safety situation of LLMs on the subdivided risk dimensions. Second, existing benchmarks have weak riskiness which limits its capability to sincerely reflect the safety of LLMs effectively (evidenced by our empirical results). For instance, some benchmarks (Hendrycks et al., 2021; Zhang et al., 2023b; Parrish et al., 2021) are only evaluated with multiple-choice questions (due to the lack of a test oracle), which is inconsistent with the real-world user case and limits the risks that may arise in responses, thus cannot reflect an LLM’s real safety levels. Other benchmarks like (Huang et al., 2023b; Sun et al., 2023; Li et al., 2024) only consider some backward and incomplete instruction attack methods, failing to picture the safety of LLMs under more various adversarial attack scenarios. Third, construction of existing benchmarks often lacks automation in terms of test prompts generation, selection and output riskiness evaluation requiring numerous human labor, which hinders its effective adaptability to quickly evolving LLM and accompanied safety threats.

Refer to caption

Figure 1. Our four-level risk taxonomy. We only display the first-level risk dimensions and second-level risk categories.

In this work, we propose S-Eval, a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark to systematically address the above limitations. Firstly, we design a comprehensive and unified risk taxonomy with four hierarchical levels consisting of eight risk dimensions, 25 risk categories, 56 risk subcategories, and 52 risk sub-subcategories, as shown in Figure 1 (the complete details can be found at (Anonymous, 2024)). The risk taxonomy aims to cover all the necessary dimensions of the safety evaluation and reflect the varying safety levels of the LLMs on the subdivided risk dimensions. Secondly, to automatically construct a test suite for safety evaluation, we propose a novel LLM-based automatic test prompts generation and selection framework, as shown in Figure 2. Specifically, we train an expert testing LLM tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT that supports various test prompts generation tasks with configurable risks of interest and test generation methods, combined with a range of test selection strategies for quality control to construct a high-quality benchmarking safety test suite. Note that the above test generation and selection are empowered by a novel expert safety-critique LLM csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. This safety-critique model is trained by supervised fine-tuning using a carefully crafted dataset. In addition to serve as a test oracle by providing a risk score for an LLM’s response to a test prompt, the model is also designed to output the risk tags, scores, and explanations for better risk awareness to the decision makers. So far, S-Eval has (and is still in active expansion) automatically generated 220,000 high-quality test prompts for safety evaluation, including 20,000 base risk prompts222The base risk prompts are risky ones intended to trigger harmful output of the LLMs. (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts. Thirdly, considering the rapid evolution of LLMs and accompanied safety threats, we design S-Eval to be flexibly configured and adapted to include new risks, attacks and models for updating the benchmark. We extensively evaluate S-Eval with 20 popular and mainstream LLMs that cover both open-source and closed-source LLMs with different organizations, model scales and languages. The results confirm that S-Eval can better reflect and inform the safety risks awareness of LLMs compared to existing safety benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, resulting in a systematic methodology for evaluating the safety of LLMs.

Refer to caption

Figure 2. Our proposed LLM-based automatic test generation and selection framework. ”BRP” stands for Base Risk Prompt and ”AP” stands for Attack Prompt.

In summary, we make the following contributions:

  • We design a new unified risk taxonomy, consisting of four hierarchical levels, covering broad risk characterization and being able to reflect the safety levels of LLMs on subdivided risk dimensions. When new safety risks emerge, the risk dimensions can be expanded, updated and easily configured to guide the generation of more test prompts.

  • We propose a novel LLM-based automatic test generation and selection framework to generate base risk prompts and attack prompts. Considering the rapid evolution of safety threats and LLMs, we design the framework to be flexibly configured and adapted to include new risks, attacks and models for continuously updating the benchmark.

  • We propose to train an expert safety-critique LLM to not only serve as a test oracle by quantifying the riskiness score of an LLM’s response to a test prompt, but also designed to output the risk tags and explanations for better risk awareness to decision makers.

  • We release a comprehensive, multi-dimensional, and open-ended safety evaluation benchmark, consisting of 220,000 prompts, including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts generated by 10 advanced instruction attacks to comprehensively evaluate the safety levels of LLMs in both conventional and adversarial attack scenarios.

  • We extensively evaluate 20 popular LLMs. The results confirm that S-Eval can better reflect the safety level of LLMs compared to existing safety benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.

2. Preliminary

2.1. Large Language Models

Large Language Models (LLMs) are advanced deep learning models. Currently, most LLMs are built based upon the Transformer architecture (Vaswani et al., 2017), and they are trained on massive textual corpora with a large number of parameters to effectively understand and generate natural language text. LLMs are widely used in various natural language processing tasks such as text translation (Yang et al., 2023a), sentiment analysis (Zhang et al., 2023a), and text summarization (Van Veen et al., 2024).

A common method of interacting with LLMs is prompt engineering (Liu et al., 2023c; White et al., 2023), in which users guide LLMs to generate desired responses or complete specific tasks through well-designed prompt text. Prompts are critical to the quality of the output of LLMs, and small changes to the prompt result in large performance variations (Liu et al., 2023b; Shin et al., 2020).

2.2. Safety Evaluation Problem

In safety evaluation task, given an LLM \mathcal{M}caligraphic_M, we need an evaluation prompt set 𝐏={p1,p2,,pn}𝐏subscript𝑝1subscript𝑝2subscript𝑝𝑛\mathbf{P}=\{p_{1},p_{2},\cdots,p_{n}\}bold_P = { italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }, and an safety evaluation model 𝒥(){0,1}𝒥01\mathcal{J}(\cdot)\in\{0,1\}caligraphic_J ( ⋅ ) ∈ { 0 , 1 } to judge whether a harmful response is triggered. Let risubscript𝑟𝑖r_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the response of \mathcal{M}caligraphic_M to the prompt pi𝐏subscript𝑝𝑖𝐏p_{i}\in\mathbf{P}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_P, which is considered harmful when 𝒥(pi,ri)=0𝒥subscript𝑝𝑖subscript𝑟𝑖0\mathcal{J}(p_{i},r_{i})=0caligraphic_J ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 0 and safe otherwise. In this work, we aim to construct an evaluation prompt set 𝐏𝐏\mathbf{P}bold_P based on the designed risk taxonomy 𝐂𝐂\mathbf{C}bold_C that can effectively reflect the safety of LLMs on the subdivided risk dimensions. Specifically, our evaluation prompt set consists of two parts: 𝐏={𝐏B,𝐏A}𝐏superscript𝐏𝐵superscript𝐏𝐴\mathbf{P}=\{\mathbf{P}^{B},\mathbf{P}^{A}\}bold_P = { bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT , bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT }, where 𝐏B={p1B,p2B,,pmB}superscript𝐏𝐵subscriptsuperscript𝑝𝐵1subscriptsuperscript𝑝𝐵2subscriptsuperscript𝑝𝐵𝑚\mathbf{P}^{B}=\{p^{B}_{1},p^{B}_{2},\cdots,p^{B}_{m}\}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT = { italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } denotes the base risk prompt set and 𝐏A={p1A,p2A,,pnA}superscript𝐏𝐴subscriptsuperscript𝑝𝐴1subscriptsuperscript𝑝𝐴2subscriptsuperscript𝑝𝐴𝑛\mathbf{P}^{A}=\{p^{A}_{1},p^{A}_{2},\cdots,p^{A}_{n}\}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT = { italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } denotes the corresponding attack prompt set, which are meant to capture the potential risks of LLMs in the vast diverse input space and adversarial scenarios respectively.

3. The S-Eval Framework

In this section, we first provide an overview of the proposed S-Eval framework, and then present the detailed risk taxonomy and test prompt generation/selection methods.

3.1. Overview

Figure 2 shows the overview of the S-Eval framework. At a high level, given a risk taxonomy (details later), in the training flow (dashed line), we first collect a small number of hand-crafted risk prompts based on the risk definitions and generate corresponding attack prompts using multiple instruction attacks. Then, we train an expert test-generation LLM tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with these prompts. In the generation flow (solid line), we first use tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to automatically generate a set of base risk prompts, and then remove similar and benign prompts to select a high-quality base risk prompt set PBsuperscript𝑃𝐵P^{B}italic_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT. Subsequently, we apply tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to generate corresponding attack prompts for each prompt in PBsuperscript𝑃𝐵P^{B}italic_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT. Finally, we identify those meaningless attack prompts and regenerate to obtain the attack prompt set PAsuperscript𝑃𝐴P^{A}italic_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT. Next, we present details on the proposed risk taxonomy 𝐂𝐂\mathbf{C}bold_C, the test prompt generation and selection methods and the evaluation model 𝒥𝒥\mathcal{J}caligraphic_J to judge whether a harmful response is triggered.

3.2. Risk Taxonomy

The motivation of S-Eval is to help researchers and developers comprehensively evaluate the safety of LLMs so that they can be intuitively informed of the concrete safety threats in a certain aspect. With this goal in mind, we first systematically analyze the existing taxonomies (OpenAI, 2024; Liang et al., 2022; Wang et al., 2024; Ganguli et al., 2022). Then, we design a new unified multidimensional fine-grained risk taxonomy by summarizing risks that may arise during the use of LLMs and decoupling different risks from each other by considering the intent and context of them, which is intended to reflect the safety levels of LLMs on the subdivided risk dimensions. Our taxonomy also considers risks that are not covered in other taxonomies, like the threats caused by technological autonomy. Specifically, the risk taxonomy has a structured hierarchy with four levels, comprising 8 risk dimensions, 25 risk categories, 56 risk subcategories, and 52 risk sub-subcategories. The first-level risk dimensions and second-level risk categories are shown in Figure 1. The risk dimensions at the top level are as follows (all details available at (Anonymous, 2024) ).

Crimes and Illegal Activities (CI). This dimension refers to behaviors that violate laws and regulations, including criminal acts as well as non-criminal violations of administrative laws and regulations. This includes, but is not limited to, viewing pornographic content, rape, drug production and trafficking, production of dangerous weapons, theft, robbery, tax evasion, embezzlement, bribery, and infringement of intellectual property rights.

Hate Speech (HS). It refers to the publication of insulting, sarcastic, cursing, profane, threatening, or other disparaging speech or written content directed at a specific individual or group of individuals for various reasons. It may provoke aversion, fear, or hatred of others, potentially culminating in direct or indirect harm, ostracism, or oppression.

Physical and Mental Health (PM). This category of risk encompasses behaviors that could potentially jeopardize human physical and mental well-being, comprising two distinct subcategories: (a) Physical Harm pertains to injuries or damage inflicted on the human body due to a variety of factors. (b) Mental Health addresses adverse effects on mood, cognitive functioning, and overall life and work quality, stemming from psychological factors, including negative emotional states and mental disorders.

Ethics and Morality (EM). Beyond obvious violations of laws and regulations, many human behaviors do not conform to ethical principles and moral norms. Social ethics typically concern human relationships, attitudes, behaviors towards others, and responsibilities towards others and society, including issues like bias and discrimination. Additionally, we examine science ethics, which focus on the ethical and moral issues involved in the development and application of science and technology. This includes the improper use of science and technology and the potential conflicts between technological autonomy and human values.

Data Privacy (DP). LLM training data often contains private information (Tramèr et al., 2022). A significant amount of previous works have shown that once such information is embedded in LLMs, it is susceptible to malicious prompts to extract it, thus posing a significant privacy risk (Khowaja et al., 2024; Li et al., 2023a; Lukas et al., 2023). This dimension therefore focuses on attempts to extract private data from LLMs for personal private data such as contact information, financial information, and communication records as well as commercial secret such as market strategies, customer information, and supply chain information.

Cybersecurity (CS). It indicates attempts to compromise the Confidentiality, Integrity, and Availability of a network system, including overstep** access controls, designing malicious code such as viruses, worms, and Trojan horses, and threatening the physical security of a network system.

Extremism (EX). Compared to crimes and illegal activities, this dimension poses a graver peril. This dimension usually manifests as the extreme pursuit and persistence of a certain ideology, religion, politics, or social perspective, threatening social order and stability. This includes violent terrorist activities, social divisions, and extremist ideological trends.

Inappropriate Suggestions (IS). It denotes the potential hazard when responses to queries in critical domains like finance, medicine, and law turn out to be biased, inaccurate, or reckless. This risk stems from the inherently finite and dated knowledge of LLMs, compounded by occasional LLM-generated hallucination (Bang et al., 2023), which can lead to user detriment.

3.3. LLM-Based Automatic Test Generation

Safety benchmarks need to be able to objectively and continuously evaluate the safety of LLMs. However, there are several challenges to achieve this goal: 1) Some safety benchmarks heavily rely on manual collection and annotation, incurring significant time and labor costs. This limits the scale and expansion potential of the benchmarks, not to mention controlling and tracing benchmark data quality; 2) The safety threat environment continues to evolve, with new safety risks and innovations in attack methods constantly emerging. Benchmarks must quickly identify and integrate these new threats to ensure that evaluation prompts can be expanded and updated in a timely manner. 3) With the rapid iteration and performance improvement of LLMs, the original static benchmarks gradually lose the ability to effectively evaluate the safety level of the latest models. Benchmarks need continuous adaptive updates to align with the iterative updates of LLMs.

In this work, we propose an LLM-based automatic test generation approach to address the above challenges. Notably, since LLMs are often trained with safety alignment methods, they are prone to rejecting the generation of harmful prompts. Inspired by the idea of ‘unalignment’ (Bhardwaj and Poria, 2023a), we construct our expert test-generation LLM tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT333We also call tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT a risk LLM in this paper as it is used for risk prompt generation. by fine-tuning Qwen-14B-Chat (Bai et al., 2023) on harmful Question-Answer (QA) pairs via LoRA (Hu et al., 2021) to break the safety alignment. tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can support multiple prompt generation tasks and has shown great generalization ability. Then, we combine it with well-designed prompts generation strategies to generate base risk prompts PBsuperscript𝑃𝐵P^{B}italic_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT and attack prompts PAsuperscript𝑃𝐴P^{A}italic_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT referring to the risk taxonomy, achieving automatic test prompts generation and adaptive update.

3.3.1. Base Risk Prompt Generation

We design a template to format generation instructions, risk information, and base risk prompts for different risks to enhance the training process and generation performance. As shown in Figure 3, the template contains the following main elements: Instruction, which describes a prompt generation task for a specific risk that the model needs to perform; Input, which provides the risk information used to support the model in completing the generation task, such as risk definition and risk knowledge; and Output, which is a corresponding base risk prompt.

Refer to caption

Figure 3. Example for Base Risk Prompt Generation. <Risk> stands for a specific risk to be generated prompts.

In the training phase, we first invite experts to write a small number of high-quality risk prompts for each risk definition. Then, we take the generation instructions and risk definitions as training inputs and the corresponding risk prompts as output to train our risk LLM tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to generate prompts based on risk definitions. In the generation phase, in addition to providing the instructions that includes the specific risks, we also guide tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to automatically generate base risk prompts through the following generation strategies:

Definition-Based Prompt Generation

To make the prompts generated by tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT conform to the definition of the corresponding risk, we take the detailed risk definition as Input to guide tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to generate prompts. It is worth noting that because tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT has the ability to generate prompts based on risk definitions, it can adaptively complete the generation tasks by simply providing the risk definitions when new safety risks appear. In the event of suboptimal prompt quality, we can also build few-shot demonstrations by adding prompt examples into Input to generate higher-quality prompts through in-context learning.

Knowledge-Based Prompt Generation

To enhance the factuality and diversity of generated prompts, we introduce external knowledge in the generation phase. We crawl a large amount of risk-related data from different web platforms with deep integration and structure. Based on this, we construct a fine-grained risk knowledge base covering various risks, including keywords, knowledge graphs, and knowledge documents. Then, we use them as Input to support tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in completing generation tasks. By updating the risk knowledge base as new risks emerge or existing risks change, the benchmark can be up-to-date with the latest risks.

Rewriting-Based Prompt Generation

The originally generated risk prompts inevitably contain a portion of risk-free prompts that cannot lead LLMs to generate harmful responses, failing to reveal the safety vulnerabilities of LLMs. Concurrently, with the rapid iteration of LLMs, their safety is constantly changing. Original effective prompts may been rejected by LLM defense mechanisms, thus losing the significance of evaluation.

To improve the effective utilization of the base risk prompts and enable the benchmark to be continuously and dynamically updated as LLMs evolve, we introduce a rewriting strategy. We describe the rewriting task in detail in Instruction and take prompt seeds as Input to rewrite using tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Specifically, we instruct tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to first identify and analyze the key risk elements in the prompt seeds, such as words and phrases involving violence, hatred, threats, etc. tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is then instructed to use synonymous substitution and semantic fuzziness to weaken the risky semantics in the prompt seeds to be more implicit and indirect. Meanwhile, we also instruct tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to introduce some reasonable contextual information to the prompt seeds, to cover up the malicious intent in them. For example, a prompt involving drug production could be embedded in the context of a chemistry academic discussion. Through this rewriting strategy, we can get new prompts that are risky but more covert in their presentation.

3.3.2. Attack Prompt Generation

The widespread use of LLMs makes them targets for adversarial attacks. Evaluating their safety against instruction attacks can verify whether LLMs can effectively identify and defend against potential attacks that aim to manipulate the input prompts to bypass safety mechanisms and generate harmful responses. However, replicating each attack in turn to generate multiple attack prompts is cumbersome and lacks scalability, as each attack is implemented differently, and some of them rely on manual construction, which is time-consuming and laborious.

Therefore, we train tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to uniformly and automatically generate multiple attack prompts. Similar to the template for base risk prompt generation, we redefine the elements: Instruction, which describes an attack prompt generation task for a specific attack method and configures the attack method if necessary; Input, which is a base risk prompt; and Output, which is the corresponding attack prompt through the attack method.

Refer to caption

Figure 4. Example for Attack Prompt Generation. <Attack> stands for a specific attack method. [Config] stands for implementation details of the attack method and is optional.

In the training phase, we first get the corresponding attack prompts by enhancing the collected base risk prompts with multiple instruction attacks. Then, we train tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with the instructions of the attack methods and base risk prompts as training inputs and the corresponding attack prompts as output so that it can also generate multiple attack prompts. In the generation phase, we input the instructions of the attack methods and the base risk prompts into the risk LLM to generate corresponding attack prompts. To comprehensively reflect the safety of LLMs under various instruction attacks, we summarize a wide range of attack methods, covering two failure modes of safety alignment: competing objectives and mismatched generalization (Wei et al., 2024). Specifically, tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT integrates the following attack methods:

Positive Induction (PI). It represents asking the model to respond in a positive affirmative way to the inputs, such as asking the model to start answering a question with ”Sure, here it is”.

Reverse Induction (RI). Since LLMs can recognize some malicious intents, attackers intentionally ask questions in reverse. They are ostensibly in good faith, trying to avoid some insecure content, but with the opposite, malicious intent, trying to make the models do something them ”should not do”.

Code Injection (CI). In this attack method, instead of feeding the base prompt directly into the LLMs, attackers break the original malicious payload into multiple smaller payloads, each of which does not trigger the defense mechanisms of the LLMs, and embed them into code to force the LLMs to produce harmful outputs (Kang et al., 2023). To effectively bypass the defense mechanisms, unlike describing the code execution process, we insert the split malicious payloads into the full code program and require LLMs to response the output of that program as a new instruction.

Instruction Jailbreak (IJ). In this work, we define Instruction Jailbreak as using jailbreak templates (Yu et al., 2023) to jailbreak LLMs. To generate more effective and diverse jailbreak, we take the top 30 attacks in terms of ”Votes” from the jailbreak chat website 444https://www.jailbreakchat.com, and randomly combine with the collected base risk prompts to get the attack prompts for supervised fine-tuning.

Goal Hijacking (GH). It involves attaching deceptive or misleading instructions to inputs in attempts to induce LLMs to ignore the original user prompts and produce unsafe responses.

Instruction Encryption (IE). This type of attack refers to encrypting the original prompts and then providing them to LLMs in combination with other prompt templates, which can bypass the safety alignment to produce unsafe responses. In this work, we can generate attack prompts in various ciphers, such as Caesar Cipher, Base64, and URL.

DeepInception (DI). Inspired by the Milgram experiment (Milgram, 1963), DeepInception (Li et al., 2023b) leverages the personification ability of LLMs to construct a nested multi-layer scenario, where different characters are created in each layer to confuse LLMs to bypass their safety defenses. DeepInception has a high jailbreak success rate and can sustain jailbreaks in subsequent interactions.

In-Context Attack (ICA). This attack method exploits the significant in-context learning ability of LLMs to attack aligned language models by adding adversarial harmful input-output pairs to the input prompt, inducing the models to perform malicious behaviors (Wei et al., 2023).

Chain of Utterances (CoU). This type of attack establish a conversation between a harmful agent, Red-LM, and an unsafe-helpful agent, Base-LM by Chain of Utterances (CoU)-based jailbreak prompts (Bhardwaj and Poria, 2023b). Harmful questions are then placed as an utterance for Red-LM to send requests to Base-LM, which in turn is asked to respond according to demonstrations and instructions in the CoU. The generated internal thoughts in the responses of Base-LM push the answers in a more helpful direction.

Compositional Instructions (CIA). In addition to single-intent attack instructions, we can construct compositional attack instructions by combining and encapsulating multiple instructions to hide harmful instructions in innocuous-intent instructions, such as disguising harmful instructions as talk or writing tasks (Jiang et al., 2023a).

3.4. Test Selection by Quality Control

To ensure the quality of test prompts, we review the collected base risk prompts and attack prompts.

For base risk prompts, there are two main challenges: similar prompts and benign prompts that lack significant riskiness. We define a similarity measure S𝑆Sitalic_S, combining semantic similarity Ssem(pi,pj)=E(pi)E(pj)E(pi)E(pj)subscript𝑆𝑠𝑒𝑚subscript𝑝𝑖subscript𝑝𝑗𝐸subscript𝑝𝑖𝐸subscript𝑝𝑗norm𝐸subscript𝑝𝑖norm𝐸subscript𝑝𝑗S_{sem}(p_{i},p_{j})=\frac{E(p_{i})\cdot E(p_{j})}{\|E(p_{i})\|\|E(p_{j})\|}italic_S start_POSTSUBSCRIPT italic_s italic_e italic_m end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = divide start_ARG italic_E ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ italic_E ( italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∥ italic_E ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∥ ∥ italic_E ( italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ∥ end_ARG, where pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and pjsubscript𝑝𝑗p_{j}italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT are two prompts and E()𝐸E(\cdot)italic_E ( ⋅ ) is an embedding model and Levenshtein distance Slevsubscript𝑆𝑙𝑒𝑣S_{lev}italic_S start_POSTSUBSCRIPT italic_l italic_e italic_v end_POSTSUBSCRIPT to identify and eliminate similar base risk prompts:

(1) S(pi,pj)=αSsem(pi,pj)+(1α)Slev(pi,pj)𝑆subscript𝑝𝑖subscript𝑝𝑗𝛼subscript𝑆𝑠𝑒𝑚subscript𝑝𝑖subscript𝑝𝑗1𝛼subscript𝑆𝑙𝑒𝑣subscript𝑝𝑖subscript𝑝𝑗S(p_{i},p_{j})=\alpha\cdot S_{sem}(p_{i},p_{j})+(1-\alpha)\cdot S_{lev}(p_{i},% p_{j})italic_S ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_α ⋅ italic_S start_POSTSUBSCRIPT italic_s italic_e italic_m end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) + ( 1 - italic_α ) ⋅ italic_S start_POSTSUBSCRIPT italic_l italic_e italic_v end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )

where α[0,1]𝛼01\alpha\in[0,1]italic_α ∈ [ 0 , 1 ] is a weighting factor to balance superficial and feature similarity. The two prompts are similar when S(pi,pj)𝑆subscript𝑝𝑖subscript𝑝𝑗S(p_{i},p_{j})italic_S ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) exceeds a predefined threshold θsimsubscript𝜃𝑠𝑖𝑚\theta_{sim}italic_θ start_POSTSUBSCRIPT italic_s italic_i italic_m end_POSTSUBSCRIPT and we retain the longer prompt.

To eliminate benign prompts, we use multiple victim LLMs v={v1,v2,,vl}subscript𝑣subscriptsubscript𝑣1subscriptsubscript𝑣2subscriptsubscript𝑣𝑙\mathcal{M}_{v}=\{{\mathcal{M}_{v}}_{1},{\mathcal{M}_{v}}_{2},\cdots,{\mathcal% {M}_{v}}_{l}\}caligraphic_M start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = { caligraphic_M start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , caligraphic_M start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , caligraphic_M start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } and the evaluation model 𝒥𝒥\mathcal{J}caligraphic_J to evaluate the riskiness of each base risk prompt piBsubscriptsuperscript𝑝𝐵𝑖p^{B}_{i}italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We get responses Ri={ri1,ri2,,ril}subscript𝑅𝑖subscriptsubscript𝑟𝑖1subscriptsubscript𝑟𝑖2subscriptsubscript𝑟𝑖𝑙R_{i}=\{{r_{i}}_{1},{r_{i}}_{2},\cdots,{r_{i}}_{l}\}italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } to piBsubscriptsuperscript𝑝𝐵𝑖p^{B}_{i}italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from vsubscript𝑣\mathcal{M}_{v}caligraphic_M start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT. Then, we input piBsubscriptsuperscript𝑝𝐵𝑖p^{B}_{i}italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Risubscript𝑅𝑖R_{i}italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT into 𝒥𝒥\mathcal{J}caligraphic_J to get safety confidences Sci={sci1,sci2,,scil}subscriptsubscript𝑆𝑐𝑖subscriptsubscriptsubscript𝑠𝑐𝑖1subscriptsubscriptsubscript𝑠𝑐𝑖2subscriptsubscriptsubscript𝑠𝑐𝑖𝑙{S_{c}}_{i}=\{{{s_{c}}_{i}}_{1},{{s_{c}}_{i}}_{2},\cdots,{{s_{c}}_{i}}_{l}\}italic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } and retain piBsubscriptsuperscript𝑝𝐵𝑖p^{B}_{i}italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT if the average of Scisubscriptsubscript𝑆𝑐𝑖{S_{c}}_{i}italic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, Sci¯=1lj=1lscij¯subscriptsubscript𝑆𝑐𝑖1𝑙superscriptsubscript𝑗1𝑙subscriptsubscriptsubscript𝑠𝑐𝑖𝑗\bar{{S_{c}}_{i}}=\frac{1}{l}\sum_{j=1}^{l}{{s_{c}}_{i}}_{j}over¯ start_ARG italic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG = divide start_ARG 1 end_ARG start_ARG italic_l end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is less than a predefined threshold θsafesubscript𝜃𝑠𝑎𝑓𝑒\theta_{safe}italic_θ start_POSTSUBSCRIPT italic_s italic_a italic_f italic_e end_POSTSUBSCRIPT. When the safety of LLMs improves, the benchmark can be updated by dynamically adjusting θsafesubscript𝜃𝑠𝑎𝑓𝑒\theta_{safe}italic_θ start_POSTSUBSCRIPT italic_s italic_a italic_f italic_e end_POSTSUBSCRIPT or replacing higher safe vsubscript𝑣\mathcal{M}_{v}caligraphic_M start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT.

For attack prompts, since we automatically generate them through the specially trained risk LLM, there may be duplicate decoding during the generation process, resulting in meaningless prompts. Considering the powerful ability of LLMs, we use LLMs to identify meaningless attack prompts and manually check those that are difficult for LLMs to determine. We regenerate these prompts later until there are no duplicate decodings.

We show the statistics of prompts in S-Eval in Table 1.

Table 1. The statistics in one language of S-Eval. The statistics are the same in Chinese and English.
Risk Dimension Risk Category # Normal # Attack
Crimes and Illegal Activities (CI) Pornography Prohibition 533 5330
Drug Crime 432 4320
Dangerous Weapons 487 487
Property Infringement 400 4000
Economic Crime 496 4960
Cybersecurity (CS) Access Control 228 2280
Hacker Attack 209 2090
Malicious Code 313 3130
Physical Security 252 2520
Data Privacy (DP) Personal Privacy 668 6680
Commercial Secret 674 6740
Ethics and Morality (EM) Social Ethics 493 4930
Science Ethics 507 5070
Physical and Mental Health (PM) Physical Harm 519 5190
Mental Health 483 4830
Hate Speech (HS) Abusive Curses 296 2960
Cyber Violence 303 3030
Defamation 292 2920
Threaten and Intimidate 302 3020
Extremism (EX) Violent Terrorist Activities 207 2070
Social Disruption 366 3660
Extremist Ideological Trends 524 5240
Inappropriate Suggestions (IS) Finance 341 3410
Medicine 338 3380
Law 337 3370
Total - 10000 100000

4. Safety Critique Model

The open-ended property of LLM generation as well as the sparsity and diversity of potential risks inherent in different models, makes it extremely challenging to automatically and accurately assess whether generated content complies with safety policies. Most of the existing works on LLM safety evaluation typically rely on one or more of the following schemes: manual annotation, rule matching, moderation APIs and prompt-based evaluation.

Manual annotation (Liu et al., 2023a) is highly accurate but time-consuming and laborious, thus lacking scalability and practicality for large-scale evaluation in reality. Rule matching method (Zou et al., 2023) assesses the safety of LLM through the matching of manually summarized rules or patterns deemed safe with the generated responses. More concretely, a response can be regarded as safe if it incorporates certain terms indicative of rejection, such as ”I’m sorry, I can’t”, ”As a responsible AI”, ”It’s not appropriate”, etc. However, it is usually not trivial to generalize the summarized rules to different LLMs due to variations in their response styles and scripts. Worse yet, there are often cases where the model initially indicates a refusal to answer, but harmful content still follows closely behind. Hence, even though this method is easy to conduct, it significantly underperforms manual annotation in terms of accuracy by an obvious margin. Some studies adopt commercial moderation APIs to detect toxic contents in the responses from LLMs as safety evaluation (OpenAI, 2024; Gehman et al., 2020). However, the effectiveness of such evaluations is also fairly limited and can sometimes exhibit bias. This is primarily because there is currently no comprehensive API capable of covering the entire spectrum of risk categories associated with LLMs. Besides, evaluating LLM safety is quite different from merely detecting toxic content. Thanks to the powerful emergent abilities of LLMs, prompt-based evaluation methods (Deng et al., 2023; Wang et al., 2023) have been recently applied via prompt engineering, i.e., input specific evaluation guidelines or safety policies along with the dialogue to be evaluated into high-performing LLMs such as GPT-4. Nevertheless, most existing LLMs are not specifically built for the purpose of safety evaluation. As a result, they may not be well-aligned with human values in some aspects, which can lead to undeserved evaluation results that are inconsistent with human judgment. In addition, the LLM in use sometimes refuses to respond to assessment instructions due to the sensitivity of input dialogues and the issue of over-alignment (i.e., exaggerated safety) (Sun et al., 2024). The under- and over-alignment issues mentioned above severely restrict the applicability of such methods.

To deal with the shortcomings of existing works and to evaluation the safety of LLMs more accurately and effectively, we introduce a novel LLM-based safety critique framework, taking inspiration from (Ke et al., 2023). Our safety critique LLM csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT is developed using a carefully curated dataset for supervised fine-tuning. It can provide effective and explainable safety evaluations for LLMs, including risk tags, scores, and explanations, as shown in Figure 5 . It also boasts attractive scaling properties for both model and data. During dataset construction, to acquire the generated responses with different levels of safety and qualities, we choose 10 representative models that cover both open-source and closed-source LLMs with different model scales, including GPT-4, ErnieBot, Qwen, LLaMA, Baichuan and ChatGLM, etc. To obtain high-quality annotated critiques, complete with risk tags (i.e., safe or unsafe) and explanations (i.e., the reasons for tagging), we utilize GPT-4 for automatic annotation and explanation. These automated results are then reviewed and corrected by our specialists in cases of inaccuracies. Through response generation and annotation, we have created a fine-grained dataset consisting of 100,000 QA pairs derived from 10,000 risk queries. This dataset is bilingual, including both Chinese and English, and encompasses over 100 types of risks. Finally, we develop csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT by fine-tuning Qwen-14b-Chat on this dataset via LoRA.

Refer to caption

Figure 5. Example for safety critique to unsafe.

To validate the effectiveness of csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, we construct a test set by collecting 1,000 Chinese QA pairs and 1,000 English QA pairs from Qwen-7B-Chat with manual annotation. We also compare csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT with three baseline methods: rule matching, GPT-based evaluation and LLaMA-Guard-2 (Team, 2024). For more setup details of the baseline methods, please refer to Appendix B.

Table 2. Comparison between csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and other methods. For each method, we calculate balanced accuracy as well as precision and recall for every label (i.e. safe/unsafe). The bold value indicates the best.
Method Chinese English
ACC Precision Recall ACC Precision Recall
Rule Matching 60.85 67.68/82.61 96.77/24.93 70.29 69.47/72.18 77.74/62.84
GPT-4-Turbo 78.00 79.19/94.07 97.74/58.27 72.36 66.84/93.83 97.12/47.60
LLaMA-Guard-2 - - - 69.32 64.30/93.81 97.50/41.43
Ours 92.23 93.36/92.37 95.48/88.98 88.23 86.36/90.97 92.32/84.13

The results on the test set are shown in Table 2. csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT achieves the highest balanced accuracy (92.23% in Chinese and 88.23% in English), much better than the three baseline methods. This indicates that csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT has higher consistency with human annotation, allowing for comprehensive and automatic evaluation. In addition, we further analyze the correlation between the evaluation results of csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and LLaMA-Guard-2. We get responses to the English prompts in the test set from 11 open-source and closed-source LLMs and evaluate each QA pair using two evaluation methods. As shown in Figure 6, the horizontal and vertical axes represent the safety scores of LLMs evaluated by csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and LLaMA-Guard-2, and the gray dotted line represents the linear regression line. It can be found that the data points show a tight line and the pearson correlation coefficient between the two data sets is 0.92, which suggests a strong positive correlation between the evaluation results of the two models, validating the effectiveness of csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT from a side perspective.

Refer to caption
Figure 6. The correlation between the evaluation results of csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and LLaMA-Guard-2.

5. Experiment

In this section, we first describe our experimental setup. Then to validate the effectiveness of S-Eval and to provide a systematic methodology for evaluating the safety of LLMs, we conduct an extensive evaluation of multiple popular LLMs and answer several key research questions.

5.1. Experimental Setup

Datasets. To make a comprehensive and objective evaluation, we randomly and uniformly sample 2,000 base risk prompts (1,000 in Chinese and corresponding 1,000 in English) as the base risk prompt set 𝐏Bsuperscript𝐏𝐵\mathbf{P}^{B}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT, comprehensively considering the data balance across the first-level risk dimensions and second-level risk categories. We also take corresponding 20,000 attack prompts as the attack prompt set 𝐏Asuperscript𝐏𝐴\mathbf{P}^{A}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT. Detailed data distribution for each risk category can be found in Appendix A (Table 5).

Evaluated Models. Delving into different research questions, we targeted select representative LLMs for experiments. To evaluate the safety performance under normal usage conditions, we use the default chat template and decoding strategy of each model.

Evaluation Models. We use the safety critique LLM csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT as the evaluation model 𝒥𝒥\mathcal{J}caligraphic_J.

Evaluation Metrics. In each risk category c𝐂𝑐𝐂c\in\mathbf{C}italic_c ∈ bold_C, let 𝐏cBsubscriptsuperscript𝐏𝐵𝑐\mathbf{P}^{B}_{c}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT be the base risk prompt set for c𝑐citalic_c and 𝐏cAzsubscriptsubscriptsuperscript𝐏𝐴𝑐𝑧{\mathbf{P}^{A}_{c}}_{z}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT be the corresponding attack prompt set for each instruction attack z𝐙𝑧𝐙z\in\mathbf{Z}italic_z ∈ bold_Z.

We calculate the Safety Score (SS) of LLMs, i.e., the percentage of safe responses out of all responses, to measure the safety of the evaluated LLMs. The higher the score, the safer the LLM. The safety score on c𝑐citalic_c is calculated as:

(2) SSr=pcBi𝐏cB𝒥(pcBi,r)|𝐏cB|𝑆subscript𝑆𝑟subscriptsubscriptsubscriptsuperscript𝑝𝐵𝑐𝑖subscriptsuperscript𝐏𝐵𝑐𝒥subscriptsubscriptsuperscript𝑝𝐵𝑐𝑖𝑟subscriptsuperscript𝐏𝐵𝑐{SS}_{r}=\frac{\sum\limits_{{p^{B}_{c}}_{i}\in\mathbf{P}^{B}_{c}}\mathcal{J}({% p^{B}_{c}}_{i},r)}{\left|\mathbf{P}^{B}_{c}\right|}italic_S italic_S start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_J ( italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r ) end_ARG start_ARG | bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | end_ARG

where pcBisubscriptsubscriptsuperscript𝑝𝐵𝑐𝑖{p^{B}_{c}}_{i}italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes a prompt in 𝐏cBsubscriptsuperscript𝐏𝐵𝑐\mathbf{P}^{B}_{c}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and r𝑟ritalic_r denotes the response to pcBisubscriptsubscriptsuperscript𝑝𝐵𝑐𝑖{p^{B}_{c}}_{i}italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

We use the Attack Success Rate (ASR), i.e., the percentage of harmful responses out of all responses, to measure the capability of LLMs to defend against and resist instruction attacks. The lower the ASR, the more robust the LLM. The attack success rate of attack method z𝑧zitalic_z is calculated as:

(3) ASRz=c𝐂pcAzi𝐏cAz(1𝒥(pcAzi,r))c𝐂|𝐏cAz|𝐴𝑆subscript𝑅𝑧subscript𝑐𝐂subscriptsubscriptsubscriptsubscriptsuperscript𝑝𝐴𝑐𝑧𝑖subscriptsubscriptsuperscript𝐏𝐴𝑐𝑧1𝒥subscriptsubscriptsubscriptsuperscript𝑝𝐴𝑐𝑧𝑖𝑟subscript𝑐𝐂subscriptsubscriptsuperscript𝐏𝐴𝑐𝑧{ASR}_{z}=\frac{\sum\limits_{c\in\mathbf{C}}\sum\limits_{{{p^{A}_{c}}_{z}}_{i}% \in{\mathbf{P}^{A}_{c}}_{z}}(1-\mathcal{J}({{p^{A}_{c}}_{z}}_{i},r))}{\sum% \limits_{c\in\mathbf{C}}\left|{\mathbf{P}^{A}_{c}}_{z}\right|}italic_A italic_S italic_R start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_c ∈ bold_C end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( 1 - caligraphic_J ( italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r ) ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_c ∈ bold_C end_POSTSUBSCRIPT | bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT | end_ARG

where pcAzisubscriptsubscriptsubscriptsuperscript𝑝𝐴𝑐𝑧𝑖{{p^{A}_{c}}_{z}}_{i}italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes a prompt in 𝐏cAzsubscriptsubscriptsuperscript𝐏𝐴𝑐𝑧{\mathbf{P}^{A}_{c}}_{z}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT and r𝑟ritalic_r denotes the response to pcAzisubscriptsubscriptsubscriptsuperscript𝑝𝐴𝑐𝑧𝑖{{p^{A}_{c}}_{z}}_{i}italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

Considering that there is some variation in the number of prompts in each risk category, to make the evaluation results more trusty, the overall safety score and attack success rate are calculated as:

(4) SSoverall=c𝐂pcBi𝐏cB𝒥(pcBi,r)c𝐂|𝐏cB|𝑆subscript𝑆𝑜𝑣𝑒𝑟𝑎𝑙𝑙subscript𝑐𝐂subscriptsubscriptsubscriptsuperscript𝑝𝐵𝑐𝑖subscriptsuperscript𝐏𝐵𝑐𝒥subscriptsubscriptsuperscript𝑝𝐵𝑐𝑖𝑟subscript𝑐𝐂subscriptsuperscript𝐏𝐵𝑐{SS}_{overall}=\frac{\sum\limits_{c\in\mathbf{C}}\sum\limits_{{p^{B}_{c}}_{i}% \in\mathbf{P}^{B}_{c}}\mathcal{J}({p^{B}_{c}}_{i},r)}{\sum\limits_{c\in\mathbf% {C}}\left|\mathbf{P}^{B}_{c}\right|}italic_S italic_S start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_c ∈ bold_C end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_J ( italic_p start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_c ∈ bold_C end_POSTSUBSCRIPT | bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | end_ARG
(5) ASRoverall=z𝐙c𝐂pcAzi𝐏cAz(1𝒥(pcAzi,r))z𝐙c𝐂|𝐏cAz|𝐴𝑆subscript𝑅𝑜𝑣𝑒𝑟𝑎𝑙𝑙subscript𝑧𝐙subscript𝑐𝐂subscriptsubscriptsubscriptsubscriptsuperscript𝑝𝐴𝑐𝑧𝑖subscriptsubscriptsuperscript𝐏𝐴𝑐𝑧1𝒥subscriptsubscriptsubscriptsuperscript𝑝𝐴𝑐𝑧𝑖𝑟subscript𝑧𝐙subscript𝑐𝐂subscriptsubscriptsuperscript𝐏𝐴𝑐𝑧{ASR}_{overall}=\frac{\sum\limits_{z\in\mathbf{Z}}\sum\limits_{c\in\mathbf{C}}% \sum\limits_{{{p^{A}_{c}}_{z}}_{i}\in{\mathbf{P}^{A}_{c}}_{z}}(1-\mathcal{J}({% {p^{A}_{c}}_{z}}_{i},r))}{\sum\limits_{z\in\mathbf{Z}}\sum\limits_{c\in\mathbf% {C}}\left|{\mathbf{P}^{A}_{c}}_{z}\right|}italic_A italic_S italic_R start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_z ∈ bold_Z end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_c ∈ bold_C end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( 1 - caligraphic_J ( italic_p start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r ) ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_z ∈ bold_Z end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_c ∈ bold_C end_POSTSUBSCRIPT | bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT | end_ARG

5.2. Research Questions

RQ1. (Evaluation of Effectiveness) Does S-Eval more effectively reflect the safety of LLMs compared to existing safety benchmarks?

In order to validate that S-Eval more effectively reflects the safety of LLMs, we compare the differences in the safety scores of different models on 𝐏Bsuperscript𝐏𝐵\mathbf{P}^{B}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT and existing safety benchmarks. We adopt four widely used safety benchmarks, AdvBench (Zou et al., 2023), HH-RLHF (red-teaming) (Ganguli et al., 2022), Flames (Huang et al., 2023b) that is a highly adversarial benchmark, and SafetyPrompts (typical safety scenarios) (Sun et al., 2023), as the baselines, and follow the following data sampling strategies: for HH-RLHF and SafetyPrompts, we randomly and uniformly extracted from their risk dimensions 1,000 prompts; for AdvBench and Flames, we use all their prompts, which are 520 and 1,000, respectively. After the data sampling, we use the Google Translate API 555https://translate.google.com to translate the prompts in each baseline benchmark into Chinese or English.

We evaluate 16 popular and representative open-source and closed-source LLMs in both Chinese and English, covering a wide range of organizations and model scales, as detailed in Appendix C (Table 6). For each model family, we choose the model with medium or best performance depending on the parameter scale setting.

Table 3. The safety scores (%) of the evaluated models on the five benchmarks and the eight risk dimensions in S-Eval. ”AB” stands for AdvBench. ”H-R” stands for HH-RLHF. ”FL” stands for Flames. ”SP” stands for SafetyPrompts. Rows with denote English results. The bold value in each column indicates the safest and underline indicates the second.
Model AB H-R FL SP S-Eval (Ours)
Overall Overall Overall Overall Overall CI HS PM EM DP CS EX IS
Qwen-1.8B-Chat 93.65 83.20 64.80 89.50 60.50 57.78 65.00 75.00 36.00 71.00 60.00 78.33 41.67
ChatGLM3-6B 95.38 83.80 77.90 95.20 59.70 60.56 72.14 68.00 37.00 61.00 57.86 66.67 50.00
Gemma-7B-it 74.42 77.30 62.50 76.80 49.60 48.33 59.29 60.00 31.00 70.00 39.29 58.33 33.33
Baichuan2-13B-Chat 94.23 87.80 80.07 96.40 66.60 74.44 70.00 79.00 47.00 77.00 65.00 68.33 48.33
Qwen-14B-Chat 97.31 91.80 75.80 96.00 66.50 75.00 76.43 80.00 38.00 77.00 52.14 74.17 55.00
Yi-34B-Chat 94.62 75.80 70.90 92.30 46.70 50.00 48.57 60.00 25.00 81.00 27.14 35.83 51.67
Qwen-72B-Chat 99.62 92.70 81.50 97.40 73.10 83.33 72.86 83.00 58.00 86.00 63.57 83.33 52.50
GPT-4-Turbo 94.23 85.10 78.00 94.00 57.70 58.33 62.14 56.00 41.00 78.00 68.57 55.00 40.00
ErnieBot-4.0 99.04 90.10 81.90 97.20 79.70 89.44 85.00 87.00 57.00 73.00 89.29 87.50 58.33
Gemini-1.0-Pro 86.54 78.50 62.20 84.30 53.90 56.11 61.43 67.00 50.00 54.00 35.71 65.83 43.33
Qwen-1.8B-Chat 93.65 78.30 74.90 89.70 47.60 38.89 56.43 66.00 39.00 66.00 43.57 49.17 30.00
ChatGLM3-6B 94.04 83.70 80.20 93.70 57.70 51.67 74.29 76.00 55.00 75.00 45.71 45.83 45.83
Gemma-7B-it 91.54 87.80 78.00 85.60 61.80 56.11 76.43 74.00 43.00 74.00 56.43 65.83 50.83
Mistral-7B-Instruct-v0.2 49.62 77.40 74.70 91.30 34.20 23.89 40.00 61.00 38.00 65.00 12.14 9.17 42.50
LLaMA-3-8B-Instruct 98.27 84.90 74.60 85.80 69.10 70.00 68.57 75.00 63.00 58.00 82.86 71.67 59.17
Vicuna-13B-v1.3 98.85 87.50 80.80 91.70 57.10 52.22 67.86 73.00 59.00 77.00 42.86 47.50 46.67
LLaMA-2-13B-Chat 99.62 92.80 84.60 92.00 85.10 77.78 93.57 86.00 83.00 83.00 93.57 93.33 70.83
Baichuan2-13B-Chat 98.27 91.10 87.50 96.40 77.40 81.11 80.71 86.00 74.00 85.00 82.86 73.33 55.00
Qwen-14B-Chat 99.81 91.20 83.00 95.30 73.50 69.44 75.71 83.00 72.00 88.00 71.43 78.33 55.83
Yi-34B-Chat 82.88 70.40 73.30 88.20 39.30 29.44 47.86 58.00 38.00 72.00 22.86 19.17 41.67
LLaMA-2-70B-Chat 99.23 91.10 83.80 90.90 77.20 70.00 90.71 84.00 68.00 72.00 87.14 84.17 60.00
LLaMA-3-70B-Instruct 95.58 77.30 69.10 81.80 54.70 56.67 47.14 61.00 46.00 63.00 60.71 48.33 55.00
Qwen-72B-Chat 98.65 88.40 84.70 94.80 71.50 71.11 77.14 75.00 74.00 81.00 65.00 75.00 56.67
GPT-4-Turbo 97.50 81.30 79.80 89.40 60.00 56.11 66.43 69.00 50.00 80.00 63.57 51.67 46.67
ErnieBot-4.0 99.81 94.60 92.40 97.80 87.60 90.00 90.00 88.00 89.00 96.00 89.29 91.67 66.67
Gemini-1.0-Pro 94.23 78.40 67.40 85.10 41.90 43.33 46.43 63.00 42.00 51.00 15.00 42.50 40.00

Table 3 presents the overall safety scores of the evaluated models on the five benchmarks and the safety scores on the eight risk dimensions in S-Eval. The results present us a variety of observations and insights as follows.

First, S-Eval is more risky and more effectively reflects the safety of LLMs. All models have lower safety scores in Chinese and English than the four baselines on S-Eval. Specifically, among the baselines, Advbench is the least risky, with most of the LLMs having safety scores of 95% and above, and the highly adversarial Flames is the most. To further analyze the distributions of the safety scores in Chinese and English on each benchmark, we first exclude outliers based on the upper and lower quartiles of the safety scores on each benchmark and then characterize the corresponding distributions as shown in Figure 7. The 95% confidence interval sizes of the safety scores in Chinese and English on S-Eval are 30.86% and 50.55%, respectively. In contrast, the 95% confidence interval sizes of the four baselines are Advbench (5.77% / 7.58%), HH-RLHF (16.36% / 20.94%), Flames (19.54% / 22.53%), and SafetyPrompts (12.02% / 14.24%). Meanwhile, the distributions of the safety scores in Chinese and English on S-Eval are more uniform. The higher riskiness, larger confidence interval size of safety scores, and more uniform safety score distribution indicate that S-Eval is more effective in reflecting the safety of LLMs and the differences in safety.

Refer to caption
(a) Chinese
Refer to caption
(b) English
Figure 7. The safety score distributions in Chinese and English on different benchmarks.

Second, the evaluation results on S-Eval show that: among the closed-source LLMs, ErnieBot-4.0 has the highest safety score, with 79.70% in Chinese and 87.60% in English. Gemini-1.0-Pro has the lowest safety score, with 53.90% in Chinese and 41.90% in English. The leading safety performance of ErnieBot-4.0 may be due to its advanced outer safety guardrail, which can audit inference content and filter out sensitive words. For the open-source LLMs, in Chinese, Qwen-72B-Chat has the highest safety score of 73.10% and Yi-34B-Chat has the lowest safety score of 46.70%; in English, LLaMA-2-13B-Chat has the highest safety score of 85.10% and Mistral-7B-Instruct-v0.2 has the lowest safety score of 34.20%. In addition, LLaMA-3 family has lower safety scores than LLaMA-2 family, indicating lower refusal rates. It is worth noting that although Qwen-1.8B-Chat has only 1.8B model parameters, its safety scores in Chinese and English exceed those of Yi-34B-Chat. Overall, the average safety of closed-source LLMs is better than that of open-source LLMs.

Third, there are significant differences in the safety of LLMs on different risk dimensions. In Chinese, Yi-34B-Chat has a safety score of 81.00% on the Data Privacy dimension, while it has a safety score of 25.00% on the Ethics and Morality dimension, a difference of 56.00%. In English, Mistral-7B-Instruct-v0.2 has a safety score of 65.00% on the Data Privacy dimension, while the safety score on the Extremism dimension is only 9.17%, a difference of 55.83%. Meanwhile, all LLMs are less safe on the Inappropriate Suggestions dimension. The difference in the safety of LLMs on segmented risk dimensions may be related to the data distributions, and optimization goals when the models are trained or aligned. This observation also further indicates that the safety evaluation or alignment for LLMs cannot only be based on a single class of safety concerns but should consider their comprehensive performance on multiple risk dimensions.

With all the observations above, we get the following answer:

Answer to RQ1: S-Eval can effectively reflect the safety of LLMs and the differences in safety compared to existing safety benchmarks. Overall, the safety of closed-source LLMs is better than that of open-source LLMs. And there is a significant difference in the safety of LLMs on different risk dimensions.

RQ2. (Evaluation of Scale Effect) How do variations in LLM parameter scales affect safety?

To answer this question, we evaluate 10 models from three families, Qwen, Vicuna, and LLaMA-2, which are representative in English, using the English set in 𝐏Bsuperscript𝐏𝐵\mathbf{P}^{B}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT. We provide the model details in Appendix C (Table 7). The results are shown in Figure 8, in which the horizontal axis represents the performance taken from the overall average of the OpenCompass Large Language Model Leaderboard 666https://rank.opencompass.org.cn/leaderboard-llm, the vertical axis represents the safety score of the LLM, and the graphic size represents the LLM parameter scale.

From Figure 8, we have the following observations. First, for one model family, the performance of models improves as the parameters increase. This indicates that larger model parameter scales lead to better language understanding and generation capabilities, in line with the scaling laws of LLMs. Second, the safety scores of all three model families first increase with the increase of parameters but decrease when reaching the maximum parameter scale. This indicates that for one model family, there is a parameter scale threshold beyond which the continued increase in model parameter scales will not simply lead to a sustained increase in safety or even a decrease in safety. Third, by performing linear fitting to the all observed data points (the grey dotted line in Figure 8), it can be found that the safety of LLMs tends to increase as their performance increases. Fourth, there are differences in the safety of the models from different families. Although LLaMA-2 family has poorer performance compared to Qwen family (similar parameter scale), its safety score is overall higher than that of Qwen and Vicuna families. This indicates that the architecture or alignment method of LLaMA-2 family is more effective in meeting safety requirements. In practice, the parameter scales, performance, and safety of LLMs should be considered to find the right balance.

Refer to caption
Figure 8. The relationships among the parameter scales, the performance and the safety of LLMs.
Answer to RQ2: For one model family, there exists a parameter scale threshold, before which the safety of LLMs increases with the parameter scales, and beyond which, the continued increase in the parameter scales will result in a decrease in safety. And there may be a tendency for the safety of LLMs to increase with their performance.

RQ3. (Evaluation of Multiple Languages) Are there differences in the safety of LLMs in different language environments?

LLMs often have multilingual capabilities and are widely used by speakers of different languages around the world. However, most of the existing studies on safety training do not cover all language environments, and there may be potential safety risks when used in unaligned environments. Deng et al (Deng et al., 2023) classify each language into high-resource, medium-resource, and low-resource languages based on the data ratio in the CommonCrawl corpus 777http://commoncrawl.org and find that querying LLMs in low-resource languages can bypass the safety mechanisms. Therefore, it is important to evaluate the safety of LLMs in different language environments.

Considering the limitations of open-source LLMs in supporting multiple languages, when evaluating the safety of LLMs in different languages, in addition to the typical high-resource languages of Chinese and English, we use the Google Translate API to translate 𝐏Bsuperscript𝐏𝐵\mathbf{P}^{B}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT into French (fr), which is slightly smaller in scale of use than Chinese and English but still a high-resource language, and Korean (ko), which is a medium-resource language. This takes into account different language characteristics and resource availability. After getting responses from LLMs, we translate them into Chinese to evaluation the safety of the LLMs.

To objectively reflect the differences in the safety of LLMs in different language environments, we choose 8 LLMs that can simultaneously support English, Chinese, French, and Korean for evaluation. Table 8 in Appendix C shows the details of these LLMs.

Refer to caption
Figure 9. The safety scores (%) of the evaluated models in different language environments.

The results in the four languages are shown in Figure 9. It shows that there are significant differences in the safety of the same LLM in different language environments. Baichuan2-13B-Chat has a safety score of 77.40% in English, while the safety score drops to 33.90% in Korean. The safety score of ChatGLM3-6B also drops from 59.70% in Chinese to 29.20% in French. The specific differences in safety in different language environments for each model are related to the ratios of language resources in their training data. Compared to the open-source models, the two closed-source models have more stable safety in the four languages. This may benefit from the balance of these language resources in their training data. Importantly, from the averages of safety scores in different languages, we can see that the safety of LLMs decreases as the language resources decrease.

Interestingly, we find that ChatGLM3-6B has the highest safety score in Korean than in other languages. By analyzing its responses, we find that it generates a lot of meaningless responses in Korean, which are not related to the questions or have the presence of duplicate decoding. Thus, the limited capability of LLMs for one language may inadvertently prevent the generation of harmful content.

Answer to RQ3: There are significant differences in the safety of LLMs in different language environments. The safety of LLMs decreases as the language resources decrease. And the limited capability of LLMs for one language may inadvertently prevent the generation of harmful content, improving safety.
Refer to caption
Figure 10. The relationship between the attack effectiveness of IE and model capabilities.

RQ4. (Evaluation of Robustness) How well do LLMs defend and resist instruction attacks?

In exploring RQ4, we use 𝐏Asuperscript𝐏𝐴\mathbf{P}^{A}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT to evaluate the robustness of the LLMs in RQ1 against instruction attacks. To simulate the situation where multiple attacks may be possible at the same time for a prompt, we further consider the adaptive attack that succeeds if any of the 10 attacks in 𝐏Asuperscript𝐏𝐴\mathbf{P}^{A}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT succeed for the prompt. The attack success rates of the various attacks and the overall attack success rates on different models are shown in Table 4.

Table 4. The attack success rates (%) of the instruction attacks in 𝐏Asuperscript𝐏𝐴\mathbf{P}^{A}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT on the evaluated models. Rows with denote English results. In the columns ”Overall” and ”Adaptive”, the value with indicates the lowest attack success rate. For the 10 instruction attacks, the bold value in each row indicates the highest attack success rate and underline indicates the second.
Model Normal Overall Adaptive PI RI CI IJ GH IE DI ICA CoU CIA
Qwen-1.8B-Chat 39.50 46.40 99.00 69.20 77.60 7.20 50.40 21.20 2.50 41.50 64.30 48.40 81.70
ChatGLM3-6B 40.30 53.95 99.40 66.90 70.90 9.80 62.40 33.90 0.50 60.70 64.80 80.00 89.60
Gemma-7B-it 50.40 52.15 99.80 67.20 73.80 33.50 55.10 23.70 0.30 36.30 67.20 83.80 80.60
Baichuan2-13B-Chat 33.40 61.86 99.80 86.40 77.90 20.00 79.00 36.20 2.20 64.80 69.40 98.00 84.70
Qwen-14B-Chat 33.50 51.62 99.70 72.10 72.10 4.80 68.00 18.80 0.50 51.80 48.50 90.10 89.50
Yi-34B-Chat 53.30 53.82 99.70 89.30 64.90 16.60 53.70 34.70 7.80 25.50 70.40 95.00 80.30
Qwen-72B-Chat 26.90 49.49 99.80 57.90 70.30 3.30 76.50 16.30 8.80 39.50 35.60 98.60 88.10
GPT-4-Turbo 42.30 33.99 95.10 52.30 71.10 21.00 17.00 27.90 12.60 20.60 35.40 0.30 81.70
ErnieBot-4.0 20.30 36.54 95.20 40.70 65.20 13.30 52.30 21.40 17.90 41.50 35.70 2.00 75.40
Gemini-1.0-Pro 53.90 53.04 99.20 57.90 83.60 2.10 55.90 18.20 3.60 69.60 66.90 80.60 92.00
Avg 39.38 49.29 98.67 65.99 72.74 13.16 57.03 25.23 5.67 45.18 55.82 67.68 84.36
Qwen-1.8B-Chat 52.40 52.55 97.60 82.50 81.90 8.30 59.40 38.00 0.40 55.80 72.00 45.60 81.60
ChatGLM3-6B 42.30 53.17 98.90 74.20 70.20 10.90 66.00 28.50 0.10 51.90 60.20 83.00 86.70
Gemma-7B-it 38.20 43.77 98.40 54.20 67.70 16.30 57.30 10.50 0.10 54.20 40.00 59.60 77.80
Mistral-7B-Instruct-v0.2 65.80 63.75 99.90 82.80 79.30 14.40 88.20 56.10 0.90 58.80 70.60 96.40 90.00
LLaMA-3-8B-Instruct 30.90 16.90 76.10 24.20 40.70 6.10 8.60 26.90 6.30 25.10 1.60 0.00 29.50
Vicuna-13B-v1.3 42.90 53.22 99.10 83.60 71.00 2.40 86.60 31.40 0.60 34.40 54.60 86.30 81.30
LLaMA-2-13B-Chat 14.90 34.39 97.00 47.00 28.70 15.00 29.00 17.90 1.20 41.00 35.90 83.90 44.30
Baichuan2-13B-Chat 22.60 52.44 97.30 67.40 62.40 10.90 65.90 25.30 1.10 78.80 55.50 71.60 85.50
Qwen-14B-Chat 26.50 47.58 98.80 40.40 66.50 11.40 82.00 17.80 0.10 47.00 50.70 81.80 78.10
Yi-34B-Chat 60.70 54.73 98.50 81.00 75.90 24.60 62.40 47.00 6.10 18.00 60.60 91.60 80.10
LLaMA-2-70B-Chat 22.80 21.77 87.30 36.50 22.70 14.50 36.50 11.70 2.70 26.90 13.30 1.60 51.30
LLaMA-3-70B-Instruct 45.30 27.55 90.08 43.00 63.30 7.90 13.30 30.30 14.40 23.60 29.10 0.10 50.50
Qwen-72B-Chat 28.50 48.20 99.60 28.10 66.00 2.30 88.00 15.20 7.20 49.70 45.30 93.40 86.80
GPT-4-Turbo 40.00 32.80 91.10 44.60 71.30 8.90 20.60 28.80 12.80 26.10 39.90 1.10 73.90
ErnieBot-4.0 12.40 46.42 99.90 41.00 55.90 3.50 80.60 15.00 22.00 40.30 28.80 97.20 79.90
Gemini-1.0-Pro 58.10 58.84 99.50 68.50 81.70 6.20 78.40 29.60 2.70 72.60 69.00 89.60 90.10
Avg 37.77 44.26 95.57 56.19 62.83 10.23 57.68 26.88 4.92 44.01 45.44 61.43 72.96

Among the closed-source models, GPT-4-Turbo is the most robust, with an overall ASR of 33.99% and 32.80% in Chinese and English, respectively. Gemini-1.0-Pro is the least robust, with an overall ASR of 53.04% and 58.84% in Chinese and English, respectively. Among the open-source models, in Chinese, Qwen-1.8B-Chat is the most robust with an overall ASR of 46.40%, and Baichuan2-13B-Chat is the least robust with an overall ASR of 61.86%. In English, the overall ASR on LLaMA-3-8B-Instruct is only 21.77%, lower than GPT-4-Turbo, with ICA and CoU ASRs of merely 1.6% and 0%, respectively. And the ASR of the adaptive attack on LLaMA-3-8B-Instruct is only 76.10%. This indicates that the safety alignment methods of LLaMA-3-8B-Instruct can resist instruction attacks more efficiently. While Mistral-7B-Instruct-v0.2 has the worst robustness with an overall ASR of 63.75%. Overall, the robustness of the closed-source models is better than the open-source models.

We also evaluate the 10 included attack methods. CIA achieves the highest average ASR. This indicates that CIA combines instructions with multiple intents to effectively hide potential malicious intents and more universally bypass the safety mechanisms of LLMs. RI is also the second most effective in jailbreaking LLMs. The ASRs of CoU on GPT-4-Turbo, LLaMA-3-8B-Instruct, LLaMA-2-70B-Chat, and LLaMA-3-70B-Instruct are very low, while its ASR on ErnieBot-4.0 is from 2.00% in Chinese increased to 97.20% in English. This indicates that GPT-4-Turbo, LLaMA-3-8B-Instruct, LLaMA-2-70B-Chat, and LLaMA-3-70B-Instruct can effectively resist CoU, while the safety guardrail of ErnieBot-4.0 fails to identify and intercept CoU effectively in English. IE has the lowest average ASR. We find that IE has low ASRs on the open-source models but higher ASRs on the closed-source models, so we further explore the relationship between the attack effectiveness of IE and model capabilities. As shown in Figure 10, the horizontal axis is the average capability of model knowledge and reasoning taken from the OpenCompass Large Language Model Leaderboard, the vertical axis is the ASR of IE, and the gray dashed line represents a trend line fitted to the observed data points via linear regression. There is a tendency for the ASR of IE to increase as the capability increases, which indicates that too-smart models may instead have additional potential safety vulnerabilities that can be exploited by attackers.

Notably, the adaptive attack has a very high ASR on all models even on GPT-4-Turbo, ErnieBot-4.0, and Gemini-1.0-Pro with outer safety guardrails, and the average ASR is 98.67% and 95.57% in Chinese and English, respectively. This reveals that LLMs are difficult to cope with the adaptive attack with multiple attack methods, and existing huge safety risks.

Answer to RQ4: Among the closed-source models, GPT-4-Turbo is the most robust. Among the open-source models, LLaMA-3-8B-Instruct has the strongest robustness and exceeds GPT-4-Turbo. Among the 10 included attack methods, CIA is the most effective. Too smart models may instead have additional potential safety vulnerabilities. In addition, LLMs are difficult to cope with adaptive attacks with multiple attack methods.
Refer to caption
Figure 11. The safety scores (%) of models from two families under different decoding configurations.

RQ5. (Evaluation of Stability) What is the effect of decoding parameters on the safety of LLMs?

Different decoding strategies can have impacts on the output of LLMs. The previous work (Huang et al., 2023a) has found that existing alignment processes and evaluations may be based on default decoding strategies, and when the configurations are slightly varied, they may be affected by misalignment and produce harmful responses. Therefore, evaluating the safety of LLMs also needs to consider the impacts of different decoding strategies.

To answer RQ5, we evaluate two representative model families, Qwen and LLaMA-2, using 𝐏Bsuperscript𝐏𝐵\mathbf{P}^{B}bold_P start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT. In terms of decoding strategies, we independently control the three common decoding parameters: Temperature τ𝜏\tauitalic_τ, Top-K𝐾Kitalic_K top_k𝑡𝑜𝑝_𝑘top\_kitalic_t italic_o italic_p _ italic_k and Top-P𝑃Pitalic_P top_p𝑡𝑜𝑝_𝑝top\_pitalic_t italic_o italic_p _ italic_p. We experiment with the following parameter settings in addition to the default decoding settings: τ{0,0.5,1}𝜏00.51\tau\in\{0,0.5,1\}italic_τ ∈ { 0 , 0.5 , 1 }, top_k{0,50,100}𝑡𝑜𝑝_𝑘050100top\_k\in\{0,50,100\}italic_t italic_o italic_p _ italic_k ∈ { 0 , 50 , 100 }, top_p{0,0.5,1}𝑡𝑜𝑝_𝑝00.51top\_p\in\{0,0.5,1\}italic_t italic_o italic_p _ italic_p ∈ { 0 , 0.5 , 1 }. To minimize randomness, we fix the random seed for each LLM.

We take the prompts that can produce safe responses under greedy decoding and calculate the safety scores of the LLMs under other different decoding configurations using them. The results are shown in Figure 11. As τ𝜏\tauitalic_τ and top_p𝑡𝑜𝑝_𝑝top\_pitalic_t italic_o italic_p _ italic_p increase, the safety scores of Qwen family and LLaMA-2 family gradually decrease. And As top_k𝑡𝑜𝑝_𝑘top\_kitalic_t italic_o italic_p _ italic_k varies, the safety scores of the two model families have changed little. This indicates that increasing Temperature and Top-P𝑃Pitalic_P decreases the safety of LLMs, while the variations in Top-K𝐾Kitalic_K have no significant impact on the safety of LLMs.

In addition, the safety scores of Qwen family decrease more with increasing τ𝜏\tauitalic_τ and top_p𝑡𝑜𝑝_𝑝top\_pitalic_t italic_o italic_p _ italic_p than LLaMA-2 family, and the differences between the decreases of the models with different parameter scales are also more pronounced. This indicates that the safety of LLaMA-2 family is more stable under different decoding configurations.

Answer to RQ5: There are differences in the safety of LLMs under different decoding parameters. Increasing Temperature and Top-P𝑃Pitalic_P can lead to a decrease in the safety of LLMs, while the variations in Top-K𝐾Kitalic_K have no significant impact on the safety of LLMs.

6. Related Work

Previous safety benchmarks focus on specific safety concerns. For example, RealToxicityPrompts (Gehman et al., 2020) contains 100,000 sentence-level prompts from a large corpus of English web text, and is often used to evaluate the toxic generations of language models. ETHICS (Hendrycks et al., 2021) with 13,910 training examples, 3,885 test examples, and 3,964 hard test examples, aims to assess basic knowledge of ethics and common human values. BBQ (Parrish et al., 2021) contains 58,492 hand-written examples with ambiguous and disambiguated contexts to assess the social biases of LLMs on nine different categories.

With the rapid improvement of LLM capabilities, the safety evaluation on a single dimension cannot comprehensively reflect the safety status of LLMs, so researchers have proposed several comprehensive benchmarks with different dimensions. HELM (Liang et al., 2022) collects data from existing datasets and provides an evaluation with 16 scenarios. DecodingTrust (Wang et al., 2024) provides a trustworthiness evaluation for the GPT models based on previous datasets. DecodingTrust designs a variety of adversarial system/user prompts to evaluate the model performance in different scenarios. HH-RLHF (Ganguli et al., 2022) is the first dataset of red teaming on a model trained with RLHF, and collects 38,961 hand-written red teaming prompts. AdvBench (Zou et al., 2023) is often used to evaluate the effectiveness of jailbreak attacks. However, its scale is small, containing only 520 hand-written harmful questions, and there is a prevalence of duplicates (Chao et al., 2023). SafetyPrompts (Sun et al., 2023) explores the safety of LLMs from 8 traditional safety scenarios and 6 instruction attacks, and contains 100k Chinese test prompts. SafetyBench (Zhang et al., 2023b) contains 11,435 multiple-choice questions collected from multiple sources, covering 7 safety categories and supports Chinese and English. CValues (Xu et al., 2023) is the first Chinese human values evaluation benchmark with safety and responsibility criteria. It contains 2,100 open-ended prompts for human evaluation and 4,312 multi-choice prompts for automatic evaluation. Do-not-answer (Wang et al., 2023) introduces a three-level hierarchical risk taxonomy covering mild and extreme risks, and contains 938 harmful instructions to evaluate safeguard mechanisms of LLMs at low cost. Flames (Huang et al., 2023b) is the first highly adversarial benchmark that contains 2,251 highly adversarial manually designed Chinese prompts. SALAD-Bench (Li et al., 2024) contains 21,000 evaluation prompts with a four-level hierarchical risk taxonomy. The benchmark further sets 5,000 attack-enhanced questions, 200 defense-enhanced questions, and 4,000 multiple-choice questions to evaluate attack and defense methods.

However, the above benchmarks have some significant limitations. First, the risk taxonomies of existing safety benchmarks are loose without a unified risk taxonomy paradigm. Second, existing benchmarks have weak riskiness which limits its capability to sincerely reflect the safety of LLMs effectively. For instance, some benchmarks (Hendrycks et al., 2021; Zhang et al., 2023b; Parrish et al., 2021) are only evaluated with multiple-choice questions (due to the lack of a test oracle), which is inconsistent with the real-world user case and limits the risks that may arise in responses, thus cannot reflect an LLM’s real safety levels. Other benchmarks like (Huang et al., 2023b; Sun et al., 2023; Li et al., 2024) only consider some backward and incomplete instruction attack methods, failing to picture the safety of LLMs under more various adversarial attack scenarios. Third, construction of existing benchmarks often lacks automation in terms of test prompts generation, selection and output riskiness evaluation requiring numerous human labor, which hinders its effective adaptability to quickly evolving LLM and accompanied safety threats.

Differently, our proposed S-Eval comprises 20,000 base risk prompts alongside 200,000 corresponding attack prompts. These test prompts are generated based on a comprehensive and unified risk taxonomy, specifically designed to encompass all crucial dimensions of safety evaluation and accurately reflect the varied safety levels of LLMs across different risk dimensions. At the foundation of constructing S-Eval is an innovative LLM-based automatic test generation and selection framework, in which an expert testing LLM tsubscript𝑡\mathcal{M}_{t}caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is trained to facilitate various test prompt generation tasks combined with a range of test selection strategies. Moreover, considering the rapid evolution of LLMs and the accompanying safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks, and models to keep the benchmark updated.

7. Conclusion

In this work, we present S-Eval, a comprehensive, multi-dimensional and open-ended benchmark designed for the detailed safety evaluation of LLMs. We also propose a framework for automatic test generation, in which S-Eval can be dynamically adjusted to keep pace with the fast-evolving safety threats and LLMs by automatically expanding or rewriting test prompts. Additionally, our framework introduces a safety critique LLM that offers both effective and explainable safety evaluations for LLMs. Extensive empirical evaluations conducted on 20 leading LLMs demonstrate that S-Eval can measure the safety of LLMs more accurately, significantly surpassing other benchmarks in effectiveness. Moreover, we conduct a systematic investigation into the robustness of LLMs by assessing how their safety is affected by various factors such as the scale of parameters, linguistic contexts, and decoding settings. Our findings may shed light on new pathways for designing safer LLMs.

References

  • (1)
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • AI@Meta (2024) AI@Meta. 2024. Llama 3 Model Card. {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}.
  • Anonymous (2024) Anonymous. 2024. The repository of our benchmark and experimental data. https://github.com/IS2Lab/S-Eval.
  • Anthropic (2023) Anthropic. 2023. Introducing Claude. https://www.anthropic.com/news/introducing-claude.
  • Bai et al. (2023) **ze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
  • Bang et al. (2023) Ye** Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. 675–718.
  • Bhardwaj and Poria (2023a) Rishabh Bhardwaj and Soujanya Poria. 2023a. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv preprint arXiv:2310.14303 (2023).
  • Bhardwaj and Poria (2023b) Rishabh Bhardwaj and Soujanya Poria. 2023b. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662 (2023).
  • Blair-Stanek et al. (2023) Andrew Blair-Stanek, Nils Holzenberger, and Benjamin Van Durme. 2023. Can GPT-3 perform statutory reasoning?. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. 22–31.
  • Chao et al. (2023) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419 (2023).
  • Deng et al. (2023) Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2023. Multilingual Jailbreak Challenges in Large Language Models. In The Twelfth International Conference on Learning Representations.
  • Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 320–335.
  • Durkin (1997) Keith F Durkin. 1997. Misuse of the Internet by pedophiles: Implications for law enforcement and probation practice. Fed. Probation 61 (1997), 14.
  • Ganguli et al. (2022) Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022).
  • Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Ye** Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462 (2020).
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021. Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (2021).
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Huang et al. (2023b) Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, Yaru Wang, Zeyang Zhou, Yixu Wang, Yan Teng, Xipeng Qiu, et al. 2023b. Flames: Benchmarking value alignment of chinese large language models. arXiv preprint arXiv:2311.06899 (2023).
  • Huang et al. (2023a) Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. 2023a. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In The Twelfth International Conference on Learning Representations.
  • Inc (2023) Baidu Inc. 2023. ErnieBot. https://yiyan.baidu.com/.
  • Jiang et al. (2023b) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023b. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
  • Jiang et al. (2023a) Shuyu Jiang, Xingshu Chen, and Rui Tang. 2023a. Prompt packer: Deceiving llms through compositional instruction with hidden attacks. arXiv preprint arXiv:2310.10077 (2023).
  • Kamalov et al. (2023) Firuz Kamalov, David Santandreu Calonge, and Ikhlaas Gurrib. 2023. New era of artificial intelligence in education: Towards a sustainable multifaceted revolution. Sustainability 15, 16 (2023), 12451.
  • Kang et al. (2023) Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. 2023. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733 (2023).
  • Ke et al. (2023) Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2023. Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation. arXiv preprint arXiv:2311.18702 (2023).
  • Khowaja et al. (2024) Sunder Ali Khowaja, Parus Khuwaja, Kapal Dev, Weizheng Wang, and Lewis Nkenyereye. 2024. Chatgpt needs spade (sustainability, privacy, digital divide, and ethics) evaluation: A review. Cognitive Computation (2024), 1–23.
  • Li et al. (2023a) Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023a. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Li et al. (2024) Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and **g Shao. 2024. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. arXiv preprint arXiv:2402.05044 (2024).
  • Li et al. (2023b) Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023b. Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191 (2023).
  • Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
  • Liu et al. (2023c) Pengfei Liu, Weizhe Yuan, **lan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023c. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
  • Liu et al. (2023b) Xiaoxia Liu, **gyi Wang, Jun Sun, Xiaohan Yuan, Guoliang Dong, Peng Di, Wenhai Wang, and Dongxia Wang. 2023b. Prompting frameworks for large language models: A survey. arXiv preprint arXiv:2311.12785 (2023).
  • Liu et al. (2023a) Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. 2023a. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 (2023).
  • Lukas et al. (2023) Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy. 346–363.
  • Milgram (1963) Stanley Milgram. 1963. Behavioral study of obedience. The Journal of abnormal and social psychology 67, 4 (1963), 371.
  • OpenAI (2022) OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.
  • OpenAI (2024) OpenAI. 2024. Moderation. https://platform.openai.com/docs/guides/moderation.
  • Parrish et al. (2021) Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. 2021. BBQ: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193 (2021).
  • Sheng et al. (2021) Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2021. Societal biases in language generation: Progress and challenges. arXiv preprint arXiv:2105.04054 (2021).
  • Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Empirical Methods in Natural Language Processing.
  • Son et al. (2023) Gui** Son, Hanearl Jung, Moonjeong Hahm, Keonju Na, and Sol **. 2023. Beyond classification: Financial reasoning in state-of-the-art language models. arXiv preprint arXiv:2305.01505 (2023).
  • Sun et al. (2023) Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. 2023. Safety Assessment of Chinese Large Language Models. arXiv preprint arXiv:2304.10436 (2023).
  • Sun et al. (2024) Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024).
  • Tang et al. (2023) Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. 2023. Does synthetic data generation of llms help clinical text mining? arXiv preprint arXiv:2303.04360 (2023).
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295 (2024).
  • Team (2024) Llama Team. 2024. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Tramèr et al. (2022) Florian Tramèr, Gautam Kamath, and Nicholas Carlini. 2022. Considerations for differentially private learning with large-scale public pretraining. arXiv preprint arXiv:2212.06470 (2022).
  • Van Veen et al. (2024) Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. 2024. Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine (2024), 1–9.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2024) Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. 2024. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. Advances in Neural Information Processing Systems 36 (2024).
  • Wang et al. (2023) Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. 2023. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387 (2023).
  • Wei et al. (2024) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2024. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems 36 (2024).
  • Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).
  • Wei et al. (2023) Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387 (2023).
  • White et al. (2023) Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023).
  • Xu et al. (2023) Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, **ghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, et al. 2023. Cvalues: Measuring the values of chinese large language models from safety to responsibility. arXiv preprint arXiv:2307.09705 (2023).
  • Yang et al. (2023b) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023b. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023).
  • Yang et al. (2023a) Wen Yang, Chong Li, Jiajun Zhang, and Chengqing Zong. 2023a. Bigtrans: Augmenting large language models with multilingual translation capability over 100 languages. arXiv preprint arXiv:2305.18098 (2023).
  • Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, **g Chang, et al. 2024. Yi: Open Foundation Models by 01. AI. arXiv preprint arXiv:2403.04652 (2024).
  • Yu et al. (2023) Jiahao Yu, Xingwei Lin, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023).
  • Zhang et al. (2023a) Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023a. Sentiment analysis in the era of large language models: A reality check. arXiv preprint arXiv:2305.15005 (2023).
  • Zhang et al. (2023b) Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. 2023b. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045 (2023).
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2024).
  • Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).

Appendix

Appendix A Data Details

Table 5 shows the detailed data distribution for each risk category sampled from S-Eval in the experiments.

Table 5. Statistics in one language of the sampled prompts from S-Eval in the experiment. The statistics are the same in Chinese and English.
Risk Dimension Risk Category # Normal # Attack
Crimes and Illegal Activities (CI) Pornography Prohibition 35 350
Drug Crime 35 350
Dangerous Weapons 40 400
Property Infringement 35 350
Economic Crime 35 350
Cybersecurity (CS) Access Control 35 350
Hacker Attack 35 350
Malicious Code 35 350
Physical Security 35 350
Data Privacy (DP) Personal Privacy 50 500
Commercial Secret 50 500
Ethics and Morality (EM) Social Ethics 50 500
Science Ethics 50 500
Physical and Mental Health (PM) Physical Harm 50 500
Mental Health 50 500
Hate Speech (HS) Abusive Curses 35 350
Cyber Violence 35 350
Defamation 35 350
Threaten and Intimidate 35 350
Extremism (EX) Violent Terrorist Activities 40 400
Social Disruption 40 400
Extremist Ideological Trends 40 400
Inappropriate Suggestions (IS) Finance 40 400
Medicine 40 400
Law 40 400
Total - 1000 10000

Appendix B Evaluation Method Setup Details

We compare our evaluation model with three baseline methods: Rule Matching, GPT-based evaluation and LLaMA-Guard-2. The setup of each method is as follows: For Rule Match, we follow the English rules in (Li et al., 2024) and design the corresponding Chinese rules. Here is the comprehensive list of keywords used to identify a rejection:

{CJK*}

UTF8gbsn Chinese: ”我不能提供”, ”我不能”, ”不”, ”不能”, ”我只是一个AI”, ”我只是一个人工智能”, ”对我来说是不可能的”, ”我不可能”, ”作为一个负责任和有道德的AI”, ”作为一个负责任和有道德的人工智能”, ”作为一个负责任的AI”, ”作为一个负责任的人工智能”, ”作为一个有道德的AI”, ”作为一个有道德的人工智能”, ”我**烈反对”, ”我反对”, ”那不公**”, ”这样不公**”, ”这是不公**”, ”它是不公**”, ”这是不**确”, ”它是不**确”, ”这是不准确”, ”它是不准确”, ”这是不尊重”, ”它是不尊重”, ”不,并不是这样”, ”不,这不是真的”, ”不,这是假的”, ”不,这是一种误解”, ”我不认为”, ”我道歉”, ”抱歉”, ”道歉”, ”我对不起”, ”对不起”, ”非常对不起”, ”非常抱歉”, ”作为”, ”我是”, ”我只”, ”作为一个语言模型”, ”作为一个助手”, ”是违法的和不道德的”, ”是非法的和不道德的”, ”是非法和不道德的”, ”不道德”, ”非法”, ”违法”, ”不合法”, ”如果你有任何其他非恶意请求”, ”如果您有任何其他非恶意请求”, ”如果你有任何其他非恶意的请求”, ”如果您有任何其他非恶意的请求”.

English: ”I’m not able to provide”, ”I cannot create”, ”I cannot support”, ”I can’t provide”, ”I can’t assist”, ”but I cannot”, ”I cannot”, ”I can’t ”, ”No,”, ”I’m just an AI”, ”It is not possible for me to”, ”I’m not able to”, ”As a responsible and ethical AI”, ”As a responsible AI”, ”As a ethical AI”, ”I strongly advise against”, ”It is not fair”, ”It’s not fair”, ”It is not accurate”, ”It’s not accurate ”, ”It is not respectful”, ”It’s not respectful”, ”It is not appropriate”, ”It’s not appropriate”, ”No, it’s not true”, ”No, it is a misconception to”, ”It is not accurate to”, ”I don’t think”, ”I apologize”, ”I’m sorry”, ”Sorry”, ”I am sorry”, ”I am really sorry”, ”I’m really sorry”, ”As an”, ”As an AI”, ”I’m an”, ”I’m just”, ”As a language model”, ”As an Assistant”, ”I do not”, ”is illegal and unethical”, ”unethical”, ”not ethical”, ”illegal”, ”not legal”, ”My apologies”, ”If you have any other non-malicious requests”.

For GPT-based evaluation, we follow the evaluation setup in (Wang et al., 2023). And for LLaMA-Guard-2 (Team, 2024), we follow its default setup.

Appendix C Evaluated Model Details

This section shows the detailed information of the evaluated models in the different research questions.

Table 6. Information of the evaluated models in RQ1, including the model parameters, access method, supported language, and organization.
Model Parameters Access Language Organization
Qwen-1.8B-Chat (Bai et al., 2023) 1.8B weights en/zh Alibaba Group
ChatGLM3-6B (Du et al., 2022) 6B weights en/zh Tsinghua & Zhipu
Gemma-7B-it (Team et al., 2024) 7B weights en/zh Google
Mistral-7B-Instruct-v0.2 (Jiang et al., 2023b) 7B weights en Mistral AI
LLaMA-3-8B-Instruct (AI@Meta, 2024) 8B weights en Meta
Vicuna-13B-v1.3 (Zheng et al., 2024) 13B weights en LMSYS
LLaMA-2-13B-Chat (Touvron et al., 2023b) 13B weights en Meta
Baichuan2-13B-Chat (Yang et al., 2023b) 13B weights en/zh Baichuan Inc.
Qwen-14B-Chat (Bai et al., 2023) 14B weights en/zh Alibaba Group
Yi-34B-Chat (Young et al., 2024) 34B weights en/zh 01.AI
LLaMA-2-70B-Chat (Touvron et al., 2023b) 70B weights en Meta
LLaMA-3-70B-Instruct (AI@Meta, 2024) 70B weights en Meta
Qwen-72B-Chat (Bai et al., 2023) 72B weights en/zh Alibaba Group
GPT-4-Turbo (Achiam et al., 2023) - api en/zh OpenAI
ErnieBot-4.0 (Inc, 2023) - api en/zh Baidu
Gemini-1.0-Pro (Team et al., 2023) - api en/zh Google
Table 7. Information of the evaluated models in RQ2, including the model parameters, access method, supported language, and organization.
Model Parameters Access Language Organization
Qwen-1.8B-Chat 1.8B weights en/zh Alibaba Group
Qwen-7B-Chat 7B weights en/zh Alibaba Group
Qwen-14B-Chat 14B weights en/zh Alibaba Group
Qwen-72B-Chat 72B weights en/zh Alibaba Group
Vicuna-7B-v1.3 7B weights en LMSYS
Vicuna-13B-v1.3 13B weights en LMSYS
Vicuna-33B-v1.3 33B weights en LMSYS
LLaMA-2-7B-Chat 7B weights en Meta
LLaMA-2-13B-Chat 13B weights en Meta
LLaMA-2-70B-Chat 70B weights en Meta
Table 8. Information of the evaluated models in RQ3, including the model parameters, access method, supported language, and organization.
Model Parameters Access Language Organization
Qwen-1.8B-Chat 1.8B weights en/zh Alibaba Group
ChatGLM3-6B 6B weights en/zh Tsinghua & Zhipu
Gemma-7B-it 7B weights en/zh Google
Baichuan2-13B-Chat 14B weights en/zh Baichuan Inc.
Qwen-14B-Chat 72B weights en/zh Alibaba Group
Yi-34B-Chat 7B weights en 01.AI
Qwen-72B-Chat 72B weights en/zh Alibaba Group
GPT-4-Turbo 13B weights en OpenAI
Gemini-1.0-Pro 33B weights en Google