-
BeHonest: Benchmarking Honesty of Large Language Models
Authors:
Steffi Chern,
Zhulin Hu,
Yuqing Yang,
Ethan Chern,
Yuan Guo,
Jiahe Jin,
Binjie Wang,
Pengfei Liu
Abstract:
Previous works on Large Language Models (LLMs) have mainly focused on evaluating their helpfulness or harmlessness. However, honesty, another crucial alignment criterion, has received relatively less attention. Dishonest behaviors in LLMs, such as spreading misinformation and defrauding users, erode user trust, cause real-world harm, and present severe risks that intensify as these models approach superintelligence levels. Enhancing honesty in LLMs addresses critical deficiencies and helps uncover latent capabilities that are not readily expressed. This underscores the urgent need for reliable methods and benchmarks to effectively ensure and evaluate the honesty of LLMs.
In this paper, we introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses. Building on this foundation, we designed 10 scenarios to evaluate and analyze 9 popular LLMs on the market, including both closed-source and open-source models from different model families with varied model sizes. Our findings indicate that there is still significant room for improvement in the honesty of LLMs. We also encourage the AI community to prioritize honesty alignment in LLMs. Our benchmark and code can be found at: https://github.com/GAIR-NLP/BeHonest.
Submitted 1 July, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Authors:
Zhen Huang,
Zengzhi Wang,
Shijie Xia,
Xuefeng Li,
Haoyang Zou,
Ruijie Xu,
Run-Ze Fan,
Lyumanshan Ye,
Ethan Chern,
Yixin Ye,
Yikai Zhang,
Yuqing Yang,
Ting Wu,
Binjie Wang,
Shichao Sun,
Yang Xiao,
Yiyuan Li,
Fan Zhou,
Steffi Chern,
Yiwei Qin,
Yan Ma,
Jiadi Su,
Yixiu Liu,
Yuxiang Zheng,
Shaoting Zhang,
et al. (3 additional authors not shown)
Abstract:
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
Submitted 18 June, 2024;
originally announced June 2024.
-
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Authors:
Steffi Chern,
Ethan Chern,
Graham Neubig,
Pengfei Liu
Abstract:
Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained by the coverage of existing benchmarks or requires extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist human annotators in discerning the most capable LLMs as evaluators, which significantly eases their workload in cases that used to require large-scale annotations during meta-evaluation. We release the code for our framework, which is publicly available at: https://github.com/GAIR-NLP/scaleeval.
Submitted 30 January, 2024;
originally announced January 2024.
-
Combating Adversarial Attacks with Multi-Agent Debate
Authors:
Steffi Chern,
Zhen Fan,
Andy Liu
Abstract:
While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams arXiv:2209.07858. One approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback arXiv:2305.14325. We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. We also find marginal improvements through the general usage of multi-agent interactions. We further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.
Submitted 11 January, 2024;
originally announced January 2024.
-
Align on the Fly: Adapting Chatbot Behavior to Established Norms
Authors:
Chunpu Xu,
Steffi Chern,
Ethan Chern,
Ge Zhang,
Zekun Wang,
Ruibo Liu,
Jing Li,
Jie Fu,
Pengfei Liu
Abstract:
In this paper, we aim to align large language models with the ever-changing, complex, and diverse human values (e.g., social norms) across time and locations. This presents a challenge to existing alignment techniques, such as supervised fine-tuning, which internalize values within model parameters. To overcome this, we propose an On-the-fly Preference Optimization (OPO) method, which is a real-time alignment that works in a streaming way. It employs an external memory to store established rules for alignment, which can constrain LLMs' behaviors without further training, allowing for convenient updates and customization of human values. We also introduce a scalable evaluation to assess the proposed method more effectively. Experimental results on both human-annotated and auto-generated questions from legal and moral domains indicate the effectiveness of the proposed OPO method. Our code and data are released at https://github.com/GAIR-NLP/OPO.
Submitted 26 December, 2023;
originally announced December 2023.
-
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
Authors:
I-Chun Chern,
Steffi Chern,
Shiqi Chen,
Weizhe Yuan,
Kehua Feng,
Chunting Zhou,
Junxian He,
Graham Neubig,
Pengfei Liu
Abstract:
The emergence of generative pre-trained models has facilitated the synthesis of high-quality text, but it has also posed challenges in identifying factual errors in the generated text. In particular: (1) A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. (2) Generated texts tend to be lengthy and lack a clearly defined granularity for individual facts. (3) There is a scarcity of explicit evidence available during the process of fact checking. With the above challenges in mind, in this paper, we propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models (e.g., ChatGPT). Experiments on four different tasks (knowledge-based QA, code generation, mathematical reasoning, and scientific literature review) show the efficacy of the proposed method. We release the code of FacTool associated with ChatGPT plugin interface at https://github.com/GAIR-NLP/factool.
Submitted 26 July, 2023; v1 submitted 25 July, 2023;
originally announced July 2023.
-
Burstein's permutation conjecture, Hong and Li's inversion sequence conjecture, and restricted Eulerian distributions
Authors:
Shane Chern,
Shishuo Fu,
Zhicong Lin
Abstract:
Recently, Hong and Li launched a systematic study of length-four pattern avoidance in inversion sequences, and in particular, they conjectured that the number of $0021$-avoiding inversion sequences can be enumerated by the OEIS entry A218225. Meanwhile, Burstein suggested that the same sequence might also count three sets of pattern restricted permutations. The objective of this paper is not only a confirmation of Hong and Li's conjecture and Burstein's first conjecture, but also two more delicate generating function identities with the $\mathsf{ides}$ statistic concerned in the restricted permutation case, and the $\mathsf{asc}$ statistic concerned in the restricted inversion sequence case, which yield a new equidistribution result.
Submitted 24 September, 2022;
originally announced September 2022.