Showing 1–7 of 7 results for author: Chern, S

  1. arXiv:2406.13261  [pdf, other]

    cs.CL cs.AI

    BeHonest: Benchmarking Honesty of Large Language Models

    Authors: Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu

    Abstract: Previous works on Large Language Models (LLMs) have mainly focused on evaluating their helpfulness or harmlessness. However, honesty, another crucial alignment criterion, has received relatively less attention. Dishonest behaviors in LLMs, such as spreading misinformation and defrauding users, eroding user trust, and causing real-world harm, present severe risks that intensify as these models appr…

    Submitted 1 July, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

  2. arXiv:2406.12753  [pdf, other]

    cs.CL cs.AI

    OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

    Authors: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, et al. (3 additional authors not shown)

    Abstract: The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoni…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 44 pages

  3. arXiv:2401.16788  [pdf, other]

    cs.CL cs.AI

    Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

    Authors: Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu

    Abstract: Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained…

    Submitted 30 January, 2024; originally announced January 2024.

  4. arXiv:2401.05998  [pdf, other]

    cs.CL cs.AI

    Combating Adversarial Attacks with Multi-Agent Debate

    Authors: Steffi Chern, Zhen Fan, Andy Liu

    Abstract: While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams arXiv:2209.07858. One approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback arXiv:2305.14325. We…

    Submitted 11 January, 2024; originally announced January 2024.

  5. arXiv:2312.15907  [pdf, other]

    cs.CL

    Align on the Fly: Adapting Chatbot Behavior to Established Norms

    Authors: Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu

    Abstract: In this paper, we aim to align large language models with the ever-changing, complex, and diverse human values (e.g., social norms) across time and locations. This presents a challenge to existing alignment techniques, such as supervised fine-tuning, which internalize values within model parameters. To overcome this, we propose an On-the-fly Preference Optimization (OPO) method, which is a real-ti…

    Submitted 26 December, 2023; originally announced December 2023.

  6. arXiv:2307.13528  [pdf, other]

    cs.CL cs.AI

    FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios

    Authors: I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu

    Abstract: The emergence of generative pre-trained models has facilitated the synthesis of high-quality text, but it has also posed challenges in identifying factual errors in the generated text. In particular: (1) A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. (2) Generated texts tend to be lengthy and lack a clearly defined granularity for…

    Submitted 26 July, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

  7. arXiv:2209.12137  [pdf, ps, other]

    math.CO cs.DM

    Burstein's permutation conjecture, Hong and Li's inversion sequence conjecture, and restricted Eulerian distributions

    Authors: Shane Chern, Shishuo Fu, Zhicong Lin

    Abstract: Recently, Hong and Li launched a systematic study of length-four pattern avoidance in inversion sequences, and in particular, they conjectured that the number of $0021$-avoiding inversion sequences can be enumerated by the OEIS entry A218225. Meanwhile, Burstein suggested that the same sequence might also count three sets of pattern restricted permutations. The objective of this paper is not only…

    Submitted 24 September, 2022; originally announced September 2022.

    Comments: 19 pages, comments are welcome

    MSC Class: 05A05; 05A15