Evaluating LLMs at Detecting Errors in LLM Responses

Kamoi, Ryo; Das, Sarkar Snigdha Sarathi; Lou, Renze; Ahn, Jihyun Janice; Zhao, Yilun; Lu, Xiaoxin; Zhang, Nan; Zhang, Yusen; Zhang, Ranran Haoran; Vummanthala, Sujeeth Reddy; Dave, Salika; Qin, Shaobo; Cohan, Arman; Yin, Wenpeng; Zhang, Rui

arXiv:2404.03602 (cs)

[Submitted on 4 Apr 2024]

Abstract: With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show: 1) Top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform much worse than humans. 2) Explanations by LLM-based error detectors lack reliability. 3) LLM-based error detection is sensitive to small changes in prompts but remains challenging to improve. 4) Popular approaches to improving LLMs, including self-consistency and majority vote, do not improve the error detection performance. Our benchmark and code are provided at this https URL.
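
The abstract frames error detection as a binary decision per response and reports results in terms of recall, with majority vote named as one aggregation strategy. The following is a minimal sketch of how such an evaluation could be scored. It is not the authors' evaluation code: the label strings, the toy gold labels, the toy detector votes, and the function names `majority_vote` and `precision_recall_f1` are all assumptions made for illustration, and the tie-breaking rule is an arbitrary choice.

```python
from collections import Counter

# Hypothetical label strings; the benchmark's real label format may differ.
ERROR, NO_ERROR = "error", "no_error"


def majority_vote(votes):
    """Aggregate several detector outputs for one response.

    Ties are resolved toward ERROR (an arbitrary choice for this sketch).
    """
    counts = Counter(votes)
    return ERROR if counts[ERROR] >= counts[NO_ERROR] else NO_ERROR


def precision_recall_f1(gold, predicted, positive=ERROR):
    """Score binary error detection, treating `positive` as the target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (invented): expert labels for five responses, and three
# sampled detector outputs per response.
gold = [ERROR, ERROR, ERROR, NO_ERROR, NO_ERROR]
detector_votes = [
    [ERROR, NO_ERROR, ERROR],
    [NO_ERROR, NO_ERROR, ERROR],
    [NO_ERROR, NO_ERROR, NO_ERROR],
    [NO_ERROR, NO_ERROR, NO_ERROR],
    [ERROR, ERROR, NO_ERROR],
]

predicted = [majority_vote(v) for v in detector_votes]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy run the aggregated detector flags only one of the three erroneous responses, so recall is low even though majority voting is applied, which is the kind of outcome findings 1 and 4 describe.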

Comments:	Benchmark and code: this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2404.03602 [cs.CL]
	(or arXiv:2404.03602v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.03602

Submission history

From: Ryo Kamoi
[v1] Thu, 4 Apr 2024 17:19:47 UTC (906 KB)
