Can multiple-choice questions really be useful in detecting the abilities of LLMs?

Li, Wangyue; Li, Liangzhi; Xiang, Tong; Liu, Xiao; Deng, Wei; Garcia, Noa

arXiv:2403.17752 (cs)

[Submitted on 26 Mar 2024 (v1), last revised 23 May 2024 (this version, v3)]

Abstract: Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLMs' capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQs' efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is reflected not only in the evaluation performance but also in the embedding space. Our code and models can be accessed at this https URL.
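
The abstract compares MCQs and LFGQs in terms of expected calibration error (ECE) without spelling the metric out. As a rough illustration only, the sketch below computes ECE in its commonly used equal-width binning form: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the share of samples in that bin. The function name, the `n_bins` default, and the input format are assumptions made for this example; the paper's own implementation may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE (hypothetical helper, not the authors' code).

    confidences: model confidence in its chosen answer, each in [0, 1].
    correct: whether each chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins        # number of correct answers per bin
    count = [0] * n_bins       # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, count):
        if k == 0:
            continue  # empty bins contribute nothing
        avg_conf = s / k
        accuracy = h / k
        ece += (k / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy data: four answers at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means the stated confidence tracks accuracy more closely. In the toy call, a model that is 90% confident but only 75% accurate yields an ECE of roughly 0.15.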

Comments:	LREC-COLING 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.17752 [cs.CL]
	(or arXiv:2403.17752v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.17752

Submission history

From: Wangyue Li
[v1] Tue, 26 Mar 2024 14:43:48 UTC (12,812 KB)
[v2] Thu, 28 Mar 2024 09:57:05 UTC (12,811 KB)
[v3] Thu, 23 May 2024 13:32:25 UTC (12,811 KB)
