Zero-shot NLG evaluation through Pairwise Comparisons with LLMs

Liusie, Adian; Manakul, Potsawee; Gales, Mark J. F.


arXiv:2307.07889v1 (cs)

[Submitted on 15 Jul 2023 (this version), latest version 6 Feb 2024 (v3)]


Abstract: Evaluating Natural Language Generation (NLG) outputs is crucial but laborious and expensive. While various automatic NLG assessment methods have been proposed, they are often task-specific and have to be engineered with a particular domain and attribute in mind. In this work, we propose a robust zero-shot approach to NLG evaluation using pairwise comparative judgment with open-source Large Language Models (LLMs). The motivation for this approach is that, even for humans, it is easier to determine which of two options is better than to score each option independently and objectively. We use this insight and leverage the emergent abilities of LLMs: we prompt FlanT5 to determine which of two candidate responses is better, rather than to assign absolute scores. Our results demonstrate that comparative assessment is a more effective approach than absolute scoring, enabling smaller open-source LLMs to achieve performance comparable to that of larger public-access APIs. We evaluate systems on both summary evaluation and dialogue response generation, and show that open-source LLMs can yield good correlations with human scores for a range of different attributes.
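
The comparative idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how pairwise judgments are aggregated into per-candidate scores, so the win-ratio aggregation below is an assumption, and `win_ratios` and `toy_judge` are hypothetical names. `toy_judge` is a stand-in for the LLM call (it simply prefers the shorter text) so that the example runs without a model.

```python
from itertools import permutations
from typing import Callable, List


def win_ratios(candidates: List[str],
               prefers_first: Callable[[str, str], bool]) -> List[float]:
    """Score each candidate by the fraction of pairwise comparisons it wins."""
    n = len(candidates)
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    wins = [0] * n
    # Judge every ordered pair (i, j), so each pair is seen in both orders;
    # this averages out any preference the judge has for a given position.
    for i, j in permutations(range(n), 2):
        if prefers_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each candidate takes part in 2 * (n - 1) ordered comparisons.
    return [w / (2 * (n - 1)) for w in wins]


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical placeholder for an LLM comparison: prefers the shorter text."""
    return len(a) <= len(b)


summaries = [
    "A concise summary.",
    "A somewhat longer summary of the text.",
    "Short.",
]
scores = win_ratios(summaries, toy_judge)
# Indices of the candidates, best first.
ranking = sorted(range(len(summaries)), key=lambda k: scores[k], reverse=True)
print(scores, ranking)
```

In practice `toy_judge` would be replaced by a function that prompts a model with both candidates and parses which one it prefers. The resulting scores could then be correlated with human ratings.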

Comments:	7 pages, 5 figures, 2 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2307.07889 [cs.CL]
	(or arXiv:2307.07889v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2307.07889

Submission history

From: Adian Liusie
[v1] Sat, 15 Jul 2023 22:02:12 UTC (7,816 KB)
[v2] Wed, 16 Aug 2023 14:55:35 UTC (8,353 KB)
[v3] Tue, 6 Feb 2024 17:05:58 UTC (8,371 KB)
