Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

Ruan, Jie; Pu, Xiao; Gao, Mingqi; Wan, Xiaojun; Zhu, Yuesheng

Computer Science > Computation and Language

arXiv:2406.07967 (cs)

[Submitted on 12 Jun 2024]

Abstract: Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers in practice usually perform human evaluation on a small subset of data sampled from the whole dataset. However, different sampled subsets can lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold-standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples that yield a more correct inter-system ranking. Experimental results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.
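
The two quantities reported in the abstract, inter-system ranking Kendall correlation and top-ranked system recognition, can be illustrated with a minimal sketch. This is not the CASF method and not the authors' code: it uses plain random sampling on synthetic scores, and every name in it (mean_scores, kendall_tau, top_system, the sys_a to sys_d systems, the sample sizes) is a hypothetical chosen for illustration. It only shows how a ranking computed from a small subset can be compared against the ranking from the full dataset.

```python
import random
from itertools import combinations


def mean_scores(scores, indices):
    """Average each system's human score over the chosen sample indices."""
    return {name: sum(vals[i] for i in indices) / len(indices)
            for name, vals in scores.items()}


def kendall_tau(a, b):
    """Kendall tau-a between two score dicts keyed by the same system names."""
    concordant = discordant = 0
    for x, y in combinations(sorted(a), 2):
        product = (a[x] - a[y]) * (b[x] - b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_system(means):
    """Name of the system with the highest mean score."""
    return max(means, key=means.get)


# Synthetic stand-in for human ratings (assumption: 4 systems, 1000 samples).
rng = random.Random(0)
true_quality = {"sys_a": 3.0, "sys_b": 3.2, "sys_c": 3.4, "sys_d": 3.6}
scores = {name: [rng.gauss(q, 1.0) for _ in range(1000)]
          for name, q in true_quality.items()}

# Reference ranking: every sample is "annotated".
full = mean_scores(scores, range(1000))

# Random subsets of 50 samples may or may not reproduce the reference ranking.
for trial in range(3):
    subset = rng.sample(range(1000), 50)
    sub = mean_scores(scores, subset)
    tau = kendall_tau(full, sub)
    same_top = top_system(sub) == top_system(full)
    print(f"trial {trial}: tau={tau:.2f}, top system recovered={same_top}")
```

A sampling strategy in the spirit of the paper would replace the rng.sample call with a selection procedure designed to keep tau close to 1 and the top system stable across draws.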

Comments:	With Appendix
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2406.07967 [cs.CL]
	(or arXiv:2406.07967v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.07967

Submission history

From: Jie Ruan
[v1] Wed, 12 Jun 2024 07:44:36 UTC (1,134 KB)
