Better than Random: Reliable NLG Human Evaluation with
Constrained Active Sampling

Jie Ruan, Xiao Pu, Mingqi Gao, Xiaojun Wan, Yuesheng Zhu

Abstract

Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset. However, different sampled subsets lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for getting a more correct inter-system ranking. Experiment results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.

Figure 1: Conducting human evaluations on different sample subsets (Sub) can obtain different inter-system rankings. The lower part shows the same sampling method obtains different subsets at different sampling times. The upper part shows the ranking obtained from the corresponding subsets. “Sys” represents system and “GT” represents Ground Truth.

Introduction

Evaluation of NLG systems remains challenging. Similar content can often be expressed in various ways, and the same output of an NLG system may need to satisfy multiple goals in different aspects (2020; 2022). Hence, reliable automatic metrics are difficult to design (2017; 2009). Human evaluation is generally considered a more reliable way to evaluate natural language generation (2020; 2018; 2015). However, human judgment is expensive and time-consuming, and it lacks standardized evaluation procedures (2020; 2020; 2022).

To save labor and costs, human evaluation is usually performed on a small subset sampled from the dataset in practice. Researchers compare the average scores of the systems on this subset to obtain a ranking between the systems. However, different sample subsets lead to different rankings of the systems. We re-evaluated 137 real NLG evaluation setups on 44 human metrics across 16 datasets and 5 NLG tasks. Results show that 87.5% of datasets have different inter-system rankings across 5 rounds of random sampling. Since research is driven by evaluation and focuses on the final ranking of systems, it is vital to design a more reliable evaluation method to obtain the correct inter-system ranking.

We randomly select 1404 papers from ACL, EMNLP and COLING in the past 2 years and find that 270 papers select a subset of the dataset for manual evaluation to save labor and cost (details are in the Survey section of the Appendix). The survey results show that random sampling is the most common sampling method, accounting for 60.7%, and the remaining 39.3% of the papers do not mention the sampling method they used. Random sampling is widely used in human evaluation for its simplicity. However, random sampling can be risky (2022). On the one hand, random sampling can lead to clustered selection, a phenomenon in which randomly selected samples are unusually close together in a population (as shown in the black and purple circle in Figure 1). On the other hand, random sampling carries a risk of data manipulation: researchers can choose samples at will or run random sampling multiple times to select a favorable subset, which leads to unfair evaluation results. Since different sampled subsets may result in different inter-system rankings in human judgment, it is difficult to reliably select the best system. A better sampling method is needed to deliver reliable human evaluation with low labor and cost.

In this paper, we focus on improving the reliability of the gold standard human evaluation with limited cost and time used for human annotation. Specifically, we explore the problem of clustered selection and data manipulation for manual evaluation sampling and propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. The proposed CASF consists of a Learner, a Systematic Sampler and a Constrained Controller. CASF obtains a representative subset of samples in multiple sampling phases. In each sampling phase, the Learner predicts the quality score for samples and feeds the quality score of each sample to the Systematic Sampler. Then, the Systematic Sampler and the Constrained Controller work together to select representative samples with lower redundancy for the sampling phase. Samples collected in each phase do not duplicate those collected in previous phases. They are directly subjected to human evaluation, and the newly labeled samples are also used to update the Learner.

The main contributions are as follows: 1) We investigate and experimentally analyze the sampling problem for the gold standard human evaluation in natural language generation. 2) We propose a Constrained Active Sampling Framework (CASF) for the sampling problem in manual evaluation. The proposed CASF can solve the problem of clustered selection and data manipulation for human evaluation sampling. 3) We re-evaluate 137 real NLG evaluation setups on 44 human evaluation metrics across 16 datasets and 5 NLG tasks. Experiment results demonstrate the proposed method ranks first or ranks second on 90.91% of the human metrics and receives 93.18% top-ranked system recognition accuracy. To ease the adoption of reliable sampling, we release a constrained active sampling tool. We strongly recommend using CASF to sample test instances for human evaluation. Our tool, code and data are publicly available online at https://github.com/EnablerRx/CASF.

Methodology

Problem Statement

The goal of sampling in human evaluation is to select a subset with the intention of estimating the inter-system ranking of the whole sample population. Ideally, the obtained subset should cover more representative samples of the population. A good sampling method will result in a more correct inter-system ranking calculated through the sampling subset.

The general evaluation sampling problem is as follows. Given a data set $D=\{(x_{i},\mathcal{Y}_{i},\mathcal{Q}_{i})\}_{i=1}^{N}$, where $N$ is the size of the whole sample population, $x_{i}$ represents a data input, $\mathcal{Y}_{i}$ is the corresponding set of generated outputs, and $\mathcal{Q}_{i}$ is the corresponding set of human score vectors. The generated output set $\mathcal{Y}_{i}$ consists of $M$ system outputs and is denoted as $\mathcal{Y}_{i}=\{y_{ij}\}_{j=1}^{M}$, where $y_{ij}$ represents the $j$-th system generated output of the $i$-th sample. The human score vector set $\mathcal{Q}_{i}$ consists of the corresponding human score vector for each system output and is denoted as $\mathcal{Q}_{i}=\{\mathbf{q}_{ij}\}_{j=1}^{M}$. Since human evaluation is usually carried out in multiple aspects, we use a vector to represent human evaluation results from multiple aspects for each system. Each human score vector $\mathbf{q}_{ij}$ consists of $K$ human annotation metrics from different aspects and is denoted as $\mathbf{q}_{ij}=(q_{ij1},...,q_{ijK})$. Eventually there will be a separate inter-system ranking on each aspect. Let $\mathcal{H}$ represent the final sample subset. Function $\psi$ calculates the mean scores of each system in the sample set for each human evaluation aspect and gives the ranking among systems. $\mathcal{P}$ calculates the similarity between two inter-system rankings. The overall objective of sampling and constraint is as follows:

		$\displaystyle\text{minimize}\quad-\mathcal{P}[\psi(\mathcal{H}),\psi(D)],$
		$\displaystyle\text{subject to}\quad|\mathcal{H}|=r\times N,$

where $r$ is the sampling rate and $|\cdot|$ refers to the cardinality of a sample set. $\psi$ first calculates the average human score of each system in the sample set for each aspect, and then gives the inter-system ranking for each human evaluation aspect according to the mean score of each system.
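To make the role of $\psi$ concrete, the following minimal sketch ranks systems by mean human score on a toy population and on two subsets of the same size ($r=0.5$). The data, the function name `psi` and the tie-breaking by system index are illustrative assumptions, not taken from the released code.

```python
from statistics import mean

def psi(samples, aspect):
    """Rank systems by mean human score on one aspect.

    samples[i][j][aspect] plays the role of q_ijk.
    Returns system indices ordered from best to worst mean score.
    """
    num_systems = len(samples[0])
    means = [mean(s[j][aspect] for s in samples) for j in range(num_systems)]
    # Descending mean; ties broken by system index (an assumption of this sketch).
    return sorted(range(num_systems), key=lambda j: (-means[j], j))

# Toy population (invented numbers): 4 samples, 3 systems, 1 aspect.
D = [
    [[5], [3], [1]],
    [[1], [4], [2]],
    [[4], [2], [3]],
    [[2], [5], [3]],
]

full_ranking = psi(D, 0)            # ranking on the whole population
subset_a = psi([D[0], D[2]], 0)     # one subset with r = 0.5
subset_b = psi([D[1], D[3]], 0)     # another subset with r = 0.5
print(full_ranking, subset_a, subset_b)
```

In this toy case the two subsets disagree with each other and with the full population, which is the situation the objective is meant to avoid.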

Sample Representativeness

Taking representative samples allows for a more complete evaluation of the overall performance of a system. Inspired by the theoretical model of summarization (Peyrard 2019), the Representativeness of samples can be measured in two aspects: Quality Diversity and Redundancy. Quality Diversity represents the diversity of sample quality levels, that is, the sampled subset should contain samples of various quality levels. Evaluating on a qualitatively diverse subset lets the measured system performance better reflect the performance on all samples. Here, the Quality of a sample is the average quality of the outputs generated for that sample. More comprehensive coverage of samples of different qualities will result in a better Quality Diversity. Redundancy indicates the degree of similarity or duplication among the generated outputs of samples.

Constrained Active Sampling Framework

Overall Framework

The proposed Constrained Active Sampling Framework aims to select representative samples for human evaluation in multiple phases to get a more correct inter-system ranking. The proposed CASF operates through a Learner, a Systematic Sampler and a Constrained Controller. The goal of the Learner is to predict the quality of samples and give a ranking of sample quality by a regressor. The Systematic Sampler divides samples into multiple buckets according to the sample quality ranking given by the Learner. The Constrained Controller controls the Redundancy of samples and selects a final sample from each bucket given by the Systematic Sampler.

The proposed Constrained Active Sampling Framework is shown in Figure 2. There are several sampling phases denoted by $t$: a preliminary sampling phase $t=0$ (the left branch in Figure 2) and $T$ batch active sampling phases $t=1,...,T$ (the right branch in Figure 2). In the preliminary sampling phase, substitute quality scores for all samples are calculated through an automatic metric, as the Learner is not yet ready to use. The Systematic Sampler then selects a small preliminary subset of samples $\mathcal{H}_{0}$ as part of the final sample subset $\mathcal{H}$ according to the given quality ranking. The selected samples are then evaluated by human annotators. In each batch active sampling phase, the samples selected in all previous phases, together with the corresponding human scores, are fed to the regressor of the Learner. The regressor is updated and applied to predict the quality of the remaining samples, with each sample’s scores over various automatic metrics as features. After that, the Systematic Sampler and Constrained Controller work together to choose a batch subset $\mathcal{H}_{t}$ from the remaining samples for the $t$-th batch active sampling phase as part of the final samples. Then, the samples selected in the $t$-th phase are subjected to human evaluation for use in the subsequent sampling phases. The final sample set $\mathcal{H}$ consists of the batch subsets $\mathcal{H}_{t}$ from each phase. We conduct experiments to explore the determination of the number of phases and the sampling ratio of each phase in the Phases and Associated Sampling Ratios section.

Learner and Sample Quality

Estimating the quality of the samples is a vital step in CASF. Since the quality of samples is difficult to define and calculate directly, we propose a Learner to predict the human scores as the quality scores for the rest samples for selection at each phase $t$ (except the preliminary phase). As various automatic metrics can measure the characteristics of samples in different aspects and are easy to calculate with lower cost, we use scores of automatic metrics as features to predict the quality of samples.

Note that in the preliminary phase, the quality of samples is simply estimated by an automatic metric. In each of the batch active sampling phases, the Learner receives feedback from human annotators and updates its parameters. After that, it utilizes the scores of automatic metrics to predict the quality score for each sample. The Learner then provides the quality ranking $\{p_{t}(i)\}_{i=1}^{N-|\mathcal{H}|}$ of samples at each batch $t$, where $i$ is the sample index and the number of remaining samples for selection in each phase is $N-|\mathcal{H}|$.

The main objective of the Learner $g$ is to map $x_{i}$ to the corresponding human score vector set $\mathcal{Q}_{i}$ . Since there are multiple elements in $\mathcal{Q}_{i}$ , we standardize scores for each human evaluation aspect and use the sum of each element in $\mathcal{Q}_{i}$ , which is the sum of human scores for all aspects of all NLG systems under sample $x_{i}$ , to represent $\mathcal{Q}_{i}$ . The objective is to minimize the following loss function:

$$\underset{\theta_{t}}{\operatorname{argmin}}\sum_{i=1}^{|\mathcal{H}|}\mathcal{L}\left(g(x_{i};\theta_{t}),\sum_{j=1}^{M}\sum_{k=1}^{K}q_{ijk}\right),$$

where $|\mathcal{H}|$ is the number of samples selected in the final subset and $\theta_{t}$ is the parameter of Learner $g$ in the $t$ -th phase. The predicted quality scores $\{s_{t}(i)\}_{i=1}^{N-|\mathcal{H}|}$ for the rest samples at each phase $t$ are calculated as follows:

$$\{s_{t}(i)\}_{i=1}^{N-|\mathcal{H}|}=\{g(x_{i};\theta_{t})\}_{i=1}^{N-|\mathcal{H}|}.$$

Specifically, the Learner first calculates the results of each automatic metric based on the output of each NLG system from the input sample. Then, the automatic metric results under each NLG system will be fed as features into the Learner’s regressor. Eight popular NLG metrics are chosen as the automatic metrics set (details are in the Automatic Metric for Preliminary Phase section) of CASF. Due to the small number of samples and features mainly containing automatic metrics’ scores, we explore several popular learning methods and recommend choosing Gradient Boosting Decision Tree (GBDT) (Friedman 2001) as the regressor of the Learner. Full experimental results are in the Learner Selection section of Appendix. The loss function is the least squares method (2007), which is commonly used in GBDT.
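As a rough illustration of the data the Learner is trained on, the sketch below builds the regression target (the sum of standardized human scores over all systems and aspects of a labeled sample) and a feature vector of automatic-metric scores. Using z-scores for standardization and concatenating the per-system metric scores are assumptions of this sketch, and all function names are hypothetical. A regressor such as GBDT would then be fit on these pairs.

```python
from statistics import mean, pstdev

def standardize_aspects(Q):
    """Z-score each human aspect over all labeled (sample, system) pairs.

    Q[i][j][k] is the human score of system j on aspect k for labeled sample i.
    """
    K = len(Q[0][0])
    stats = []
    for k in range(K):
        values = [q[k] for sample in Q for q in sample]
        sd = pstdev(values)
        # Guard against constant aspects, where the deviation is zero.
        stats.append((mean(values), sd if sd > 0 else 1.0))
    return [[[(q[k] - stats[k][0]) / stats[k][1] for k in range(K)]
             for q in sample] for sample in Q]

def regression_targets(Q):
    """Target for sample i: sum over systems j and aspects k of standardized q_ijk."""
    return [sum(sum(q) for q in sample) for sample in standardize_aspects(Q)]

def feature_vector(metric_scores):
    """Flatten automatic-metric scores (one row per system) into one feature vector."""
    return [score for system_row in metric_scores for score in system_row]

# Invented example: 2 labeled samples, 2 systems, 2 aspects on different scales.
Q = [[[1, 10], [3, 30]],
     [[3, 10], [5, 30]]]
targets = regression_targets(Q)
features = feature_vector([[0.1, 0.2], [0.3, 0.4]])
print(targets, features)
```

Standardizing before summing keeps an aspect scored on a wide scale from dominating the target.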

Systematic Sampler

Systematic sampling has the advantage of eliminating the clustered selection problem and can reduce the risk of favoritism, which matches our motivation. Therefore, we adopt the systematic sampling method (Yates 1948) sorted by relevant signs as the sampling core of CASF. The Systematic Sampler selects representative initial samples and candidate samples according to the quality ranking of samples. Specifically, the Systematic Sampler first divides the $N_{t}=N-|\mathcal{H}|$ samples for the $t$-th phase into $n_{t}$ buckets according to the given quality ranking $\{p_{t}(i)\}_{i=1}^{N-|\mathcal{H}|}$, where $n_{t}$ is the number of samples to be selected at the $t$-th phase. Samples with quality ranking $p_{t}\in[e\times\lfloor\frac{N_{t}}{n_{t}}\rfloor,(e+1)\times\lfloor\frac{N_{t}}{n_{t}}\rfloor)$ are placed in the same bucket, where $e=0,1,...,n_{t}$. The sample with quality rank $p_{t}=e\times\lfloor\frac{N_{t}}{n_{t}}\rfloor$ in each bucket is selected as the initial selection sample, and the remaining samples in each bucket are candidate samples.
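The bucket construction can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name is hypothetical, rank 0 is taken to be the highest predicted quality, and samples whose rank falls beyond the $n_{t}$ full-width buckets are simply left out.

```python
def systematic_buckets(quality_scores, n_select):
    """Split samples into n_select buckets by quality rank.

    Returns one (initial_sample, candidates) pair per bucket, where samples are
    identified by their index in quality_scores.
    """
    N_t = len(quality_scores)
    width = N_t // n_select  # floor(N_t / n_t)
    if width == 0:
        raise ValueError("n_select must not exceed the number of samples")
    # Rank 0 is the highest predicted quality (direction is an assumption).
    order = sorted(range(N_t), key=lambda i: -quality_scores[i])
    buckets = []
    for e in range(n_select):
        members = order[e * width:(e + 1) * width]
        # The sample at rank e * width is the initial selection sample.
        buckets.append((members[0], members[1:]))
    return buckets

# Invented predicted quality scores for 7 remaining samples, 3 to be selected.
scores = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4, 0.8]
buckets = systematic_buckets(scores, 3)
print(buckets)
```

Because the initial selections sit at a regular interval in the quality ranking, they cover high-, mid- and low-quality samples instead of clustering in one region.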

Constrained Controller

The proposed Constrained Controller controls the Redundancy of samples and selects one sample from each of the buckets divided by the Systematic Sampler to form a final sample subset (as shown in Figure 3). Since the Systematic Sampler selects initial samples at a regular interval, which makes the distribution of the initial subset align closely with the overall distribution, we aim to preserve the original sampling intervals as much as possible while controlling the Redundancy to maintain the representativeness of the sample subset.

Specifically, we define objective function $\operatorname{Obj}$ as the quality ranking distance between the current sample $x_{i}$ and the initial selection sample in each bucket. We also define violation function $\operatorname{Vio}$ to calculate the Redundancy between the current sample $x_{i}$ and the final samples. Since the bi-gram similarity (Kondrak 2005) is regarded as a simple and effective method to calculate the redundancy between texts, we calculate the Redundancy by calculating the bi-gram similarity between the outputs generated for the sample and that for the final samples. A sample $x_{i}$ is called feasible if $\operatorname{Vio}(x_{i})=0$ , which means it is not redundant with the selected final samples. Otherwise, $x_{i}$ is infeasible.
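A violation function of this kind could look like the sketch below. It uses a Dice overlap of word-bigram sets as a simple stand-in for the bi-gram similarity of Kondrak (2005), and a hypothetical `threshold` above which a sample counts as redundant; neither choice is specified in the text, so both are assumptions.

```python
def word_bigrams(text):
    """Set of adjacent word pairs in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def bigram_dice(a, b):
    """Dice overlap of word-bigram sets, in [0, 1]."""
    A, B = word_bigrams(a), word_bigrams(b)
    if not A or not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

def violation(candidate_outputs, selected_outputs, threshold=0.5):
    """Vio(x): redundancy in excess of the threshold; 0 means feasible.

    candidate_outputs: system outputs generated for the candidate sample.
    selected_outputs: one such list per already selected final sample.
    """
    worst = 0.0
    for other in selected_outputs:
        # Compare outputs system by system and average the similarities.
        sims = [bigram_dice(y, z) for y, z in zip(candidate_outputs, other)]
        if sims:
            worst = max(worst, sum(sims) / len(sims))
    return max(0.0, worst - threshold)

sim = bigram_dice("the cat sat on the mat", "the cat sat on the rug")
vio_redundant = violation(["the cat sat on the mat"], [["the cat sat on the rug"]])
vio_distinct = violation(["a b c"], [["x y z"]])
print(sim, vio_redundant, vio_distinct)
```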

The Constrained Controller is summarized into 3 rules:

$$
\left\{\begin{array}{ll}
rule\ 1:\ x_{i}\prec x_{j},\ \ \text{if} & x_{i}\text{ is infeasible, }x_{j}\text{ is feasible};\\
rule\ 2:\ x_{i}\prec x_{j},\ \ \text{if} & \operatorname{Vio}\left(x_{i}\right)>\operatorname{Vio}\left(x_{j}\right),\\
 & x_{i}\text{ is infeasible, }x_{j}\text{ is infeasible};\\
rule\ 3:\ x_{i}\prec x_{j},\ \ \text{if} & \operatorname{Obj}\left(x_{i}\right)>\operatorname{Obj}\left(x_{j}\right),\\
 & x_{i}\text{ is feasible, }x_{j}\text{ is feasible},
\end{array}\right.
$$

where $x_{i}$ is the $i$ -th sample and $x_{i}\prec x_{j}$ means $x_{j}$ is a better choice. $rule\ 1$ means the Constrained Controller tends to select samples that are not redundant. $rule\ 2$ represents that if two samples are both redundant with the final samples, the Constrained Controller tends to select samples with less redundancy. $rule\ 3$ demonstrates that if two samples are both not redundant with the final samples, the Constrained Controller tends to select samples with ranks as close as possible to those of the initial selection samples.

In Figure 3, the remaining samples for selection are first re-indexed, and then re-ordered according to the Learner’s predicted quality score. The Systematic Sampler divides the samples into three buckets based on quality ranking and marks the initial selection sample for each bucket. In the first bucket, only sample 3 is feasible, that is, sample 3 is not redundant with the existing final samples. Thus, sample 3 is selected as the final sample according to $rule\ 1$. In the second bucket, none of the three samples is feasible, so sample 0, which has the smallest redundancy, is selected as the final sample according to $rule\ 2$. In the third bucket, all samples are feasible, and sample 5 is the initial selection sample, so it is selected by default or according to $rule\ 3$.
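The three rules amount to a pairwise preference between candidates in a bucket, which can be sketched as below. The `vio` and `obj` values and the ids of the non-selected samples are invented for illustration; only the winners (samples 3, 0 and 5) follow the Figure 3 walkthrough.

```python
def better(a, b):
    """Return True if candidate a is preferred over candidate b under rules 1-3.

    Each candidate is a dict with 'vio' (redundancy violation, 0 means feasible)
    and 'obj' (rank distance to the bucket's initial selection sample).
    """
    a_ok, b_ok = a["vio"] == 0, b["vio"] == 0
    if a_ok != b_ok:             # rule 1: feasible beats infeasible
        return a_ok
    if not a_ok:                 # rule 2: both infeasible, smaller violation wins
        return a["vio"] < b["vio"]
    return a["obj"] < b["obj"]   # rule 3: both feasible, closer to initial rank wins

def select_from_bucket(candidates):
    """Pick the preferred candidate of one bucket by repeated pairwise comparison."""
    best = candidates[0]
    for c in candidates[1:]:
        if better(c, best):
            best = c
    return best

# Hypothetical buckets; the first entry of each is the initial selection sample.
bucket_1 = [{"id": 6, "vio": 0.3, "obj": 0}, {"id": 3, "vio": 0.0, "obj": 1},
            {"id": 8, "vio": 0.1, "obj": 2}]
bucket_2 = [{"id": 2, "vio": 0.4, "obj": 0}, {"id": 0, "vio": 0.1, "obj": 1},
            {"id": 7, "vio": 0.2, "obj": 2}]
bucket_3 = [{"id": 5, "vio": 0.0, "obj": 0}, {"id": 1, "vio": 0.0, "obj": 1},
            {"id": 4, "vio": 0.0, "obj": 2}]
chosen = [select_from_bucket(b)["id"] for b in (bucket_1, bucket_2, bucket_3)]
print(chosen)
```

Since the initial selection sample has rank distance 0, it is kept whenever it is feasible, which preserves the regular sampling interval as far as the redundancy constraint allows.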

Dataset, HE Metric: R 1, R 2, R 3, Mean, H 1, H 2, H 3, Mean, CASF (ours)
SummEval, coherence: 0.85, 0.65, 0.33, 0.61, 0.70, 0.82, 0.92, 0.81, 0.42, 0.87, 0.95
SummEval, consistency: 0.25, 0.48, 0.43, 0.39, 0.68, 0.02, 0.65, 0.45, 0.30, 0.17, 0.53
SummEval, fluency: 0.40, 0.35, 0.52, 0.42, 0.45, 0.30, 0.40, 0.35, 0.37, 0.52, 0.33
SummEval, relevance: 0.72, 0.60, 0.68, 0.67, 0.65, 0.43, 0.72, 0.60, 0.40, 0.60, 0.45, 0.82
REALSumm, litepyramid: 0.39, 0.54, 0.44, 0.46, 0.36, 0.38, 0.44, 0.39, 0.33, 0.37, 0.54
NeR18, coherence: 1.00, 0.43, 0.81, 0.90, 1.00
NeR18, fluency: 0.52, 1.00, 0.84, 1.00, 0.52, 0.90, 0.81, 1.00
NeR18, informativeness: 1.00, 0.71, 1.00, 0.90, 0.87, 0.71, 1.00
NeR18, relevance: 1.00, 0.52, 1.00, 0.84, 0.90, 1.00
DialSumm, consistency: 0.74, 0.72, 0.49, 0.65, 0.74, 0.64, 0.62, 0.67, 0.59, 0.56, 0.54, 0.77
DialSumm, relevance: 0.69, 0.46, 0.64, 0.60, 0.64, 0.69, 0.54, 0.62, 0.23, 0.44, 0.59, 0.72
DialSumm, fluency: 0.59, 0.56, 0.59, 0.58, 0.38, 0.56, 0.51, 0.49, 0.15, 0.49, 0.64, 0.62
DialSumm, coherence: 0.67, 0.80, 0.74, 0.80, 0.59, 0.71, 0.59, 0.67, 0.82, 0.90
OpenAI 1, accuracy: 0.80, 0.00, 1.00, 0.60, 0.80, 1.00, 0.80, 0.87, 0.80, 0.00, 1.00
OpenAI 1, coherence: 0.40, 0.80, 0.00, 0.40, 0.80, 0.20, 0.80, 0.60, 0.80, 0.40, 0.20, 0.80
OpenAI 1, coverage: 1.00, 0.80, 1.00, 0.80
OpenAI 1, overall: 0.80, 1.00, 0.93, 0.80, 1.00, 0.80, 0.87, 0.80, 1.00, 0.80, 1.00
OpenAI 2, accuracy: 0.71, 0.43, 1.00, 0.71, 0.62, 0.71, 0.81, 0.71, 1.00, 0.52, 0.14, 0.90
OpenAI 2, coherence: 0.24, 0.52, 0.33, 0.37, -0.14, 0.24, 0.43, 0.17, 0.24, 0.52, 0.24, 0.43
OpenAI 2, coverage: 1.00, 0.71, 0.90, 0.87, 1.00, 0.90, 1.00, 0.97, 1.00
OpenAI 2, overall: 0.90, 0.71, 1.00, 0.87, 0.62, 1.00, 0.90, 0.84, 0.90, 1.00
OpenAI 3, accuracy: 0.73, 0.82, 0.79, 0.87, 0.78, 0.82, 0.73, 0.69, 0.78, 0.87
OpenAI 3, coherence: 0.51, 0.33, 0.56, 0.47, 0.42, 0.51, 0.56, 0.50, 0.56, 0.20, 0.60
OpenAI 3, coverage: 0.38, 0.87, 0.54, 0.51, 0.87, 0.51, 0.63, 1.00, 0.42, 0.87
OpenAI 3, overall: 0.87, 0.51, 1.00, 0.79, 1.00, 0.73, 0.51, 0.75, 1.00, 0.38, 0.47, 1.00
OpenAI 4, accuracy: 1.00, 0.33, 1.00, 0.78, 1.00, 0.33, 0.56, 0.33, 1.00, 0.33, 1.00
OpenAI 4, coherence: 1.00, 0.33, 1.00
OpenAI 4, coverage: 0.33, 1.00, 0.78, 0.33, 1.00, 0.78, 1.00
OpenAI 4, overall: 0.33, 1.00, 0.78, 0.33, 1.00, 0.78, 1.00
newstest 1, MQM: 0.14, 0.33, 0.14, -0.05, 0.14, 0.33, 0.14
newstest 1, pSQM: 0.81, 0.90, 0.87, 0.81, 0.90, 0.87, 1.00, 0.90, 1.00
newstest 2, MQM: 0.79, 0.93, 0.71, 0.81, 0.64, 0.86, 0.71, 0.74, 0.14, 0.93, 0.86, 0.93
newstest 2, pSQM: 0.43, 0.36, 0.79, 0.52, 0.29, 0.86, 0.43, 0.52, 0.36, 0.93, 0.79
newstest 3, MQM: 0.00, -0.13, -0.05, -0.06, -0.05, -0.03, -0.05, -0.04, 0.46, 0.13, 0.00, 0.03
Persona, Understandable: 0.33, -1.00, 0.33, -0.11, -1.00, 0.33, -0.11, 0.33
Persona, Natural: 0.33, -1.00, 1.00, 0.11, 1.00, -1.00, 0.33, 0.11, 0.33, 1.00
Persona, Maintains Context: 1.00, -1.00, 1.00
Persona, Interesting: 1.00, 0.33, 1.00, 0.78, 1.00
Persona, Uses Knowledge: 1.00, -1.00, 1.00, 0.33, 1.00
Persona, Overall Quality: 1.00
MANS-ROC, overall: 1.00
MANS-WP, overall: 1.00, 0.80, 0.87, 0.80, 1.00, 0.93, 1.00
THUMB, overall: 1.00, 0.80, 1.00, 0.93, 1.00
VATEX, consistency: 0.60, 1.00, 0.60, 0.73, 0.60, 1.00, 0.87, 1.00
Overall Performance: 0.69, 0.61, 0.75, 0.68, 0.61, 0.67, 0.65, 0.72, 0.67, 0.72, 0.68, 0.83

Table 1: Kendall’s Tau of methods on 16 datasets across 5 NLG tasks. ’HE Metric’ indicates different human evaluation aspects in a dataset. Bold number indicates that the method has the best performance among all methods under the corresponding aspect. Underlined number indicates the method ranks second.

Experimental Setup

Tasks and Datasets

We conduct experiments on 44 human metrics across 16 datasets spanning 5 tasks. A total of 137 NLG systems are involved. Details of the datasets, preprocessing and the validation set for hyper-parameter selection are in the Tasks and Dataset section of the Appendix. The datasets are: Summarization (SUM): We utilize 8 human evaluation datasets of model-generated summaries, which are SummEval (2021), REALSumm (2020), Newsroom (NeR18) (2018), DialSummEval (DialSumm) (2022), OpenAI-axis1 (OpenAI 1) (2020; 2017), OpenAI-axis2 (OpenAI 2), OpenAI-CNN/DM1 (OpenAI 3), and OpenAI-CNN/DM3 (OpenAI 4). Machine Translation (MT): We use 3 datasets collected from WMT news translation tasks (2021), namely newstest2020 en-de (newstest 1), newstest2020 cn-en (newstest 2) and newstest2021 cn-en (newstest 3). Dialogue Generation (DGen): We utilize a human annotation dataset of machine-generated dialogues released with the Persona Chat (Persona) (Mehri and Eskenazi 2020) dataset. Story Generation (SGen): We use two manual evaluation datasets for story generation, namely MANS-ROC (Guan et al. 2021) and MANS-WP (Guan et al. 2021). Multi-Modal Generation (MMGen): We use two existing human evaluation datasets, namely THUMB-MSCOCO (THUMB) (Kasai et al. 2022) and VATEX-EVAL (VATEX) (Shi et al. 2022).

Evaluation Metric

We select a subset of each dataset and then compute the results for all the human metrics in various aspects. We measure the efficacy of a sampling method by computing the rankings of candidate models on the subset and their Kendall’s Tau correlation (1938) with the rankings obtained on the full dataset. We follow Kendall’s treatment (1945) to handle ties.
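For reference, a tie-corrected Kendall correlation can be computed as in the sketch below. It implements the tau-b variant, which is the usual form of Kendall's tie treatment; that the experiments use exactly this variant is an assumption here, and the system scores are invented.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists, with tie correction."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue             # tied in both lists: counted in neither term
            if dx == 0:
                ties_x += 1          # tied only in x
            elif dy == 0:
                ties_y += 1          # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Mean human scores of 4 systems on the full data and on a subset (invented).
full_means = [4.1, 3.8, 3.2, 2.9]
subset_means = [4.0, 3.9, 2.7, 3.0]
tau = kendall_tau_b(subset_means, full_means)
print(round(tau, 2))
```

Applying the function to per-system mean scores is equivalent to comparing the two inter-system rankings, because only the pairwise order of the systems enters the statistic.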

Comparison of Methods

The comparison methods are selected based on the survey of evaluation sampling methods in 1404 papers, where Random and Heuristic are the main sampling methods for NLG human evaluation. We also include some ablation methods. The comparison methods are: Random Sampling (R) randomly samples the dataset and is performed 3 times (2008; 2022; 2007) to reflect real sampling scenarios. The result of each run and the average result are recorded. Heuristic Sampling (H) (2022) first sorts the samples according to the average length of the generated sentences. Then, Heuristic randomly collects a small number of samples with extreme sentence length and a large number of samples with normal sentence length. Heuristic is performed 3 times. Eight Metric (8M): CASF with only the preliminary sampling phase, which normalizes the scores obtained by the 8 automatic metrics used in CASF and calculates the average score. Single Metric (SM): CASF with only the preliminary sampling phase, which uses the automatic metric used in the preliminary sampling phase of CASF. Online Sampling (OL): CASF without the Constrained Controller. We compare methods at a 50% sampling rate. Results for other sampling ratios are in the Different Sampling Ratio section of the Appendix. In addition, the number of phases and the sampling ratio of each phase are set to 5 and 10%, respectively. The determination of these parameters is shown in the Phases and Associated Sampling Ratios section. We also treat the sample size as an independent variable, and the results are shown in the Appendix.

Results and Analysis

Method	SUM	MT	DGen	SGen	MMGen	Overall
R	0.76	0.87	0.78	0.67	1.00	0.76
H	0.80	0.67	0.78	0.67	1.00	0.78
8M	0.83	0.80	0.83	1.00	1.00	0.84
SM	0.90	1.00	0.83	1.00	1.00	0.91
OL	0.69	0.80	1.00	1.00	1.00	0.77
CASF	0.93	0.80	1.00	1.00	1.00	0.93

Table 2: Top-ranked accuracy on 5 NLG tasks. ‘Overall’ shows the average result on all human metrics from all tasks.

Comparison Results

Full Inter-System Ranking Accuracy

According to results on the validation set (Automatic Metrics for Preliminary Sampling Phase section of the Appendix), we select MOVER-SCORE (Zhao et al. 2019) for calculating sample quality in the preliminary sampling phase. The inter-system ranking accuracy of the methods on 16 datasets across 5 NLG tasks is shown in Table 1. The results show that Random fluctuates widely. For example, on the newstest2020 cn-en dataset of the MT task, different rounds of random sampling result in different inter-system correlations. This shows the risk of widely using Random in evaluation. CASF ranks first on 79.55% of human metrics and ranks first or second on 90.91% of metrics. This shows CASF can better select representative samples to get a more accurate ranking. Results on the remaining human metrics, although not ranked first, are still acceptable and close to the best results. These results arise because we measure a single quality score for each sample in the dataset, whereas human evaluation in different aspects is conducted on the same dataset; the overall scores can still represent the overall evaluation results. We use the Wilcoxon signed-rank test (2006) to compare the results of Random and Heuristic (both iterated 10000 times) with CASF on 44 human metrics. Results show that CASF significantly outperforms Random, Heuristic and the other methods, with $p=0.00010$, $p=0.00009$ and $p<0.05$, respectively.

Top-Ranked System Accuracy

One of the important goals of evaluation is to select the top-ranked system. Accurately selecting the best system with limited manpower can help the NLG field to keep good systems and eliminate poor ones. Thus, we explore the ability of CASF to identify the top-ranked system. As shown in Table 2, CASF achieves 93.18% top-ranked system recognition accuracy on 44 human evaluation metrics involving 137 NLG systems. For typical NLG tasks like DGen, SGen and MMGen, CASF achieves 100% identification accuracy. Experimental results also show that CASF significantly outperforms the popular Random and Heuristic methods at the $p<0.05$ level.
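Top-ranked accuracy can be read as the fraction of human metrics for which the best system on the subset matches the best system on the full data. The sketch below follows that reading; the function names, the first-index tie-breaking and the scores are assumptions for illustration.

```python
def top_system(mean_scores):
    """Index of the system with the highest mean human score (first index on ties)."""
    return max(range(len(mean_scores)), key=lambda j: mean_scores[j])

def top_ranked_accuracy(subset_means, full_means):
    """Fraction of human metrics whose top system on the subset matches the full data."""
    hits = sum(top_system(s) == top_system(f)
               for s, f in zip(subset_means, full_means))
    return hits / len(full_means)

# One row of per-system mean scores per human metric (invented numbers).
full = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.0], [2.0, 3.0, 1.0]]
subset = [[2.9, 2.0, 1.5], [1.0, 3.0, 2.0], [2.0, 3.5, 1.0]]
accuracy = top_ranked_accuracy(subset, full)
print(round(accuracy, 4))

# Under this reading, 41 matching metrics out of 44 would give the reported figure.
reported = round(41 / 44 * 100, 2)
print(reported)
```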

M	#P	P-R	B-R	Tau	M	#P	P-R	B-R	Tau	M	#P	P-R	B-R	Tau	M	#P	P-R	B-R	Tau
A	2	0.25	0.25	0.75	F	2	0.10	0.40	0.73	F	2	0.05	0.45	0.74	F	2	0.15	0.35	0.73
	3	0.17	0.17	0.76		3	0.10	0.20	0.75		3	0.05	0.23	0.74		3	0.15	0.18	0.77
	4	0.13	0.13	0.76		4	0.10	0.13	0.80		4	0.05	0.15	0.77		4	0.15	0.12	0.76
	5	0.10	0.10	0.83		5	0.10	0.10	0.83		5	0.05	0.11	0.77		5	0.15	0.09	0.76
	6	0.08	0.08	0.72		6	0.10	0.08	0.75		6	0.05	0.09	0.73		6	0.15	0.07	0.71
	7	0.07	0.07	0.72		7	0.10	0.07	0.69		7	0.05	0.08	0.69		7	0.15	0.06	0.73
	8	0.06	0.06	0.70		8	0.10	0.06	0.73		8	0.05	0.06	0.72		8	0.15	0.05	0.79
	9	0.06	0.06	0.73		9	0.10	0.05	0.72		9	0.05	0.06	0.72		9	0.15	0.04	0.75
	10	0.05	0.05	0.75		10	0.10	0.04	0.73		10	0.05	0.05	0.75		10	0.15	0.04	0.70

Table 3: Experimental results on 44 human metrics with different mode (M) (Average (A) and Preliminary-Fixed (F)), number of phases (#P), preliminary sample ratio (P-R) and batch sampling ratio (B-R) of each phase for the proposed CASF.

Case Study

Taking the human aspect accuracy in the OpenAI 1 (Stiennon et al. 2020; Völske et al. 2017) dataset as an example, CASF obtains an accurate inter-system ranking as shown in Figure 4. The 3 rounds of random sampling produce different inter-system rankings, and the rank of the first system fluctuates between first and fourth, with great volatility. This confirms the problem we raised about the risk of random sampling, which makes evaluation unreliable. CASF selects the same subset across repeated runs, so the variance of the inter-system ranking accuracy over multiple sampling runs is 0 (Learner Selection section of the Appendix). Since CASF selects representative samples, it obtains more accurate inter-system rankings, making evaluation more reliable.

Automatic Metric for Preliminary Phase

Metric	SUM	MT	DGen	SGen	MMGen	Avg
BERT-S	0.74	0.58	0.67	1.00	1.00	0.73
MOVER-S	0.84	0.58	0.89	1.00	1.00	0.83
ROUGE-1	0.73	0.57	0.67	0.30	1.00	0.70
ROUGE-2	0.73	0.55	0.56	1.00	0.80	0.70
ROUGE-L	0.72	0.52	0.89	1.00	1.00	0.75
BART-S	0.60	0.44	0.89	0.90	0.80	0.64
BLEU	0.72	0.37	0.56	1.00	0.80	0.67
METEOR	0.78	0.54	0.89	1.00	1.00	0.79

Table 4: Results of CASF pre-ranking on different automatic metrics. “-S” indicates “-Score”. “Avg” represents the average result on all human metrics from all tasks.

We choose automatic metrics commonly used in NLG as our automatic metrics set, including BERT-SCORE (2019), MOVER-SCORE (2019), ROUGE-1 (2004), ROUGE-2, ROUGE-L, BART-SCORE (2021), BLEU (2002) and METEOR (2005). We apply each metric to calculate sample quality in the preliminary sampling phase of CASF, and the results are reported in Table 4. Sample quality calculated with MOVER-SCORE yields the most correct ranking, which shows that this contextual-embedding-based metric is well suited to estimating sample quality. The traditional metric METEOR ranks second. Full results are in the Appendix.

Phases and Associated Sampling Ratios

We conduct experiments to explore the influence of the number of phases and the sampling ratio of each phase for CASF. Results at the sampling rate of 50% on 16 datasets are shown in Table 3. In average mode, all phases are sampled in equal proportions. In the preliminary-fixed mode, we fix the preliminary sampling ratio, and the batch sampling ratio is divided equally according to the number of iteration phases and the total sampling ratio. Results show that performance is better when the number of iteration phases is 5 in most cases. It is simple and effective to sample each phase according to the total sampling rate and the number of phases.
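The two modes can be written as a small helper, sketched below. The function name is hypothetical, and counting the preliminary phase in the number of phases is an inference from Table 3 (for example, 2 phases with a preliminary ratio of 0.10 give a batch ratio of 0.40 at a total rate of 0.50).

```python
def phase_ratios(total_rate, num_phases, preliminary_ratio=None):
    """Per-phase sampling ratios; num_phases counts the preliminary phase.

    preliminary_ratio=None gives the average mode; a number gives the
    preliminary-fixed mode.
    """
    if preliminary_ratio is None:
        share = total_rate / num_phases
        return [share] * num_phases
    if num_phases < 2:
        raise ValueError("preliminary-fixed mode needs at least one batch phase")
    # The remaining budget is split equally over the batch phases.
    batch = (total_rate - preliminary_ratio) / (num_phases - 1)
    return [preliminary_ratio] + [batch] * (num_phases - 1)

average_mode = phase_ratios(0.5, 5)                         # setting used in the experiments
fixed_mode = phase_ratios(0.5, 3, preliminary_ratio=0.10)   # 'F' rows of Table 3
print(average_mode, fixed_mode)
```

With a total rate of 50% and 5 phases, the average mode gives 10% per phase, which matches the setting reported in the Comparison of Methods section.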

Significant Information Retention Accuracy

Previous work (2022) focused on identifying top-ranked systems, and we further explored giving more accurate overall inter-system rankings and tested the significant information retention accuracy on sample subsets, that is, to test whether the subset can preserve the significance of ranking among systems. Results showed CASF outperforms Random and Heuristic. Details are in the Appendix.

Related Work

Previous works (2014; 2015; 2014; 2016) adopt TrueSkill (2006) to rank NLG methods with pairwise human evaluation. Sakaguchi and Van Durme (2018) introduce a method for system quality estimation from pairwise annotation by human judgment. Hashimoto et al. (2019) propose an evaluation mechanism to calculate a model’s sampling probabilities. Chaganty, Mussman, and Liang (2018) utilize control variates to obtain an unbiased estimator with lower cost than only using human evaluation. Mendonça et al. (2021) adopt online learning to find the best systems for machine translation. Wei et al. (2022) study the power of pairwise direct assessment comparisons. A recent work (2022) introduces Active Evaluation to identify the top-ranked system with fewer pairwise human annotations. There is still a gap in the research on deriving a complete inter-system ranking based on the results of direct human scoring for general NLG tasks. Yates (1948) proposed Systematic Sampling. ILDAE (2022) calculates the difficulty score of the sample and uses a simple sampling method for Natural Language Inference. However, ILDAE is not suitable for NLG since there is no direct confidence value in NLG methods. To the best of our knowledge, this paper is the first work to extensively study the sampling method for direct scoring to get the whole inter-system ranking in NLG human evaluation.

Conclusion

In this paper, we focus on giving a more correct inter-system ranking for reliable human evaluation with limited time and cost. We propose CASF and show that it improves the overall inter-system Kendall correlation by 41%, to 0.83, compared with the widely used random sampling on 44 human evaluation metrics across 16 datasets in 5 NLG tasks. CASF ranks first or second among all comparison methods on 90.91% of the human metrics. We release a tool, and we strongly recommend using CASF in human evaluation to obtain a more reliable inter-system ranking.

Acknowledgements

This work was supported by the National Key R&D Program of China (2021YFF0901502), the National Science Foundation of China (No. 62161160339), the State Key Laboratory of Media Convergence Production Technology and Systems, and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.

References

Abdi et al. (2007) Abdi, H.; et al. 2007. The method of least squares. Encyclopedia of measurement and statistics, 1: 530–532.
Banerjee and Lavie (2005) Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.
Begg and Mazumdar (1994) Begg, C. B.; and Mazumdar, M. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics, 1088–1101.
Bethard (2022) Bethard, S. 2022. We need to talk about random seeds. arXiv preprint arXiv:2210.13393.
Bhandari et al. (2020) Bhandari, M.; Gour, P.; Ashfaq, A.; Liu, P.; and Neubig, G. 2020. Re-evaluating evaluation in text summarization. arXiv preprint arXiv:2010.07100.
Bhatnagar, Ganesh, and Kann (2022) Bhatnagar, R.; Ganesh, A.; and Kann, K. 2022. CHIA: CHoosing Instances to Annotate for Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2022, 7299–7315.
Bojar et al. (2014) Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, 12–58.
Bojar et al. (2015) Bojar, O.; Chatterjee, R.; Federmann, C.; Haddow, B.; Huck, M.; Hokamp, C.; Koehn, P.; Logacheva, V.; Monz, C.; Negri, M.; Post, M.; Scarton, C.; Specia, L.; and Turchi, M. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 1–46. Lisbon, Portugal: Association for Computational Linguistics.
Breiman (1996) Breiman, L. 1996. Bagging predictors. Machine learning, 24(2): 123–140.
Breiman (2001) Breiman, L. 2001. Random forests. Machine learning, 45(1): 5–32.
Card et al. (2020) Card, D.; Henderson, P.; Khandelwal, U.; Jia, R.; Mahowald, K.; and Jurafsky, D. 2020. With Little Power Comes Great Responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9263–9274.
Celikyilmaz, Clark, and Gao (2020) Celikyilmaz, A.; Clark, E.; and Gao, J. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.
Chaganty, Mussman, and Liang (2018) Chaganty, A. T.; Mussman, S.; and Liang, P. 2018. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202.
Cover and Hart (1967) Cover, T.; and Hart, P. 1967. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1): 21–27.
Demšar (2006) Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research, 7: 1–30.
Dušek, Novikova, and Rieser (2018) Dušek, O.; Novikova, J.; and Rieser, V. 2018. Findings of the E2E NLG Challenge. In Proceedings of the 11th International Conference on Natural Language Generation, 322–328. Tilburg University, The Netherlands: Association for Computational Linguistics.
Fabbri et al. (2021) Fabbri, A. R.; Kryściński, W.; McCann, B.; Xiong, C.; Socher, R.; and Radev, D. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9: 391–409.
Freitag et al. (2021) Freitag, M.; Foster, G.; Grangier, D.; Ratnakar, V.; Tan, Q.; and Macherey, W. 2021. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. arXiv:2104.14478.
Freund, Schapire et al. (1996) Freund, Y.; Schapire, R. E.; et al. 1996. Experiments with a new boosting algorithm. In ICML, volume 96, 148–156. Citeseer.
Friedman (2001) Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232.
Gao and Wan (2022) Gao, M.; and Wan, X. 2022. DialSummEval: Revisiting Summarization Evaluation for Dialogues. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5693–5709. Seattle, United States: Association for Computational Linguistics.
Gatt and Krahmer (2018) Gatt, A.; and Krahmer, E. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61: 65–170.
Geurts, Ernst, and Wehenkel (2006) Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely randomized trees. Machine learning, 63(1): 3–42.
Gkatzia and Mahamood (2015) Gkatzia, D.; and Mahamood, S. 2015. A snapshot of NLG evaluation practices 2005-2014. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), 57–60.
Grusky, Naaman, and Artzi (2018) Grusky, M.; Naaman, M.; and Artzi, Y. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.
Guan et al. (2021) Guan, J.; Zhang, Z.; Feng, Z.; Liu, Z.; Ding, W.; Mao, X.; Fan, C.; and Huang, M. 2021. OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics. arXiv:2105.08920.
Hashimoto, Zhang, and Liang (2019) Hashimoto, T. B.; Zhang, H.; and Liang, P. 2019. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792.
Hearst et al. (1998) Hearst, M. A.; Dumais, S. T.; Osuna, E.; Platt, J.; and Scholkopf, B. 1998. Support vector machines. IEEE Intelligent Systems and their applications, 13(4): 18–28.
Herbrich, Minka, and Graepel (2006) Herbrich, R.; Minka, T.; and Graepel, T. 2006. TrueSkill™: a Bayesian skill rating system. Advances in neural information processing systems, 19.
Howcroft et al. (2020) Howcroft, D. M.; Belz, A.; Clinciu, M.-A.; Gkatzia, D.; Hasan, S. A.; Mahamood, S.; Mille, S.; Van Miltenburg, E.; Santhanam, S.; and Rieser, V. 2020. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, 169–182.
Hu et al. (2019) Hu, J. E.; Rudinger, R.; Post, M.; and Van Durme, B. 2019. Parabank: Monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6521–6528.
Kasai et al. (2022) Kasai, J.; Sakaguchi, K.; Dunagan, L.; Morrison, J.; Bras, R. L.; Choi, Y.; and Smith, N. A. 2022. Transparent Human Evaluation for Image Captioning. In Proc. of NAACL.
Kendall (1938) Kendall, M. G. 1938. A new measure of rank correlation. Biometrika, 30(1/2): 81–93.
Kendall (1945) Kendall, M. G. 1945. The treatment of ties in ranking problems. Biometrika, 33(3): 239–251.
Kondrak (2005) Kondrak, G. 2005. N-gram similarity and distance. In International symposium on string processing and information retrieval, 115–126. Springer.
Lin (2004) Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74–81.
Mehri and Eskenazi (2020) Mehri, S.; and Eskenazi, M. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456.
Mendonça et al. (2021) Mendonça, V.; Rei, R.; Coheur, L.; Sardinha, A.; and Santos, A. L. 2021. Online learning meets machine translation evaluation: Finding the best systems with the least human effort. arXiv preprint arXiv:2105.13385.
Mohankumar and Khapra (2022) Mohankumar, A. K.; and Khapra, M. M. 2022. Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8761–8781.
Novikova et al. (2017) Novikova, J.; Dušek, O.; Curry, A. C.; and Rieser, V. 2017. Why we need new evaluation metrics for NLG. arXiv preprint arXiv:1707.06875.
Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
Peyrard (2019) Peyrard, M. 2019. A Simple Theoretical Model of Importance for Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1059–1073.
Quinlan (1986) Quinlan, J. R. 1986. Induction of decision trees. Machine learning, 1(1): 81–106.
Reiter and Belz (2009) Reiter, E.; and Belz, A. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4): 529–558.
Sakaguchi et al. (2016) Sakaguchi, K.; Napoles, C.; Post, M.; and Tetreault, J. 2016. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4: 169–182.
Sakaguchi, Post, and Van Durme (2014) Sakaguchi, K.; Post, M.; and Van Durme, B. 2014. Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 1–11.
Sakaguchi and Van Durme (2018) Sakaguchi, K.; and Van Durme, B. 2018. Efficient online scalar annotation with bounded support. arXiv preprint arXiv:1806.01170.
Seber and Lee (2012) Seber, G. A.; and Lee, A. J. 2012. Linear regression analysis. John Wiley & Sons.
Shi et al. (2022) Shi, Y.; Yang, X.; Xu, H.; Yuan, C.; Li, B.; Hu, W.; and Zha, Z. 2022. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022.
Stiennon et al. (2020) Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; and Christiano, P. F. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33: 3008–3021.
Varshney, Mishra, and Baral (2022) Varshney, N.; Mishra, S.; and Baral, C. 2022. ILDAE: Instance-Level Difficulty Analysis of Evaluation Data. arXiv preprint arXiv:2203.03073.
Völske et al. (2017) Völske, M.; Potthast, M.; Syed, S.; and Stein, B. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, 59–63.
Wan and Xiao (2008) Wan, X.; and Xiao, J. 2008. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 969–976.
Wan and Yang (2007) Wan, X.; and Yang, J. 2007. CollabSum: exploiting multiple document clustering for collaborative single document summarizations. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 143–150.
Wei, Kocmi, and Federmann (2022) Wei, J. T.-Z.; Kocmi, T.; and Federmann, C. 2022. Searching for a higher power in the human evaluation of MT. arXiv preprint arXiv:2210.11612.
Yates (1948) Yates, F. 1948. Systematic sampling. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 241(834): 345–377.
Yuan, Neubig, and Liu (2021) Yuan, W.; Neubig, G.; and Liu, P. 2021. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34: 27263–27277.
Zhang et al. (2019) Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
Zhao et al. (2019) Zhao, W.; Peyrard, M.; Liu, F.; Gao, Y.; Meyer, C. M.; and Eger, S. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622.
Zhou et al. (2022) Zhou, K.; Blodgett, S. L.; Trischler, A.; Daumé III, H.; Suleman, K.; and Olteanu, A. 2022. Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications. arXiv preprint arXiv:2205.06828.

Appendix

Survey

We survey papers that report human evaluation to better understand current sampling practice in manual evaluation. First, we randomly selected 1404 papers from ACL, EMNLP and COLING in the last two years, drawn from paperswithcode.com and the lists of accepted work published by the conferences. We then searched each paper for the keywords ‘human’, ‘manual’ and ‘annotate’ and read the context of each match. We find that 270 papers selected a subset of the test dataset for manual evaluation to save labor and cost. For these papers, we searched for the keyword ‘sample’ to find the sampling method used for human evaluation. Random sampling is the most common sampling method, accounting for 60.7%, and the other 39.3% do not mention the sampling method they used. The number of papers using random sampling and unknown sampling methods in each conference is shown in Figure 5.

Of these 270 papers that conduct human evaluation on a subset, 179 are on NLG tasks, with text summarization the most frequent at 22%. The proportion of these NLG papers on different tasks is shown in Figure 6. The lists of the 1404 papers surveyed and the 270 papers that selected a subset of the test dataset for human evaluation will be released.

According to the results of the survey, there are two major problems in current sampling practice for human evaluation. On the one hand, using random sampling to select samples can lead to unreliable human evaluation results, because different sampled subsets are likely to lead to different inter-system rankings. In addition, not disclosing the list of samples selected by random sampling leads to poor reproducibility of evaluation results. On the other hand, we find that up to 39.3% of the papers do not provide information on human evaluation sampling, which leads to low reliability and low reproducibility of evaluation results. We recommend that future human evaluations report sampling information, including the sampling method and the list of evaluated samples, to standardize the human evaluation process. At the same time, we strongly recommend using our proposed Constrained Active Sampling Framework to sample evaluation subsets, which will make human evaluations more reliable and allow excellent systems to be retained.

Tasks and Datasets

Test Set

In Table 5, we report information on the NLG tasks and related datasets we use as the test set, including the number of human evaluation aspects, the number of NLG systems involved in the corresponding datasets, the sample size of the datasets and the specific human evaluation aspects. The order of the human evaluation metrics in the experiments in this paper follows the order shown in Table 5. As Likert-scale comparisons are the most commonly reported type of evaluation (Card et al. 2020), we focus on Likert-scale datasets. For data preprocessing, we first discard samples that lack any of the system output, the human evaluation score, or the reference text. We then compute the automatic metric scores of the remaining samples.
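The filtering step can be illustrated with a short sketch. The field names `system_output`, `human_score` and `reference` are hypothetical placeholders for whatever keys a given dataset uses; this is not the preprocessing code released with the paper.

```python
# Hypothetical field names; real datasets use their own keys.
REQUIRED_FIELDS = ("system_output", "human_score", "reference")


def drop_incomplete(samples):
    """Keep only samples in which every required field is present and non-empty."""
    return [
        sample for sample in samples
        # A score of 0 is valid, so only None and the empty string count as missing.
        if all(sample.get(field) not in (None, "") for field in REQUIRED_FIELDS)
    ]


raw = [
    {"system_output": "a summary", "human_score": 4, "reference": "gold summary"},
    {"system_output": "a summary", "human_score": None, "reference": "gold summary"},
    {"system_output": "", "human_score": 3, "reference": "gold summary"},
    {"system_output": "another", "human_score": 0, "reference": "gold summary"},
]
clean = drop_incomplete(raw)
```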

Validation Set

We select the automatic metric for the preliminary phase, the regressor for the Learner, the number of phases and the associated sampling ratios on the validation set shown in Table 6. The validation set contains nine datasets and 41 NLG systems from two traditional NLG tasks, namely Data to Text and Paraphrase Generation. Since not all NLG systems on the E2E (Dušek, Novikova, and Rieser 2018) and ParaBank (Hu et al. 2019) datasets have human evaluation scores on the same samples, we divide the ParaBank datasets into subsets. The samples in these subsets have human scores for all systems. For subset selection, we first enumerate all combinations of systems as candidate system subsets and count, for each candidate, the number of samples that have human evaluation scores on every system in the subset. Then, we select the system subsets with more qualifying samples. The final system IDs of the selected subsets are shown in Table 6.
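The subset selection described above can be sketched as an exhaustive search over system combinations. The function name, the data layout (system ID mapped to its scored samples) and the fixed subset size are assumptions made for illustration; the paper does not specify these details.

```python
from itertools import combinations


def best_system_subset(scores, subset_size):
    """Find the system subset with the most samples scored for all its systems.

    scores: dict mapping system ID -> dict mapping sample ID -> human score.
    Returns (systems, shared_sample_ids). Ties keep the first subset found.
    """
    best_subset, best_samples = None, set()
    for subset in combinations(sorted(scores), subset_size):
        # Samples that carry a human score for every system in this subset.
        shared = set.intersection(*(set(scores[system]) for system in subset))
        if len(shared) > len(best_samples):
            best_subset, best_samples = subset, shared
    return best_subset, sorted(best_samples)


# Toy data: three hypothetical systems scored on partly different samples.
toy_scores = {
    "sys_a": {1: 4.0, 2: 3.0, 3: 5.0},
    "sys_b": {2: 2.0, 3: 4.0, 4: 3.0},
    "sys_c": {3: 1.0, 4: 2.0},
}
systems, shared_ids = best_system_subset(toy_scores, subset_size=2)
```

The search is exponential in the number of systems, which is acceptable for the handful of systems in a typical human evaluation dataset but would need pruning for much larger pools.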

Table 5: Description of tasks and datasets with the number of human evaluation metrics (# HE Metrics), number of NLG systems (# Systems), number of samples (# Samples) and the human evaluation aspects (HE Metrics).

Tasks	Datasets	# HE Metrics	# Systems	# Samples	HE Metrics
Summarization	SummEval (Fabbri et al. 2021)			100	coherence, consistency, fluency, relevance
	REALSumm (Bhandari et al. 2020)			100	litepyramid-recall
	NeR18 (Grusky, Naaman, and Artzi 2018)				coherence, fluency, informativeness, relevance
	DialSummEval (Gao and Wan 2022)			100	consistency, relevance, fluency, coherence
	OpenAI-axis1 (Stiennon et al. 2020; Völske et al. 2017)			439	accuracy, coherence, coverage, overall
	OpenAI-axis2 (Stiennon et al. 2020; Völske et al. 2017)			636	accuracy, coherence, coverage, overall
	OpenAI-CNN/DM1 (Stiennon et al. 2020; Völske et al. 2017)			206	accuracy, coherence, coverage, overall
	OpenAI-CNN/DM3 (Stiennon et al. 2020; Völske et al. 2017)			206	accuracy, coherence, coverage, overall
Machine Translation	newstest2020 en-de (Freitag et al. 2021)			1066	MQM, pSQM
	newstest2020 cn-en (Freitag et al. 2021)			1641	MQM, pSQM
	newstest2021 cn-en (Freitag et al. 2021)			147	MQM
Dialogue Generation	Persona Chat (Mehri and Eskenazi 2020)				Understandable, Natural, Maintains Context, Interesting, Uses Knowledge, Overall Quality
Story Generation	MANS-ROC (Guan et al. 2021)			200	overall
	MANS-WP (Guan et al. 2021)			200	overall
Multi-Modal Generation	THUMB-MSCOCO (Kasai et al. 2022)			500	overall
	VATEX-EVAL (Shi et al. 2022)			3000	consistency
Overall			137	8661	

Table 6: Description of tasks and datasets with the number of human evaluation metrics (# HE Metrics), number of NLG systems (# Systems), number of samples (# Samples), human evaluation aspects (HE Metrics) and System ID for the validation set.

Tasks	Datasets	# HE Metrics	# Systems	# Samples	HE Metrics	System ID
Data to Text	E2E (Dušek, Novikova, and Rieser 2018)	1	3	31	naturalness	zhang, gong, tnt2
Paraphrase Generation	ParaBank1 (Hu et al. 2019)	1	4	69	overall	0, 2, 30, 35
	ParaBank2 (Hu et al. 2019)	1	4	62	overall	0, 3, 24, 31
	ParaBank3 (Hu et al. 2019)	1	5	77	overall	4, 0, 6, 30, 35
	ParaBank4 (Hu et al. 2019)	1	5	84	overall	5, 0, 13, 29, 35
	ParaBank5 (Hu et al. 2019)	1	5	90	overall	6, 0, 20, 30, 35
	ParaBank6 (Hu et al. 2019)	1	5	82	overall	7, 0, 6, 29, 35
	ParaBank7 (Hu et al. 2019)	1	5	77	overall	9, 0, 13, 32, 35
	ParaBank8 (Hu et al. 2019)	1	5	64	overall	10, 0, 3, 27, 35
Overall	9	9	41	636	-	-

Learner Selection

Practical Recommendation

We explore the learning stability and accuracy of nine popular statistical machine learning algorithms as the regressor of the Learner. We replace the core algorithm of the Learner in CASF and carry out experiments on 16 datasets across five NLG tasks. The experiments involve 137 NLG systems and 44 human evaluation metrics. The sampling rate is 50%. The experimental results are shown in Table 7. Each experiment is run three times, and the mean and standard deviation of the inter-system ranking accuracy over the three runs are recorded for each regressor. The stability of a regressor can be judged by the recorded standard deviation.

The nine popular statistical machine learning regressors are Linear Regression (Seber and Lee 2012), AdaBoost (Freund, Schapire et al. 1996), Bootstrap aggregating (Bagging) (Breiman 1996), Decision Tree (Quinlan 1986), Extremely Randomized Trees (ExtRaTree) (Geurts, Ernst, and Wehenkel 2006), K-Nearest Neighbor (KNN) (Cover and Hart 1967), Random Forest (Breiman 2001), Support Vector Machine (SVM) (Hearst et al. 1998) and Gradient Boosting Decision Tree (GBDT) (Friedman 2001). We use the implementations of the corresponding regressors in the scikit-learn (sklearn) library.

As for stability, we do not want the proposed sampling method to produce different results each time it is run, as random sampling does. Therefore, we should choose a stable regressor as the core algorithm of the Learner. The results in Table 7 show that linear regression, KNN, SVM and GBDT achieve good stability, with a standard deviation of 0 in inter-system ranking accuracy over the three runs. In terms of inter-system ranking accuracy, GBDT obtains the highest value, reaching 0.83 Kendall’s correlation. Based on these experimental results, we recommend choosing GBDT as the core regression algorithm of the Learner in the proposed CASF.
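The selection rule above (stable first, then most accurate) can be sketched with the standard library. The regressor names and run values below are made up for illustration, and the use of the population standard deviation is an assumption: it is consistent with the ‘Std’ entries in Table 7, where for example three runs of 1, 1 and -1 would give a mean of 0.333 and a deviation of 0.943.

```python
from statistics import mean, pstdev


def summarize_runs(runs_by_regressor):
    """Map each regressor to (mean, population std) of its ranking accuracy."""
    return {name: (mean(runs), pstdev(runs))
            for name, runs in runs_by_regressor.items()}


def pick_regressor(runs_by_regressor, tol=1e-9):
    """Among regressors with (near-)zero std across runs, pick the highest mean."""
    summary = summarize_runs(runs_by_regressor)
    stable = {name: m for name, (m, s) in summary.items() if s <= tol}
    if not stable:
        raise ValueError("no regressor is stable across runs")
    return max(stable, key=stable.get)


# Hypothetical Kendall correlations from three runs per regressor.
example_runs = {
    "stable_good": [0.9, 0.9, 0.9],
    "unstable_best": [1.0, 1.0, 0.8],
    "stable_weak": [0.8, 0.8, 0.8],
}
chosen = pick_regressor(example_runs)
```

Filtering on stability before accuracy reflects the stated goal: a sampler whose output changes between runs reintroduces the variability that motivated replacing random sampling.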

Learner Selection in This Paper

To select the regressor for the Learner in this paper, we conduct similar experiments on the validation set. Experimental results in Table 8 show that GBDT obtains the highest inter-system ranking accuracy and stability, reaching 1.000 Kendall’s correlation with zero fluctuation across runs. Based on these experimental results and analysis, we choose GBDT as the core regression algorithm of the Learner in the proposed CASF in this paper.
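Inter-system ranking accuracy is reported as Kendall’s correlation between the system ranking obtained on the sampled subset and the ranking obtained on the full data. A minimal sketch follows, assuming the tau-a variant without tie correction; the exact variant used in the experiments is not stated in this section, and the scores below are invented.

```python
from itertools import combinations


def kendall_tau(full_scores, subset_scores):
    """Kendall tau-a between two per-system score lists in the same system order."""
    if len(full_scores) != len(subset_scores) or len(full_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(full_scores)), 2):
        sign = (full_scores[i] - full_scores[j]) * (subset_scores[i] - subset_scores[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs count as neither concordant nor discordant.
    n_pairs = len(full_scores) * (len(full_scores) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean human scores of four systems on the full set and on a subset.
full = [3.9, 3.5, 3.1, 2.8]
subset = [3.8, 3.6, 2.7, 2.9]   # the last two systems swap places
tau = kendall_tau(full, subset)  # one discordant pair of six: (5 - 1) / 6
```

With only three systems a single swapped pair already lowers the correlation to 1/3, which is why small datasets in the tables show coarse values such as 0.333 and 1.000.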

Table 7: Experiment for the practical recommendation on selecting the core regression method for the Learner. ‘Mean’ represents the average inter-system ranking accuracy across three runs and ‘Std’ represents the standard deviation over the three runs. ‘HE Metric’ indicates different human evaluation aspects in a dataset. Bold number indicates that the regressor ranks first among all regressors under the corresponding human evaluation metric. Underlined number indicates that the regressor ranks second among all regressors for the corresponding human evaluation metric.

Task	Dataset	HE Metric	Linear		AdaBoost		Bagging		DecisionTree		ExtRaTree		KNN		Random Forest		SVM		GBDT
Task	Dataset	HE Metric	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
SUM	SummEval	coherence	0.617	0.000	0.628	0.150	0.583	0.191	0.622	0.136	0.672	0.198	0.650	0.000	0.522	0.142	0.717	0.000	0.950	0.000
		consistency	0.600	0.000	0.567	0.054	0.478	0.021	0.494	0.093	0.494	0.087	0.567	0.000	0.472	0.123	0.117	0.000	0.533	0.000
		fluency	0.467	0.000	0.406	0.034	0.300	0.072	0.311	0.034	0.256	0.165	0.200	0.000	0.378	0.021	0.367	0.000	0.333	0.000
		relevance	0.750	0.000	0.472	0.122	0.450	0.072	0.611	0.244	0.461	0.162	0.817	0.000	0.567	0.167	0.383	0.000	0.817	0.000
	REALSumm	litepyramid	0.399	0.000	0.430	0.109	0.394	0.021	0.333	0.119	0.403	0.054	0.601	0.000	0.529	0.041	0.464	0.000	0.543	0.000
	NeR18	coherence	1.000	0.000	0.841	0.224	1.000	0.000	1.000	0.000	0.968	0.045	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
		fluency	1.000	0.000	0.841	0.224	0.968	0.045	0.968	0.045	0.968	0.045	1.000	0.000	0.619	0.206	1.000	0.000	1.000	0.000
		informativeness	0.714	0.000	1.000	0.000	0.905	0.135	1.000	0.000	0.873	0.119	1.000	0.000	0.873	0.119	0.714	0.000	1.000	0.000
		relevance	0.905	0.000	0.968	0.045	0.905	0.000	0.651	0.250	1.000	0.000	0.905	0.000	0.778	0.250	0.619	0.000	1.000	0.000
	DialSummEval	consistency	0.513	0.000	0.632	0.074	0.726	0.032	0.752	0.024	0.675	0.151	0.872	0.000	0.547	0.044	0.615	0.000	0.769	0.000
		relevance	0.564	0.000	0.624	0.212	0.744	0.126	0.607	0.099	0.667	0.126	0.564	0.000	0.496	0.024	0.538	0.000	0.718	0.000
		fluency	0.897	0.000	0.598	0.157	0.538	0.117	0.735	0.128	0.504	0.281	0.385	0.000	0.658	0.103	0.744	0.000	0.615	0.000
		coherence	0.795	0.000	0.684	0.067	0.590	0.091	0.786	0.067	0.632	0.154	0.641	0.000	0.556	0.053	0.846	0.000	0.897	0.000
	OpenAI-axis1	accuracy		0.000	0.400	0.432	0.400	0.283	0.267	0.377	0.333	0.340		0.000	0.400	0.432	0.200	0.000	1.000	0.000
		coherence	0.400	0.000	0.467	0.249	0.400	0.283	0.467	0.249	0.400	0.283		0.000	0.667	0.340	0.800	0.000	0.800	0.000
		coverage	1.000	0.000	0.933	0.094	1.000	0.000	1.000	0.000	0.933	0.094	1.000	0.000	1.000	0.000	0.800	0.000	0.800	0.000
		overall	1.000	0.000	0.933	0.094	0.867	0.094	1.000	0.000	0.933	0.094	1.000	0.000	0.933	0.094	0.800	0.000	1.000	0.000
	OpenAI-axis2	accuracy	0.714	0.000	0.873	0.119	0.397	0.196	0.810	0.269	0.810	0.206	0.619	0.000	0.556	0.119	0.714	0.000	0.905	0.000
		coherence	0.905	0.000	0.270	0.119	0.524	0.156	0.587	0.314	0.492	0.273	0.238	0.000	0.397	0.119	0.429	0.000	0.429	0.000
		coverage	1.000	0.000	0.968	0.045	0.905	0.135	1.000	0.000	0.968	0.045	0.619	0.000	0.968	0.045	1.000	0.000	1.000	0.000
		overall	0.905	0.000	0.968	0.045	0.841	0.162	0.873	0.119	0.873	0.180	0.714	0.000	0.841	0.162	1.000	0.000	1.000	0.000
	OpenAI-CNN/DM1	accuracy	0.956	0.000	0.822	0.131	0.896	0.147	0.896	0.147	0.807	0.042	0.644	0.000	0.837	0.091	0.956	0.000	0.867	0.000
		coherence	0.822	0.000	0.556	0.181	0.407	0.137	0.837	0.171	0.407	0.302	0.600	0.000	0.333	0.063	0.244	0.000	0.600	0.000
		coverage	1.000	0.000	0.837	0.230	0.911	0.063	0.956	0.063	0.911	0.063	0.867	0.000	0.748	0.267	1.000	0.000	0.867	0.000
		overall	1.000	0.000	0.837	0.230	0.630	0.168	0.630	0.267	0.822	0.063	0.511	0.000	0.837	0.230	1.000	0.000	1.000	0.000
	OpenAI-CNN/DM3	accuracy	1.000	0.000	1.000	0.000	0.556	0.314	0.778	0.314	0.556	0.314	1.000	0.000	0.778	0.314	1.000	0.000	1.000	0.000
		coherence	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
		coverage	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	0.778	0.314	1.000	0.000	1.000	0.000
		overall	1.000	0.000	0.556	0.314	1.000	0.000	0.778	0.314	0.778	0.314	1.000	0.000	0.556	0.314	1.000	0.000	1.000	0.000
MT	newstest2020 en-de	MQM	0.714	0.000	0.556	0.324	0.524	0.339	0.429	0.467	0.746	0.359	0.333	0.000	0.492	0.367	1.000	0.000	0.143	0.000
		pSQM	1.000	0.000	0.968	0.045	0.937	0.045	0.683	0.384	1.000	0.000	1.000	0.000	0.937	0.045	0.905	0.000	1.000	0.000
	newstest2020 cn-en	MQM	1.000	0.000	0.619	0.243	0.714	0.404	0.667	0.269	0.476	0.332	0.286	0.000	0.452	0.388	0.214	0.000	0.929	0.000
		pSQM	0.786	0.000	0.262	0.410	0.690	0.243	0.476	0.221	0.548	0.321	0.929	0.000	0.524	0.337	0.071	0.000	0.786	0.000
	newstest2021 cn-en	MQM	0.026	0.000	0.368	0.119	0.376	0.094	0.120	0.067	0.017	0.250		0.000	-0.060	0.169	-0.077	0.000	0.026	0.000
DialoGen	Persona Chat	Understandable	1.000	0.000	0.778	0.314	0.556	0.314	0.778	0.314	0.333	0.943	1.000	0.000	1.000	0.000	0.333	0.000	0.333	0.000
		Natural	1.000	0.000	0.111	0.629	0.333	0.544	0.111	0.831	1.000	0.000	1.000	0.000	0.556	0.314	0.333	0.000	1.000	0.000
		Maintains Context	1.000	0.000	0.333	0.943	1.000	0.000	0.333	0.943	1.000	0.000	1.000	0.000	0.333	0.943	1.000	0.000	1.000	0.000
		Interesting	1.000	0.000	1.000	0.000	0.778	0.314	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
		Uses Knowledge	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
		Overall Quality	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
StoryGen	MANS-ROC	overall	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	MANS-WP	overall	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
MMGen	THUMB-MSCOCO	overall	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	VATEX-EVAL	consistency	1.000	0.000	0.867	0.189	0.867	0.189	0.867	0.189	0.733	0.189	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
Overall Performance			0.828	0.000	0.727	0.158	0.729	0.126	0.732	0.171	0.738	0.150	0.740	0.000	0.701	0.154	0.724	0.000	0.833	0.000

Table 8: Experiment for selecting the core regression method for the Learner on the validation set. ‘Mean’ represents the average inter-system ranking accuracy across three runs and ‘Std’ represents the standard deviation over the three runs. ‘HE Metric’ indicates different human evaluation aspects in a dataset. Bold number indicates that the regressor ranks first among all regressors under the corresponding human evaluation metric. Underlined number indicates that the regressor ranks second among all regressors for the corresponding human evaluation metric.

Task	Dataset	HE Metric	Linear		AdaBoost		Bagging		DecisionTree		ExtraTree		KNN		Random Forest		SVM		GBDT
Task	Dataset	HE Metric	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std	Mean	Std
Data to Text	E2E	naturalness	1.000	0.000	0.556	0.629	0.556	0.629	-0.111	0.314	0.333	0.544	1.000	0.000	0.111	0.629	1.000	0.000	1.000	0.000
Paraphrase Generation	ParaBank1	overall	0.667	0.000	0.333	0.471	0.333	0.272	0.333	0.471	0.444	0.416	0.667	0.000	0.889	0.157	0.000	0.000	1.000	0.000
	ParaBank2	overall	1.000	0.000	1.000	0.000	0.667	0.471	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	ParaBank3	overall	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	0.667	0.471	0.000	0.000	1.000	0.000
	ParaBank4	overall	0.667	0.000	0.556	0.416	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	0.889	0.157	1.000	0.000	1.000	0.000
	ParaBank5	overall	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	ParaBank6	overall	1.000	0.000	1.000	0.000	0.889	0.157	0.889	0.157	1.000	0.000	1.000	0.000	0.667	0.471	1.000	0.000	1.000	0.000
	ParaBank7	overall	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	ParaBank8	overall	0.000	0.000	0.667	0.471	0.889	0.157	0.667	0.471	0.556	0.314	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
Overall Performance			0.815	0.000	0.790	0.221	0.815	0.187	0.753	0.157	0.815	0.142	0.963	0.000	0.802	0.210	0.778	0.000	1.000	0.000

Automatic Metrics for Preliminary Sampling Phase

Practical Recommendation

By replacing the automated metrics of the proposed CASF in the preliminary sampling phase, we explore which automated metrics are more suitable for measuring sample quality in the preliminary sampling phase. We calculate metrics in the selected NLG automatic metric set by using the official provided code. Full experiment results of the proposed Constrained Active Sampling Framework on 44 human evaluation metrics from 5 NLG tasks pre-ranking on different automatic metrics are shown in Table 10. We can learn from Table 10 that MOVER-SCORE (Zhao et al. 2019) ranks first in the whole inter-system ranking accuracy of 64% human evaluation metrics. In addition, MOVER-SCORE ranked first in the overall inter-system ranking accuracy of 16 datasets, so we recommend using MOVER-SCORE as the calculation method of sample quality in the preliminary phase.

The top-ranked system recognition accuracy in Table 9 shows that MOVER-SCORE achieves the best recognition performance on summarization, dialogue generation, story generation and multi-modal generation, and the second-best performance among the 8 automatic metrics on machine translation. MOVER-SCORE attains an average top-ranked system recognition accuracy of 93.18% across all 44 human evaluation metrics from the 16 datasets, involving 137 NLG systems. These results further indicate that MOVER-SCORE is the more suitable measure of sample quality in the preliminary sampling phase of CASF.

Table 9: Top-ranked system recognition accuracy of the proposed CASF on NLG tasks when pre-ranking on different automatic metrics. ‘Overall’ represents the average result on all human evaluation metrics from all tasks. Bold number indicates that the automatic metric ranks first among all automatic metrics. Underlined number indicates that the automatic metric ranks second among all automatic metrics.

Automatic Metric	SUM	MT	DialoGen	StoryGen	MMGen	Overall
BERT-SCORE	0.8276	0.8000	0.8333	1.0000	1.0000	0.8409
MOVER-SCORE	0.9310	0.8000	1.0000	1.0000	1.0000	0.9318
ROUGE-1	0.7586	1.0000	0.8333	1.0000	1.0000	0.8182
ROUGE-2	0.7931	1.0000	0.8333	1.0000	1.0000	0.8409
ROUGE-L	0.7241	1.0000	0.8333	1.0000	1.0000	0.7955
BART-SCORE	0.8621	1.0000	1.0000	0.5000	1.0000	0.8864
BLEU	0.8621	1.0000	0.8333	1.0000	1.0000	0.8864
METEOR	0.7931	1.0000	0.8333	1.0000	1.0000	0.8409
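The Overall column of Table 9 can be reproduced as an average of the per-task accuracies weighted by the number of human evaluation metrics in each task. The sketch below checks this for two rows. The per-task counts (29, 5, 6, 2, 2) are read off the rows of Table 10, and treating the Overall value as this weighted average is an assumption rather than a stated definition.

```python
# Per-task top-ranked accuracy for two automatic metrics, copied from Table 9.
table9 = {
    "MOVER-SCORE": {"SUM": 0.9310, "MT": 0.8000, "DialoGen": 1.0000,
                    "StoryGen": 1.0000, "MMGen": 1.0000},
    "BERT-SCORE": {"SUM": 0.8276, "MT": 0.8000, "DialoGen": 0.8333,
                   "StoryGen": 1.0000, "MMGen": 1.0000},
}

# Number of human evaluation metrics per task, counted from the rows of Table 10.
n_metrics = {"SUM": 29, "MT": 5, "DialoGen": 6, "StoryGen": 2, "MMGen": 2}


def overall_accuracy(per_task):
    """Average of per-task accuracies weighted by the number of metrics per task."""
    total = sum(n_metrics.values())  # 44 human evaluation metrics in total
    return sum(per_task[task] * n_metrics[task] for task in n_metrics) / total


overall = {name: round(overall_accuracy(acc), 4) for name, acc in table9.items()}
print(overall)  # {'MOVER-SCORE': 0.9318, 'BERT-SCORE': 0.8409}
```

Under this reading, 0.9318 corresponds to the top-ranked system being recognized on 41 of the 44 human evaluation metrics.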

Table 10: Experiment results of the proposed Constrained Active Sampling Framework (CASF) when pre-ranking on different automatic metrics. ‘HE Metric’ indicates the human evaluation aspect within a dataset. Bold number indicates that the automatic metric ranks first among all automatic metrics under the corresponding human evaluation metric. Underlined number indicates that the automatic metric ranks second among all automatic metrics for the corresponding human evaluation metric.

Task	Dataset	HE Metric	BERT-SCORE	MOVER-SCORE	ROUGE-1	ROUGE-2	ROUGE-L	BART-SCORE	BLEU	METEOR
Summarization	SummEval	coherence	0.5333	0.9500	0.7333	0.7000	0.6333	0.1667	0.8167	0.5500
		consistency	-0.0167	0.5333	0.5833	0.3500	0.6500	0.4833	0.2833	0.5667
		fluency	0.2500	0.3333	0.5833	0.2667	0.3500	0.2500	0.0500	0.1667
		relevance	0.6167	0.8167	0.7000	0.3833	0.5333	0.3833	0.6833	0.6500
	REALSumm	litepyramid	0.4928	0.5435	0.3333	0.4493	0.5507	0.3841	0.5217	0.4493
	NeR18	coherence	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
		fluency	1.0000	1.0000	1.0000	0.5238	1.0000	0.9048	1.0000	1.0000
		informativeness	0.7143	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
		relevance	1.0000	1.0000	0.9048	0.9048	0.9048	0.4286	1.0000	1.0000
	DialSummEval	consistency	0.7179	0.7692	0.5128	0.6667	0.4615	0.6923	0.6923	0.8974
		relevance	0.5385	0.7179	0.4359	0.6923	0.6154	0.8718	0.4872	0.6923
		fluency	0.6410	0.6154	0.8718	0.5641	0.3846	0.5385	0.5897	0.4359
		coherence	0.6154	0.8974	0.5897	0.5128	0.6923	0.5128	0.3846	0.5897
	OpenAI-axis1	accuracy	0.0000	1.0000	0.2000	0.0000	1.0000	0.0000	0.0000	1.0000
		coherence	0.8000	0.8000	0.4000	0.2000	0.0000	0.0000	0.2000	1.0000
		coverage	1.0000	0.8000	1.0000	1.0000	1.0000	1.0000	1.0000	0.8000
		overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	0.8000
	OpenAI-axis2	accuracy	0.6190	0.9048	0.6190	1.0000	0.4286	0.6190	0.7143	0.7143
		coherence	0.7143	0.4286	0.3333	0.6190	0.7143	0.0476	1.0000	0.2381
		coverage	1.0000	1.0000	1.0000	0.9048	0.9048	0.7143	1.0000	0.6190
		overall	1.0000	1.0000	0.9048	1.0000	1.0000	0.7143	1.0000	1.0000
	OpenAI-CNN/DM1	accuracy	0.9556	0.8667	0.7778	1.0000	0.7778	0.6889	0.7778	1.0000
		coherence	0.9556	0.6000	0.5111	1.0000	0.6000	0.0667	0.2889	0.5111
		coverage	0.8667	0.8667	1.0000	1.0000	1.0000	0.6444	0.8667	1.0000
		overall	1.0000	1.0000	0.8667	1.0000	1.0000	1.0000	0.5111	1.0000
	OpenAI-CNN/DM3	accuracy	0.3333	1.0000	1.0000	0.3333	0.3333	0.3333	1.0000	1.0000
		coherence	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
		coverage	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
		overall	1.0000	1.0000	0.3333	1.0000	0.3333	1.0000	1.0000	1.0000
MT	newstest2020 en-de	MQM	1.0000	0.1429	1.0000	0.1429	0.1429	0.3333	0.3333	0.3333
	newstest2020 en-de	pSQM	0.9048	1.0000	0.9048	1.0000	0.9048	0.9048	1.0000	0.9048
	newstest2020 cn-en	MQM	0.7857	0.9286	0.2143	0.7143	0.6429	0.1429	0.2143	0.7143
	newstest2020 cn-en	pSQM	0.2857	0.7857	0.5000	0.7857	0.7857	0.2143	0.2857	0.2143
	newstest2021 cn-en	MQM	-0.0769	0.0256	0.2308	0.1026	0.1282	0.5897	0.0256	0.5128
Dialogue Generation	Persona Chat	Understandable	1.0000	0.3333	-1.0000	0.3333	1.0000	0.3333	0.3333	1.0000
		Natural	-0.3333	1.0000	1.0000	-0.3333	0.3333	1.0000	-1.0000	0.3333
		Maintains Context	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
		Interesting	0.3333	1.0000	1.0000	0.3333	1.0000	1.0000	1.0000	1.0000
		Uses Knowledge	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
		Overall Quality	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
Story Generation	MANS-ROC	overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
Story Generation	MANS-WP	overall	1.0000	1.0000	-0.4000	1.0000	1.0000	0.8000	1.0000	1.0000
Multi-Modal Generation	THUMB-MSCOCO	overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
Multi-Modal Generation	VATEX-EVAL	overall	1.0000	1.0000	1.0000	0.6000	1.0000	0.6000	0.6000	1.0000
Overall			0.7329	0.8332	0.6965	0.6989	0.7456	0.6446	0.6741	0.7885
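The share of human evaluation metrics on which an automatic metric ranks first can be obtained by counting, row by row, which metrics attain the row maximum. The sketch below does this for the four SummEval rows of Table 10 only, and it assumes that tied maxima all count as first place. Applying the same tie-inclusive count to all 44 rows gives 28 first places for MOVER-SCORE, which is about 64%.

```python
METRICS = ["BERT-SCORE", "MOVER-SCORE", "ROUGE-1", "ROUGE-2",
           "ROUGE-L", "BART-SCORE", "BLEU", "METEOR"]

# The four SummEval rows of Table 10 (Kendall correlation per automatic metric).
summeval_rows = {
    "coherence":   [0.5333, 0.9500, 0.7333, 0.7000, 0.6333, 0.1667, 0.8167, 0.5500],
    "consistency": [-0.0167, 0.5333, 0.5833, 0.3500, 0.6500, 0.4833, 0.2833, 0.5667],
    "fluency":     [0.2500, 0.3333, 0.5833, 0.2667, 0.3500, 0.2500, 0.0500, 0.1667],
    "relevance":   [0.6167, 0.8167, 0.7000, 0.3833, 0.5333, 0.3833, 0.6833, 0.6500],
}


def first_place_counts(rows):
    """Count, per automatic metric, the rows in which it attains the row maximum."""
    counts = {metric: 0 for metric in METRICS}
    for values in rows.values():
        best = max(values)
        for metric, value in zip(METRICS, values):
            if value == best:  # ties all count as first place (assumption)
                counts[metric] += 1
    return counts


counts = first_place_counts(summeval_rows)
print(counts["MOVER-SCORE"], "of", len(summeval_rows))  # 2 of 4
```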

Table 11: Experiment results of CASF when pre-ranking on different automatic metrics on the validation set. ‘HE Metric’ indicates the human evaluation aspect within a dataset. Bold number indicates that the automatic metric ranks first among all automatic metrics under the corresponding human evaluation metric. Underlined number indicates that the automatic metric ranks second among all automatic metrics for the corresponding human evaluation metric.

Task	Dataset	HE Metric	BERT-SCORE	MOVER-SCORE	ROUGE-1	ROUGE-2	ROUGE-L	BART-SCORE	BLEU	METEOR
Data to Text	E2E	naturalness	0.3333	1.0000	0.3333	1.0000	1.0000	-0.3333	1.0000	-0.3333
Paraphrase Generation	ParaBank1	overall	0.0000	1.0000	1.0000	0.6667	1.0000	1.0000	0.0000	1.0000
	ParaBank2	overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
	ParaBank3	overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
	ParaBank4	overall	1.0000	1.0000	1.0000	1.0000	0.6667	1.0000	0.6667	0.6667
	ParaBank5	overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
	ParaBank6	overall	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
	ParaBank7	overall	1.0000	1.0000	1.0000	1.0000	0.0000	1.0000	1.0000	1.0000
	ParaBank8	overall	1.0000	1.0000	1.0000	1.0000	0.6667	0.0000	1.0000	1.0000
Overall			0.8148	1.0000	0.9259	0.9630	0.8148	0.7407	0.8519	0.8148

Automatic Metric Selection in This Paper

We conduct a similar experiment on the validation set to select the automatic metric for the preliminary phase of CASF in this paper. According to the results in Table 11, MOVER-SCORE achieves the highest overall inter-system ranking accuracy (1.0000) on the validation set, which shows that it is capable of measuring sample quality in the preliminary phase. We therefore select MOVER-SCORE as the automatic metric for the preliminary phase of CASF in this paper.

Different Sampling Ratio

Table 12 shows the inter-system ranking accuracy under different sampling ratios. The full results of sampling half of the dataset are in Table 1. CASF achieves the best overall inter-system ranking accuracy among the three sampling methods at every sampling ratio, with an average gap of 0.1133 Kendall correlation over random sampling, while also solving the problems of clustered selection and data manipulation in human evaluation. We also observe that the inter-system ranking accuracy occasionally decreases as the sampling ratio increases (for example, between the sampling ratios of 70% and 80%). This phenomenon may occur because some samples do not contribute to the overall effect or even have a negative effect; such a negative correlation is sometimes taken as a sign of publication bias (Begg and Mazumdar 1994; Card et al. 2020). Overall, the inter-system ranking accuracy increases with the sampling ratio.

Table 12: Experiment results of Random Sampling, Heuristic Sampling and Constrained Active Sampling Framework (CASF) with different sampling ratios on 16 datasets. The inter-system ranking accuracy recorded for each dataset is the average score over all human evaluation aspects of that dataset. Bold number indicates that the sampling method ranks first among all sampling methods under the corresponding NLG dataset. Random and Heuristic are performed 3 times and the mean results are recorded.

Task	Dataset	Method	90%	80%	70%	60%	40%	30%	20%	10%
SUM	SummEval	Random	0.6167	0.6236	0.5847	0.4625	0.4306	0.3306	0.3403	0.1097
		Heuristic	0.6403	0.5792	0.5403	0.5306	0.4042	0.3611	0.3069	0.0625
		CASF (ours)	0.5833	0.6417	0.5875	0.8167	0.5000	0.4917	0.4833	-0.0708
	REALSumm	Random	0.6739	0.5580	0.4928	0.5338	0.2826	0.2874	0.3696	0.0411
		Heuristic	0.7657	0.6715	0.4517	0.4324	0.2874	0.3382	0.2923	0.0242
		CASF (ours)	0.9565	0.7391	0.5797	0.5870	0.3116	0.4275	0.4565	0.1739
	NeR18	Random	0.9762	1.0000	0.9286	0.9524	0.8492	0.6667	0.5714	0.3810
		Heuristic	0.9524	0.9444	0.9762	0.8810	0.6746	0.5635	0.3016	0.1667
		CASF (ours)	1.0000	1.0000	1.0000	1.0000	0.9762	0.9524	0.2619	0.8095
	DialSummEval	Random	0.7350	0.6966	0.6880	0.5940	0.5919	0.4679	0.4038	0.4423
		Heuristic	0.7714	0.6688	0.5641	0.6688	0.6261	0.5278	0.4103	0.4637
		CASF (ours)	0.8654	0.7308	0.6154	0.7115	0.7372	0.7564	0.4103	0.4872
	OpenAI-axis1	Random	0.6000	0.6833	0.7000	0.7167	0.5500	0.7500	0.7500	0.6000
		Heuristic	0.7333	0.7500	0.6500	0.7667	0.6333	0.5500	0.4667	0.6333
		CASF (ours)	0.7500	0.7500	1.0000	0.9000	0.9500	0.9500	0.9500	0.6500
	OpenAI-axis2	Random	0.7540	0.7540	0.7698	0.5794	0.6270	0.5952	0.5397	0.3492
		Heuristic	0.8095	0.6746	0.6746	0.6746	0.6349	0.6905	0.4127	0.3968
		CASF (ours)	0.8095	0.8571	0.9524	0.7381	0.7381	0.9524	0.9286	0.5000
	OpenAI-CNN/DM1	Random	0.9667	0.8444	0.8111	0.8259	0.5926	0.6889	0.4741	0.6259
		Heuristic	0.8519	0.7593	0.7111	0.7185	0.8185	0.6741	0.5667	0.4444
		CASF (ours)	0.8778	0.8778	0.7667	0.8444	0.8222	0.8222	0.7556	0.7333
	OpenAI-CNN/DM3	Random	1.0000	0.9444	0.9444	0.9444	0.8333	0.6667	0.5556	0.4444
		Heuristic	0.9444	0.8889	0.9444	0.8889	0.8333	0.6667	0.5000	0.2222
		CASF (ours)	1.0000	1.0000	1.0000	1.0000	1.0000	0.6667	0.6667	0.5000
MT	newstest2020 en-de	Random	0.8571	0.8889	0.6984	0.7460	0.6349	0.6032	0.5714	0.0000
		Heuristic	0.7302	0.7778	0.7937	0.6032	0.4921	0.1587	0.4921	0.1587
		CASF (ours)	1.0000	1.0000	1.0000	0.6190	1.0000	0.2381	0.6190	0.1429
	newstest2020 cn-en	Random	0.2500	0.5952	0.3452	0.3333	0.4167	0.1667	0.3810	0.0000
		Heuristic	0.7024	0.4643	0.5000	0.4405	0.3929	-0.0595	0.2500	-0.1190
		CASF (ours)	0.8929	0.8929	0.8571	0.6071	-0.0855	0.2857	0.5000	0.5000
	newstest2021 cn-en	Random	0.4103	0.2051	0.2222	0.2564	-0.0342	-0.1624	-0.1880	-0.1966
		Heuristic	0.0855	0.1282	0.1282	0.1966	-0.0342	-0.1197	-0.0085	-0.1624
		CASF (ours)	0.1026	0.0769	0.1282	0.4615	0.0769	-0.0513	-0.0513	-0.1026
DialoGen	Persona Chat	Random	0.9259	0.7037	0.7037	0.8148	0.4444	0.1852	0.0741	-0.0370
		Heuristic	0.8519	0.7778	0.8148	0.6667	0.3333	0.4444	0.1481	-0.2593
		CASF (ours)	1.0000	0.8889	1.0000	0.8889	0.6667	0.6667	0.8889	0.4444
StoryGen	MANS-ROC	Random	1.0000	1.0000	1.0000	1.0000	0.9333	1.0000	0.6000	-0.5333
		Heuristic	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	0.4000	-0.3333
		CASF (ours)	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	0.8000	-0.6000
	MANS-WP	Random	1.0000	1.0000	1.0000	0.9333	0.9333	1.0000	0.5333	-0.2667
		Heuristic	1.0000	1.0000	1.0000	1.0000	1.0000	0.3333	1.0000	-0.3333
		CASF (ours)	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	0.0000
MMGen	THUMB-MSCOCO	Random	1.0000	1.0000	1.0000	1.0000	0.9333	1.0000	1.0000	0.9333
		Heuristic	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	0.8000
		CASF (ours)	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
	VATEX-EVAL	Random	1.0000	1.0000	0.8667	0.8667	0.7333	0.5111	0.6000	0.6444
		Heuristic	1.0000	1.0000	0.8667	0.8667	0.7333	0.4667	0.5111	0.2889
		CASF (ours)	1.0000	1.0000	1.0000	1.0000	0.4667	0.6000	1.0000	1.0000
Overall Performance		Random	0.7979	0.7811	0.7347	0.7225	0.6095	0.5473	0.4735	0.2211
		Heuristic	0.8024	0.7553	0.7260	0.7084	0.6144	0.4747	0.4406	0.1534
		CASF (ours)	0.8649	0.8409	0.8429	0.8234	0.6975	0.6724	0.6668	0.3855
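The average gap between CASF and random sampling, and the ratios at which accuracy does not increase with the sampling ratio, can be checked directly from the Overall Performance rows of Table 12. The sketch below is a worked calculation on those rows only; the variable names are illustrative.

```python
# Overall Performance rows of Table 12, ordered from the 90% to the 10% sampling ratio.
ratios = ["90%", "80%", "70%", "60%", "40%", "30%", "20%", "10%"]
random_overall = [0.7979, 0.7811, 0.7347, 0.7225, 0.6095, 0.5473, 0.4735, 0.2211]
casf_overall = [0.8649, 0.8409, 0.8429, 0.8234, 0.6975, 0.6724, 0.6668, 0.3855]

# Per-ratio gap between CASF and random sampling, then the mean over the 8 ratios.
gaps = [casf - rand for casf, rand in zip(casf_overall, random_overall)]
mean_gap = sum(gaps) / len(gaps)
print(round(mean_gap, 4))  # 0.1133

# Adjacent ratio pairs where the larger ratio yields the lower CASF accuracy.
inversions = [
    (ratios[i], ratios[i + 1])
    for i in range(len(ratios) - 1)
    if casf_overall[i] < casf_overall[i + 1]
]
print(inversions)  # [('80%', '70%')]
```

The only inversion in the overall CASF row is between 80% (0.8409) and 70% (0.8429), and it is small compared with the overall upward trend.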

Different Sampling Size

We treat the sample size as an independent variable and conduct additional experiments. Table 13 reports the results for different sample sizes, with Kendall’s Tau as the inter-system ranking accuracy metric. Both Random and Heuristic were run 100 times, and their average inter-system ranking accuracies are recorded as Random Mean and Heuristic Mean in Table 13. We also randomly selected three individual runs of Random and Heuristic and list them in Table 13. The results show that different runs of random or heuristic sampling can yield different inter-system ranking accuracy, and that CASF outperforms the popular Random and Heuristic sampling methods at typical sample sizes. We conduct experiments only on datasets whose population size is larger than the sample size; the number of tasks (# Task), datasets (# Dataset), human evaluation aspects (# HE Metric), and systems (# System) involved for each sample size are shown in Table 14.

Table 13: Experimental results of different sampling sizes.

Sample Size	50	100	150	200	250	300
Random 1	0.6847	0.7478	0.5938	0.7758	0.5595	0.6639
Random 2	0.5838	0.6648	0.7547	0.6905	0.7105	0.6755
Random 3	0.6012	0.7346	0.7258	0.8062	0.6537	0.5058
Random Mean	0.6496	0.7478	0.7167	0.7806	0.6596	0.6736
Heuristic 1	0.6460	0.7192	0.6210	0.7768	0.5432	0.6935
Heuristic 2	0.5434	0.6716	0.7542	0.7821	0.7575	0.6112
Heuristic 3	0.6599	0.7490	0.7058	0.6401	0.6435	0.5432
Heuristic Mean	0.6476	0.7497	0.7137	0.7712	0.6412	0.6725
CASF (Ours)	0.7156	0.7514	0.7757	0.8264	0.7706	0.7010

Table 14: The number of tasks (# Task), datasets (# Dataset), human evaluation aspects (# HE Metric), and systems (# System) involved for each sample size.

Sample Size	50	100	150	200	250	300
# Task	5	4	4	4	3	3
# Dataset	16	14	10	10	6	6
# HE Metric	44	34	24	24	14	14
# System	137	127	61	61	38	38
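The run-to-run variation of random sampling reported in Table 13 can be illustrated with a small simulation. The sketch below draws random subsets of a fixed size repeatedly, ranks systems by their mean score on each subset, and records Kendall's tau against the full-data ranking. The per-sample scores are synthetic (Gaussian noise around hypothetical system qualities), the tau-a variant is assumed, and the function names are illustrative rather than taken from the released code.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two score lists (no tie correction)."""
    pairs = list(combinations(range(len(x)), 2))
    signs = [(x[i] - x[j]) * (y[i] - y[j]) for i, j in pairs]
    return (sum(s > 0 for s in signs) - sum(s < 0 for s in signs)) / len(pairs)


def system_means(scores, indices):
    """Mean score of every system over the given sample indices."""
    return [sum(system[i] for i in indices) / len(indices) for system in scores]


def random_sampling_taus(scores, sample_size, runs, seed=0):
    """Kendall's tau of each random-sampling run against the full-data ranking."""
    rng = random.Random(seed)
    n_samples = len(scores[0])
    full = system_means(scores, range(n_samples))
    taus = []
    for _ in range(runs):
        subset = rng.sample(range(n_samples), sample_size)
        taus.append(kendall_tau(full, system_means(scores, subset)))
    return taus


# Synthetic data: 4 systems with close hypothetical qualities, 500 samples each.
data_rng = random.Random(42)
true_quality = [3.0, 3.1, 3.2, 3.3]
scores = [[data_rng.gauss(q, 1.0) for _ in range(500)] for q in true_quality]

taus = random_sampling_taus(scores, sample_size=50, runs=100)
mean_tau = sum(taus) / len(taus)
print(f"min={min(taus):.3f} mean={mean_tau:.3f} max={max(taus):.3f}")
```

Reporting the mean over many runs, as Random Mean and Heuristic Mean do in Table 13, summarizes this spread, whereas any single run may land well above or below it.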

Significant Information Retention Accuracy

We use the Wilcoxon signed-rank test (Demšar 2006) to evaluate how well each method identifies statistically significant differences between systems on the test set. When sampling 50% of the dataset, the overall significant information retention accuracy of CASF, Random Sampling and Heuristic Sampling (the latter two iterated 10000 times) over the 44 aspects is 0.6030, 0.5992 and 0.5976 at the $p=0.05$ level, and 0.4344, 0.4156 and 0.4138 at the $p=0.001$ level. CASF thus outperforms the popular Random Sampling and Heuristic Sampling at both significance levels.
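One way to read this measure is as the fraction of system pairs for which the significance decision on the sampled subset agrees with the decision on the full data. The sketch below follows that reading, which is an assumption and not a definition taken from the paper. The test uses a normal approximation of the Wilcoxon signed-rank statistic without tie or continuity correction, and the three systems are toy data constructed for illustration.

```python
import math
from itertools import combinations


def wilcoxon_p(x, y):
    """Two-sided Wilcoxon signed-rank p-value (normal approximation, uncorrected)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # tied magnitudes share the average rank
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return math.erfc(abs(w_plus - mean) / sd / math.sqrt(2))


def retention_accuracy(scores, subset, alpha):
    """Share of system pairs whose significance decision matches on subset and full data."""
    pairs = list(combinations(range(len(scores)), 2))
    agree = 0
    for a, b in pairs:
        full_sig = wilcoxon_p(scores[a], scores[b]) < alpha
        sub_a = [scores[a][i] for i in subset]
        sub_b = [scores[b][i] for i in subset]
        agree += full_sig == (wilcoxon_p(sub_a, sub_b) < alpha)
    return agree / len(pairs)


# Toy data (illustrative): 3 systems scored on 40 samples.
n = 40
sys_a = [(i % 3) + 2 for i in range(n)]
sys_b = [v + 1 for v in sys_a]  # always one point above system A
sys_c = [v + (1 if (i // 2) % 2 == 0 else -1) for i, v in enumerate(sys_a)]
scores = [sys_a, sys_b, sys_c]
subset = list(range(0, n, 2))   # a 50% subset

acc_05 = retention_accuracy(scores, subset, alpha=0.05)
acc_001 = retention_accuracy(scores, subset, alpha=0.001)
print(acc_05, round(acc_001, 4))
```

In this toy setting one pair that is significant on the full data at the stricter level is no longer significant on the half-size subset, which mirrors the lower retention accuracies reported at $p=0.001$.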

Limitations and Future Work

Accurate and reliable evaluation of models is an important aspect of NLG research and practical applications. We make human evaluation more reliable with limited annotation cost and labor. However, some limitations remain. On the one hand, the quality of samples is predicted by the Learner from automatic metric features, which are easy to calculate in practice. The information carried by automatic metrics may not be comprehensive enough to represent the quality of a sample. Future work could introduce further characteristics of samples, such as the length of the generated text and its lexical complexity, to make the estimate of sample quality more comprehensive. Similarly, future work could take more information about redundancy into account. On the other hand, since reliable human evaluation is important for NLP tasks that lack reliable automatic metrics, we focus on the problem of reliable human evaluation in NLG tasks. However, CASF may be applicable to other NLP tasks, and we would like to extend it to more NLP tasks in future work. Finally, because training the Learner requires a certain sample size, our approach may not be applicable when the sample size is extremely small, for example below 50. In such cases, evaluation costs are typically lower, and a full-scale assessment could be considered.
