Better than Random: Reliable NLG Human Evaluation with
Constrained Active Sampling
Abstract
Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers in practice usually perform human evaluation on a small subset of data sampled from the whole dataset. However, different selected subsets lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples that yield a more correct inter-system ranking. Experimental results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.
![Figure 1](extracted/5661293/motivation.jpg)
Introduction
Evaluation of NLG systems remains challenging. The reason is that similar content can often be expressed in various ways, and the same output of an NLG system may need to satisfy multiple goals in different aspects (2020; 2022). Hence, reliable automatic metrics are complex to design (2017; 2009). Human evaluation is generally considered a more reliable way to evaluate natural language generation tasks (2020; 2018; 2015). However, human judgment is viewed as expensive and time-consuming, and it lacks standardized evaluation procedures (2020; 2020; 2022).
To save labor and costs, human evaluation is usually performed in practice on a small subset sampled from the dataset. Researchers compare the average scores of the systems on this subset to obtain a ranking between the systems. However, different sample subsets lead to different rankings of the systems. We re-evaluated 137 real NLG evaluation setups on 44 human metrics across 16 datasets and 5 NLG tasks. Results show that 87.5% of datasets have different inter-system rankings across 5 rounds of random sampling. Since research is driven by evaluation and focuses on the final ranking of systems, it is vital to design a more reliable evaluation method that obtains the correct inter-system ranking.
We randomly select 1404 papers from ACL, EMNLP and COLING in the past 2 years and find that 270 papers select a subset of the dataset for manual evaluation to save labor and cost (details are in the Survey section of the Appendix). The survey results show that random sampling is the most common sampling method, accounting for 60.7%, and the remaining 39.3% of the papers do not mention the sampling method they used. Random sampling is widely used in human evaluation for its simplicity. However, random sampling can be risky (2022). On the one hand, random sampling can lead to clustered selection, a phenomenon in which randomly selected samples are unusually close together in a population (as shown by the black and purple circles in Figure 1). On the other hand, random sampling carries the risk of data manipulation. Researchers can choose samples at will or conduct random sampling multiple times to select a favorable subset, which leads to unfair evaluation results. Since different sampling subsets may result in different inter-system rankings in human judgment, it is difficult to reliably select the best system. We urgently need a better sampling method to deliver reliable human evaluation with low labor and cost.
In this paper, we focus on improving the reliability of the gold standard human evaluation with limited cost and time for human annotation. Specifically, we explore the problems of clustered selection and data manipulation in manual evaluation sampling and propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. The proposed CASF consists of a Learner, a Systematic Sampler and a Constrained Controller. CASF obtains a representative subset of samples over multiple sampling phases. In each sampling phase, the Learner predicts the quality score of each sample and feeds it to the Systematic Sampler. Then, the Systematic Sampler and the Constrained Controller work together to select representative samples with lower redundancy for the sampling phase. Samples collected in each phase do not duplicate those collected in previous phases. They are directly subjected to human evaluation, and the newly labeled samples are also used to update the Learner.
The main contributions are as follows: 1) We investigate and experimentally analyze the sampling problem for the gold standard human evaluation in natural language generation. 2) We propose a Constrained Active Sampling Framework (CASF) for the sampling problem in manual evaluation. The proposed CASF addresses the problems of clustered selection and data manipulation in human evaluation sampling. 3) We re-evaluate 137 real NLG evaluation setups on 44 human evaluation metrics across 16 datasets and 5 NLG tasks. Experimental results demonstrate that the proposed method ranks first or second on 90.91% of the human metrics and achieves 93.18% top-ranked system recognition accuracy. To ease the adoption of reliable sampling, we release a constrained active sampling tool. We strongly recommend using CASF to sample test instances for human evaluation. Our tool, code and data are publicly available online at https://github.com/EnablerRx/CASF.
Methodology
Problem Statement
The goal of sampling in human evaluation is to select a subset with the intention of estimating the inter-system ranking of the whole sample population. Ideally, the obtained subset should cover more representative samples of the population. A good sampling method will result in a more correct inter-system ranking calculated through the sampling subset.
The general evaluation sampling problem is as follows. We are given a dataset $D = \{(x_i, Y_i, H_i)\}_{i=1}^{N}$, where $N$ is the size of the whole sample population, $x_i$ represents a data input, $Y_i$ is the corresponding set of generated outputs, and $H_i$ is the corresponding set of human score vectors. The generated output set $Y_i$ consists of $K$ system outputs and is denoted as $Y_i = \{y_i^1, \dots, y_i^K\}$, where $y_i^k$ represents the $k$-th system generated output of the $i$-th sample. The human score vector set $H_i$ consists of the corresponding human score vector for each system output and is denoted as $H_i = \{h_i^1, \dots, h_i^K\}$. Since human evaluation is usually carried out in multiple aspects, we use a vector to represent human evaluation results from multiple aspects for each system. Each human score vector consists of human annotation metrics from $M$ different aspects and is denoted as $h_i^k = (h_i^{k,1}, \dots, h_i^{k,M})$. Eventually there will be a separate inter-system ranking on each aspect. Let $S \subseteq D$ represent the final sample subset. Function $\mathrm{Rank}(\cdot)$ calculates the mean scores of each system in the sample set for each human evaluation aspect and gives the ranking among systems. $\mathrm{Sim}(\cdot, \cdot)$ calculates the similarity between two inter-system rankings. The overall objective of sampling and its constraint are as follows:

$$\max_{S \subseteq D} \; \mathrm{Sim}\big(\mathrm{Rank}(S), \mathrm{Rank}(D)\big) \quad \text{s.t.} \quad |S| = r \cdot |D|$$

where $r$ is the sampling rate, $|\cdot|$ refers to the cardinality of a sample set, and $\mathrm{Rank}(\cdot)$ first calculates the average human scores in each aspect of each system in the sample set, and then gives the inter-system ranking of each human indicator according to the mean score of each system.

![Figure 2](extracted/5661293/framework.png)
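To make the ranking function concrete, the following minimal sketch ranks systems by their mean human score on a chosen subset, for a single aspect. The function name `rank_systems` and the toy scores are hypothetical and only illustrate how two subsets of the same size can yield different inter-system rankings.

```python
from statistics import mean


def rank_systems(scores, subset):
    """Return system indices ordered best-first by mean human score on `subset`.

    scores[i][k] is the human score of system k on sample i (one aspect).
    """
    n_systems = len(scores[0])
    means = [mean(scores[i][k] for i in subset) for k in range(n_systems)]
    # Sort by descending mean; ties are broken by system index for determinism.
    return sorted(range(n_systems), key=lambda k: (-means[k], k))


# Toy data (hypothetical): 6 samples, 3 systems, one human aspect.
scores = [
    [4, 3, 1],
    [5, 3, 2],
    [2, 4, 1],
    [4, 2, 3],
    [5, 3, 2],
    [1, 5, 2],
]
full_ranking = rank_systems(scores, range(len(scores)))
subset_a_ranking = rank_systems(scores, [0, 1, 3])  # agrees with the full ranking
subset_b_ranking = rank_systems(scores, [0, 2, 5])  # swaps the top two systems
```

Both subsets cover 50% of the toy population, yet only the first one reproduces the ranking obtained on all samples.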
Sample Representativeness
Taking representative samples allows for a more complete evaluation of the overall performance of the system. Inspired by the theoretical model of summarization (Peyrard 2019), the Representativeness of samples can be measured in two aspects, including Quality Diversity and Redundancy. Quality Diversity represents the diversity of sample quality levels, that is, the sampled subset should contain samples of various quality levels. Evaluation on qualitatively diverse subsets of samples allows the system to better reflect the performance of all samples. Quality is the average quality of generated outputs of the sample. More comprehensive coverage of samples of different qualities will result in a better Quality Diversity. Redundancy indicates the degree of similarity or duplication among the generated outputs of samples.
Constrained Active Sampling Framework
Overall Framework
The proposed Constrained Active Sampling Framework aims to select representative samples for human evaluation in multiple phases to get a more correct inter-system ranking. The proposed CASF operates through a Learner, a Systematic Sampler and a Constrained Controller. The goal of the Learner is to predict the quality of samples and give a ranking of sample quality by a regressor. The Systematic Sampler divides samples into multiple buckets according to the sample quality ranking given by the Learner. The Constrained Controller controls the Redundancy of samples and selects a final sample from each bucket given by the Systematic Sampler.
The proposed Constrained Active Sampling Framework is shown in Figure 2. There are several sampling phases indexed by $t$: one preliminary sampling phase (the left branch in Figure 2) and a number of batch active sampling phases (the right branch in Figure 2). In the preliminary sampling phase, substitute quality scores for all samples are calculated through an automatic metric, as the Learner is not yet ready to use. The Systematic Sampler then selects a small preliminary subset of samples as part of the final sample subset according to the given quality ranking. The selected samples are then evaluated by human annotators. In each batch active sampling phase, the samples selected in all previous phases, together with the corresponding human scores, are fed to the regressor of the Learner. The regressor is updated and then applied to predict the quality of the remaining samples, using each sample's scores over various automatic metrics as features. After that, the Systematic Sampler and the Constrained Controller work together to choose a batch subset $S_t$ from the remaining samples for the $t$-th batch active sampling phase as part of the final samples. Then, the samples selected in the $t$-th phase are subjected to human evaluation for use in the subsequent sampling phases. The final sample set $S$ consists of the batch subsets $S_t$ from each phase $t$. We conduct experiments to explore the determination of the number of phases and the sampling ratio of each phase in the Phases and Associated Sampling Ratios section.
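The phase structure described above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the released implementation: a one-feature least-squares line stands in for the Learner's regressor, the Constrained Controller is omitted, and `annotate` is a hypothetical callback that returns the human score of a sample. It also assumes each phase has at least `per_phase` remaining samples.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b (a stand-in for the Learner's regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return a, my - a * mx


def pick_evenly(ranked_ids, n_select):
    """Take the first id of each of n_select contiguous rank buckets."""
    n = len(ranked_ids)
    return [ranked_ids[j * n // n_select] for j in range(n_select)]


def multi_phase_sampling(metric, annotate, n_phases, per_phase):
    """metric: dict sample id -> automatic metric score."""
    # Preliminary phase: rank all samples by the automatic metric alone.
    remaining = sorted(metric, key=metric.get)
    labeled = {}
    for phase in range(n_phases):
        if phase > 0:
            # Batch active phase: refit on all labeled samples, re-rank the rest.
            ids = list(labeled)
            a, b = fit_line([metric[i] for i in ids], [labeled[i] for i in ids])
            remaining.sort(key=lambda i: a * metric[i] + b)
        for i in pick_evenly(remaining, per_phase):
            labeled[i] = annotate(i)  # human evaluation of the new batch
        # Samples selected in earlier phases are never selected again.
        remaining = [i for i in remaining if i not in labeled]
    return list(labeled)


# Hypothetical toy run: 20 samples, 5 phases of 2 samples each.
toy_metric = {i: i / 20 for i in range(20)}
selected = multi_phase_sampling(toy_metric, lambda i: 2 * toy_metric[i] + 1, 5, 2)
```

Each phase draws from the samples not yet labeled, so the final subset is the union of disjoint per-phase batches.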
Learner and Sample Quality
Estimating the quality of the samples is a vital step in CASF. Since the quality of samples is difficult to define and calculate directly, we propose a Learner to predict the human scores as the quality scores of the remaining samples available for selection at each phase (except the preliminary phase). As various automatic metrics can measure the characteristics of samples in different aspects and are cheap to calculate, we use the scores of automatic metrics as features to predict the quality of samples.
Note that in the preliminary phase, the quality of samples is simply estimated by an automatic metric. In each of the batch active sampling phases, the Learner receives feedback from human annotators and updates its parameters. After that, it utilizes the scores of automatic metrics to predict the quality score of each sample. The Learner then provides the quality ranking $\{q_i\}_{i=1}^{N_t}$ of the samples at each batch phase $t$, where $i$ is the sample index and $N_t$ is the number of remaining samples available for selection in that phase.
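The main objective of the Learner is to map the automatic metric scores of a sample, written $a_i$, to the corresponding human score vector set $H_i$. Since there are multiple elements in $H_i$, we standardize scores for each human evaluation aspect and use the sum of all elements in $H_i$, which is the sum of human scores for all aspects of all NLG systems under sample $x_i$, to represent $H_i$. We write this sum as $\tilde{h}_i$. The objective is to minimize the following loss function:

$$\mathcal{L}(\theta_t) = \sum_{i=1}^{n} \big(g_{\theta_t}(a_i) - \tilde{h}_i\big)^2$$

where $n$ is the number of samples selected in the final subset, $g_{\theta_t}$ is the regressor of the Learner and $\theta_t$ is the parameter of the Learner in the $t$-th phase. The predicted quality scores for the remaining samples at each phase are calculated as follows:

$$\hat{q}_i = g_{\theta_t}(a_i)$$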
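The construction of the Learner's regression target can be illustrated as follows. This is a minimal sketch that assumes z-score standardization per aspect over all labeled system outputs; the source only states that scores are standardized, so the exact scheme and the name `learner_targets` are assumptions.

```python
from statistics import mean, pstdev


def learner_targets(human_scores):
    """Sum of per-aspect standardized human scores for each sample.

    human_scores[i][k][m] is the score of system k on sample i for aspect m.
    """
    n_aspects = len(human_scores[0][0])
    stats = []
    for m in range(n_aspects):
        # Pool the scores of aspect m over all labeled samples and systems.
        values = [sys_scores[m] for sample in human_scores for sys_scores in sample]
        std = pstdev(values)
        stats.append((mean(values), std if std > 0 else 1.0))
    targets = []
    for sample in human_scores:
        # One scalar per sample: all aspects of all systems, standardized and summed.
        targets.append(sum((sys_scores[m] - stats[m][0]) / stats[m][1]
                           for sys_scores in sample for m in range(n_aspects)))
    return targets


# Hypothetical toy data: 2 samples, 2 systems, 2 aspects.
toy_scores = [
    [[5, 4], [4, 5]],  # sample 0: both systems score high
    [[1, 2], [2, 1]],  # sample 1: both systems score low
]
toy_targets = learner_targets(toy_scores)
```

Standardizing per aspect keeps an aspect with a wider score scale from dominating the summed target.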
Specifically, the Learner first calculates the results of each automatic metric based on the output of each NLG system for the input sample. Then, the automatic metric results under each NLG system are fed as features into the Learner's regressor. Eight popular NLG metrics are chosen as the automatic metric set of CASF (details are in the Automatic Metric for Preliminary Phase section). Because the number of samples is small and the features mainly consist of automatic metric scores, we explore several popular learning methods and recommend choosing Gradient Boosting Decision Tree (GBDT) (Friedman 2001) as the regressor of the Learner. Full experimental results are in the Learner Selection section of the Appendix. The loss function is the least-squares loss (2007), which is commonly used in GBDT.
Systematic Sampler
Systematic sampling has the advantage of eliminating the clustered selection problem and can reduce the risk of favoritism, which meets our motivation. Therefore, we adopt the systematic sampling method (Yates 1948), with samples ordered by a related indicator, as the sampling core of CASF. The Systematic Sampler selects representative initial samples and candidate samples according to the quality ranking of samples. Specifically, the Systematic Sampler first divides the $N_t$ samples available in the $t$-th phase into $n_t$ buckets according to the given quality ranking, where $n_t$ is the number of samples to be selected at the $t$-th phase. Samples whose quality ranks fall within the same contiguous interval are placed in the same bucket, so that each bucket holds about $N_t / n_t$ samples. One sample per bucket, taken at a regular rank interval, is selected as the initial selection sample, and the remaining samples in each bucket are candidate samples.
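The bucketing step can be sketched in a few lines. The function names are hypothetical, and the position of the initial selection sample inside each bucket (`offset`) is an assumption, since the exact rank formula is not recoverable here. The sketch also assumes there are at least as many samples as buckets.

```python
def systematic_buckets(ranked_ids, n_select):
    """Split ids (already sorted by predicted quality) into n_select contiguous buckets."""
    n = len(ranked_ids)
    return [ranked_ids[j * n // n_select:(j + 1) * n // n_select]
            for j in range(n_select)]


def initial_selection(buckets, offset=0):
    """Pick the sample at the same rank offset in every bucket (assumed position)."""
    return [bucket[min(offset, len(bucket) - 1)] for bucket in buckets]


# Hypothetical example: 9 samples ranked by quality, 3 to be selected.
toy_buckets = systematic_buckets(list("abcdefghi"), 3)
toy_initial = initial_selection(toy_buckets, offset=1)
```

Because every bucket contributes exactly one sample, the selection spans the whole quality range instead of clustering in one region.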
Constrained Controller
The proposed Constrained Controller controls the Redundancy of samples and selects one sample from each of the buckets divided by the Systematic Sampler to form a final sample subset (as shown in Figure 3). Since the Systematic Sampler selects initial samples at a regular interval, which makes the distribution of the initial subset align closely with the overall distribution, we aim to preserve the original sampling intervals as much as possible while controlling the Redundancy to maintain the representativeness of the sample subset.
Specifically, we define the objective function $f(x)$ as the quality ranking distance between the current sample $x$ and the initial selection sample in its bucket. We also define the violation function $v(x)$ to calculate the Redundancy between the current sample and the final samples already selected. Since bi-gram similarity (Kondrak 2005) is regarded as a simple and effective method to calculate the redundancy between texts, we calculate the Redundancy as the bi-gram similarity between the outputs generated for the sample and those generated for the final samples. A sample $x$ is called feasible if $v(x)$ indicates that it is not redundant with the selected final samples. Otherwise, $x$ is infeasible.
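A redundancy check based on bi-gram overlap might look like the sketch below. It uses a Dice coefficient over word bi-gram sets as a simple stand-in, which is not necessarily the exact measure of Kondrak (2005), and the `threshold` that separates redundant from non-redundant samples is a hypothetical parameter.

```python
def bigrams(text):
    """Set of adjacent word pairs in a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def bigram_similarity(a, b):
    """Dice overlap of word bi-gram sets, in [0, 1]."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def violation(candidate_outputs, selected_outputs, threshold=0.5):
    """Amount by which the most similar pair of outputs exceeds the threshold."""
    worst = max((bigram_similarity(c, s)
                 for c in candidate_outputs for s in selected_outputs),
                default=0.0)
    return max(0.0, worst - threshold)
```

With this definition a violation of 0 marks a candidate as feasible, and larger values mean stronger overlap with the samples already chosen.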
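The Constrained Controller is summarized into 3 rules:

Rule 1: $x_i \prec x_j$ if $x_i$ is feasible and $x_j$ is infeasible.
Rule 2: $x_i \prec x_j$ if $x_i$ and $x_j$ are both infeasible and $v(x_i) < v(x_j)$.
Rule 3: $x_i \prec x_j$ if $x_i$ and $x_j$ are both feasible and $f(x_i) < f(x_j)$.

where $x_i$ is the $i$-th sample, $f$ is the objective function, $v$ is the violation function, and $x_i \prec x_j$ means $x_i$ is a better choice. Rule 1 means the Constrained Controller tends to select samples that are not redundant. Rule 2 represents that if two samples are both redundant with the final samples, the Constrained Controller tends to select the sample with less redundancy. Rule 3 demonstrates that if two samples are both not redundant with the final samples, the Constrained Controller tends to select the sample whose rank is as close as possible to that of the initial selection sample.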
![Figure 3](extracted/5661293/constrainedControler.png)
In Figure 3, the remaining samples available for selection are first re-indexed, and then re-ordered according to the Learner's predicted quality score. The Systematic Sampler divides the samples into three buckets based on quality ranking and marks the initial selection sample for each bucket. In the first bucket, only sample 3 is feasible, that is, sample 3 is not redundant with the existing final samples. Thus, sample 3 is selected as the final sample according to Rule 1. In the second bucket, none of the three samples is feasible, so sample 0, which has the smallest redundancy, is selected as the final sample according to Rule 2. In the third bucket, all samples are feasible, and sample 5 is the initial selection sample, so it is selected by default, or equivalently according to Rule 3.
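The three preference rules can be expressed as a pairwise comparison. In this sketch each candidate carries a pair `(f, v)`, where `f` is its rank distance to the bucket's initial selection sample and `v` is its redundancy violation. A candidate is assumed to be feasible when `v == 0`, and the numeric values below are hypothetical, chosen only to mirror the three buckets described for Figure 3.

```python
def better(a, b):
    """Return True if candidate a = (f, v) is preferred over candidate b."""
    fa, va = a
    fb, vb = b
    a_feasible, b_feasible = va == 0, vb == 0
    if a_feasible != b_feasible:
        return a_feasible      # Rule 1: a feasible sample beats an infeasible one
    if not a_feasible:
        return va < vb         # Rule 2: both infeasible, prefer less redundancy
    return fa < fb             # Rule 3: both feasible, prefer the closer rank


def select_from_bucket(bucket):
    """bucket: dict mapping sample id -> (f, v). Returns the preferred sample id."""
    best = None
    for sample_id, candidate in bucket.items():
        if best is None or better(candidate, bucket[best]):
            best = sample_id
    return best


# Hypothetical (f, v) values for three buckets.
bucket_1 = {7: (0, 0.4), 3: (1, 0.0), 1: (1, 0.2)}  # only sample 3 is feasible
bucket_2 = {0: (1, 0.1), 4: (0, 0.3), 8: (1, 0.6)}  # no sample is feasible
bucket_3 = {2: (1, 0.0), 5: (0, 0.0), 6: (1, 0.0)}  # all feasible, 5 is initial
picks = [select_from_bucket(b) for b in (bucket_1, bucket_2, bucket_3)]
```

The comparison never trades redundancy against rank distance: rank distance only matters once both candidates are free of redundancy.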
Dataset | HE Metric | R1 | R2 | R3 | R Avg | H1 | H2 | H3 | H Avg | 8M | SM | OL | CASF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SummEval | coherence | 0.85 | 0.65 | 0.33 | 0.61 | 0.70 | 0.82 | 0.92 | 0.81 | 0.42 | 0.42 | 0.87 | 0.95 |
SummEval | consistency | 0.25 | 0.48 | 0.43 | 0.39 | 0.68 | 0.02 | 0.65 | 0.45 | 0.30 | 0.17 | 0.53 | 0.53 |
SummEval | fluency | 0.40 | 0.35 | 0.52 | 0.42 | 0.45 | 0.45 | 0.30 | 0.40 | 0.35 | 0.37 | 0.52 | 0.33 |
SummEval | relevance | 0.72 | 0.60 | 0.68 | 0.67 | 0.65 | 0.43 | 0.72 | 0.60 | 0.40 | 0.60 | 0.45 | 0.82 |
REALSumm | litepyramid | 0.39 | 0.54 | 0.44 | 0.46 | 0.36 | 0.38 | 0.44 | 0.39 | 0.33 | 0.37 | 0.54 | 0.54 |
NeR18 | coherence | 1.00 | 1.00 | 0.43 | 0.81 | 0.90 | 0.90 | 0.90 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 |
NeR18 | fluency | 0.52 | 1.00 | 1.00 | 0.84 | 1.00 | 0.52 | 0.90 | 0.81 | 1.00 | 1.00 | 1.00 | 1.00 |
NeR18 | informativeness | 1.00 | 1.00 | 1.00 | 1.00 | 0.71 | 1.00 | 0.90 | 0.87 | 0.71 | 1.00 | 1.00 | 1.00 |
NeR18 | relevance | 1.00 | 0.52 | 1.00 | 0.84 | 0.90 | 0.90 | 0.90 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 |
DialSumm | consistency | 0.74 | 0.72 | 0.49 | 0.65 | 0.74 | 0.64 | 0.62 | 0.67 | 0.59 | 0.56 | 0.54 | 0.77 |
DialSumm | relevance | 0.69 | 0.46 | 0.64 | 0.60 | 0.64 | 0.69 | 0.54 | 0.62 | 0.23 | 0.44 | 0.59 | 0.72 |
DialSumm | fluency | 0.59 | 0.56 | 0.59 | 0.58 | 0.38 | 0.56 | 0.51 | 0.49 | 0.15 | 0.49 | 0.64 | 0.62 |
DialSumm | coherence | 0.67 | 0.80 | 0.74 | 0.74 | 0.74 | 0.80 | 0.59 | 0.71 | 0.59 | 0.67 | 0.82 | 0.90 |
OpenAI 1 | accuracy | 0.80 | 0.00 | 1.00 | 0.60 | 0.80 | 1.00 | 0.80 | 0.87 | 0.80 | 0.00 | 0.00 | 1.00 |
OpenAI 1 | coherence | 0.40 | 0.80 | 0.00 | 0.40 | 0.80 | 0.20 | 0.80 | 0.60 | 0.80 | 0.40 | 0.20 | 0.80 |
OpenAI 1 | coverage | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.80 | 0.80 | 0.80 | 0.80 | 1.00 | 0.80 | 0.80 |
OpenAI 1 | overall | 0.80 | 1.00 | 1.00 | 0.93 | 0.80 | 1.00 | 0.80 | 0.87 | 0.80 | 1.00 | 0.80 | 1.00 |
OpenAI 2 | accuracy | 0.71 | 0.43 | 1.00 | 0.71 | 0.62 | 0.71 | 0.81 | 0.71 | 1.00 | 0.52 | 0.14 | 0.90 |
OpenAI 2 | coherence | 0.24 | 0.52 | 0.33 | 0.37 | -0.14 | 0.24 | 0.43 | 0.17 | 0.24 | 0.52 | 0.24 | 0.43 |
OpenAI 2 | coverage | 1.00 | 0.71 | 0.90 | 0.87 | 1.00 | 0.90 | 1.00 | 0.97 | 1.00 | 1.00 | 1.00 | 1.00 |
OpenAI 2 | overall | 0.90 | 0.71 | 1.00 | 0.87 | 0.62 | 1.00 | 0.90 | 0.84 | 0.90 | 0.90 | 0.90 | 1.00 |
OpenAI 3 | accuracy | 0.73 | 0.82 | 0.82 | 0.79 | 0.87 | 0.78 | 0.82 | 0.82 | 0.73 | 0.69 | 0.78 | 0.87 |
OpenAI 3 | coherence | 0.51 | 0.33 | 0.56 | 0.47 | 0.42 | 0.51 | 0.56 | 0.50 | 0.56 | 0.20 | 0.60 | 0.60 |
OpenAI 3 | coverage | 0.38 | 0.38 | 0.87 | 0.54 | 0.51 | 0.87 | 0.51 | 0.63 | 1.00 | 1.00 | 0.42 | 0.87 |
OpenAI 3 | overall | 0.87 | 0.51 | 1.00 | 0.79 | 1.00 | 0.73 | 0.51 | 0.75 | 1.00 | 0.38 | 0.47 | 1.00 |
OpenAI 4 | accuracy | 1.00 | 0.33 | 1.00 | 0.78 | 1.00 | 0.33 | 0.33 | 0.56 | 0.33 | 1.00 | 0.33 | 1.00 |
OpenAI 4 | coherence | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.33 | 1.00 |
OpenAI 4 | coverage | 0.33 | 1.00 | 1.00 | 0.78 | 0.33 | 1.00 | 1.00 | 0.78 | 1.00 | 1.00 | 1.00 | 1.00 |
OpenAI 4 | overall | 0.33 | 1.00 | 1.00 | 0.78 | 0.33 | 1.00 | 1.00 | 0.78 | 1.00 | 1.00 | 1.00 | 1.00 |
newstest 1 | MQM | 0.14 | 0.14 | 0.14 | 0.14 | 0.33 | 0.14 | -0.05 | 0.14 | 0.14 | 0.33 | 0.14 | 0.14 |
newstest 1 | pSQM | 0.81 | 0.90 | 0.90 | 0.87 | 0.81 | 0.90 | 0.90 | 0.87 | 1.00 | 0.90 | 0.90 | 1.00 |
newstest 2 | MQM | 0.79 | 0.93 | 0.71 | 0.81 | 0.64 | 0.86 | 0.71 | 0.74 | 0.14 | 0.93 | 0.86 | 0.93 |
newstest 2 | pSQM | 0.43 | 0.36 | 0.79 | 0.52 | 0.29 | 0.86 | 0.43 | 0.52 | 0.36 | 0.93 | 0.79 | 0.79 |
newstest 3 | MQM | 0.00 | -0.13 | -0.05 | -0.06 | -0.05 | -0.03 | -0.05 | -0.04 | 0.46 | 0.13 | 0.00 | 0.03 |
Persona | Understandable | 0.33 | -1.00 | 0.33 | -0.11 | -1.00 | 0.33 | 0.33 | -0.11 | 0.33 | 0.33 | 0.33 | 0.33 |
Persona | Natural | 0.33 | -1.00 | 1.00 | 0.11 | 1.00 | -1.00 | 0.33 | 0.11 | 0.33 | 0.33 | 0.33 | 1.00 |
Persona | Maintains Context | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | -1.00 | 1.00 | 1.00 | 1.00 |
Persona | Interesting | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.33 | 1.00 | 0.78 | 1.00 | 1.00 | 1.00 | 1.00 |
Persona | Uses Knowledge | 1.00 | 1.00 | 1.00 | 1.00 | -1.00 | 1.00 | 1.00 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 |
Persona | Overall Quality | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
MANS-ROC | overall | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
MANS-WP | overall | 1.00 | 0.80 | 0.80 | 0.87 | 0.80 | 1.00 | 1.00 | 0.93 | 1.00 | 1.00 | 1.00 | 1.00 |
THUMB | overall | 1.00 | 0.80 | 1.00 | 0.93 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
VATEX | consistency | 0.60 | 1.00 | 0.60 | 0.73 | 0.60 | 1.00 | 1.00 | 0.87 | 1.00 | 1.00 | 1.00 | 1.00 |
Overall Performance | | 0.69 | 0.61 | 0.75 | 0.68 | 0.61 | 0.67 | 0.65 | 0.72 | 0.67 | 0.72 | 0.68 | 0.83 |
Experimental Setup
Tasks and Datasets
We conduct experiments on 44 human metrics across 16 datasets spanning 5 tasks. A total of 137 NLG systems are involved. Details of the datasets, preprocessing and the validation set for hyper-parameter selection are in the Tasks and Dataset section of the Appendix. The datasets are as follows. Summarization (SUM): We utilize 8 human evaluation datasets of model-generated summaries, which are SummEval (2021), REALSumm (2020), Newsroom (NeR18) (2018), DialSummEval (DialSumm) (2022), OpenAI-axis1 (OpenAI 1) (Stiennon et al. 2020; Völske et al. 2017), OpenAI-axis2 (OpenAI 2), OpenAI-CNN/DM1 (OpenAI 3), and OpenAI-CNN/DM3 (OpenAI 4). Machine Translation (MT): We use 3 datasets collected from WMT news translation tasks (2021), namely newstest2020 en-de (newstest 1), newstest2020 cn-en (newstest 2) and newstest2021 cn-en (newstest 3). Dialogue Generation (DGen): We utilize a human annotation dataset of machine-generated dialogues released with the Persona Chat (Persona) (Mehri and Eskenazi 2020) dataset. Story Generation (SGen): We use two manual evaluation datasets for story generation, namely MANS-ROC (Guan et al. 2021) and MANS-WP (Guan et al. 2021). Multi-Modal Generation (MMGen): We use two existing human evaluation datasets, namely THUMB-MSCOCO (THUMB) (Kasai et al. 2022) and VATEX-EVAL (VATEX) (Shi et al. 2022).
Evaluation Metric
We select a subset of each dataset and then compute the results for all the human metrics in their various aspects. We measure the efficacy of a sampling method by computing the rankings of candidate models on the subset and their Kendall's Tau correlation (1938) with the rankings obtained on the full dataset. We follow Kendall's treatment (1945) to handle ties.
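For reference, a tie-aware Kendall correlation can be computed as in the sketch below. It implements the tau-b variant, which is the usual reading of Kendall's 1945 treatment of ties. Treating it as the exact variant used in the experiments is an assumption.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score (or rank) lists."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue               # tied in both lists: drops out of both factors
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + only_y_tied
    untied_in_y = concordant + discordant + only_x_tied
    denom = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else float("nan")
```

With three systems and no ties the statistic can only take the values -1, -1/3, 1/3 and 1, which helps explain why datasets with few systems show large jumps between sampling runs.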
Comparison of Methods
The comparison methods are selected based on the survey of evaluation sampling methods in 1404 papers, in which Random and Heuristic are the main sampling methods for NLG human evaluation. We also include some ablation methods. The comparison methods are as follows. Random Sampling (R) randomly samples the dataset and is performed 3 times (2008; 2022; 2007) to reflect real sampling scenarios. The result of each run and the average result are recorded. Heuristic Sampling (H) (2022) first sorts the samples according to the average length of the generated sentences. Then, it randomly collects a small number of samples with extreme sentence length and a large number of samples with normal sentence length. Heuristic is also performed 3 times. Eight Metric (8M): CASF with only the preliminary sampling phase, which normalizes the scores obtained by the 8 automatic metrics used in CASF and calculates the average score. Single Metric (SM): CASF with only the preliminary sampling phase, which uses the automatic metric used in the preliminary sampling phase of CASF. Online Sampling (OL): CASF without the Constrained Controller. We compare methods at a 50% sampling rate. Results for other sampling ratios are in the Different Sampling Ratio section of the Appendix. In addition, the number of phases and the sampling ratio of each phase are 5 and 10%, respectively. The determination of these parameters is shown in the Phases and Associated Sampling Ratios section. We also treat the sample size as an independent variable, and the results are shown in the Appendix.
Results and Analysis
Method | SUM | MT | DGen | SGen | MMGen | Overall |
---|---|---|---|---|---|---|
R | 0.76 | 0.87 | 0.78 | 0.67 | 1.00 | 0.76 |
H | 0.80 | 0.67 | 0.78 | 0.67 | 1.00 | 0.78 |
8M | 0.83 | 0.80 | 0.83 | 1.00 | 1.00 | 0.84 |
SM | 0.90 | 1.00 | 0.83 | 1.00 | 1.00 | 0.91 |
OL | 0.69 | 0.80 | 1.00 | 1.00 | 1.00 | 0.77 |
CASF | 0.93 | 0.80 | 1.00 | 1.00 | 1.00 | 0.93 |
Comparison Results
Full Inter-System Ranking Accuracy
According to results on the validation set (Automatic Metrics for Preliminary Sampling Phase section of the Appendix), we select MOVER-SCORE (Zhao et al. 2019) for calculating sample quality in the preliminary sampling phase. The inter-system ranking accuracy of the methods on 16 datasets across 5 NLG tasks is shown in Table 1. The results show that Random fluctuates widely. For example, on the newstest2020 cn-en dataset of the MT task, different runs of random sampling result in different inter-system correlations. This shows the risk of widely using Random in evaluation. CASF ranks first on 79.55% of human metrics and ranks first or second on 90.91% of metrics. This shows that CASF can better select representative samples to obtain a more accurate ranking. Results on the remaining human metrics, although not ranked first, are still acceptable and close to the best results. This is because we measure a single quality score for each sample in the dataset, while human evaluation in different aspects is conducted on the same dataset. The overall scores can still represent the overall evaluation results. We use the Wilcoxon signed-rank test (2006) to compare the results of Random and Heuristic (both iterated 10000 times) with CASF on the 44 human metrics. Results show that CASF statistically outperforms Random, Heuristic and the other methods.
Top-Ranked System Accuracy
One of the important goals of evaluation is to select the top-ranked system. Accurately selecting the best system with limited manpower can help the NLG field keep good systems and eliminate poor ones. Thus, we explore the ability of CASF to identify the top-ranked system. As shown in Table 2, CASF achieves 93.18% top-ranked system recognition accuracy on 44 human evaluation metrics involving 137 NLG systems. For typical NLG tasks like DGen, SGen and MMGen, CASF achieves 100% identification accuracy. Experimental results also show that CASF statistically outperforms the popular Random and Heuristic methods.
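Top-ranked system recognition accuracy can be computed as the fraction of human metrics for which the subset and the full dataset agree on the best system. The sketch below uses hypothetical function names and toy mean scores, and breaks ties by the lowest system index, which is an assumption.

```python
def top_system(mean_scores):
    """Index of the system with the highest mean score (lowest index on ties)."""
    return max(range(len(mean_scores)), key=lambda k: mean_scores[k])


def top_ranked_accuracy(subset_means, full_means):
    """Fraction of metrics where the subset identifies the same top system."""
    hits = sum(top_system(s) == top_system(f)
               for s, f in zip(subset_means, full_means))
    return hits / len(full_means)


# Hypothetical mean scores of 3 systems on 3 human metrics.
toy_subset = [[3, 2, 1], [1, 2, 3], [2, 3, 1]]
toy_full = [[3, 1, 2], [1, 3, 2], [1, 3, 2]]
toy_accuracy = top_ranked_accuracy(toy_subset, toy_full)
```

As a sanity check on the reported figure, 41 correct metrics out of 44 gives 93.18%.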
M | #P | P-R | B-R | Tau | M | #P | P-R | B-R | Tau | M | #P | P-R | B-R | Tau | M | #P | P-R | B-R | Tau |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 2 | 0.25 | 0.25 | 0.75 | F | 2 | 0.10 | 0.40 | 0.73 | F | 2 | 0.05 | 0.45 | 0.74 | F | 2 | 0.15 | 0.35 | 0.73 |
A | 3 | 0.17 | 0.17 | 0.76 | F | 3 | 0.10 | 0.20 | 0.75 | F | 3 | 0.05 | 0.23 | 0.74 | F | 3 | 0.15 | 0.18 | 0.77 |
A | 4 | 0.13 | 0.13 | 0.76 | F | 4 | 0.10 | 0.13 | 0.80 | F | 4 | 0.05 | 0.15 | 0.77 | F | 4 | 0.15 | 0.12 | 0.76 |
A | 5 | 0.10 | 0.10 | 0.83 | F | 5 | 0.10 | 0.10 | 0.83 | F | 5 | 0.05 | 0.11 | 0.77 | F | 5 | 0.15 | 0.09 | 0.76 |
A | 6 | 0.08 | 0.08 | 0.72 | F | 6 | 0.10 | 0.08 | 0.75 | F | 6 | 0.05 | 0.09 | 0.73 | F | 6 | 0.15 | 0.07 | 0.71 |
A | 7 | 0.07 | 0.07 | 0.72 | F | 7 | 0.10 | 0.07 | 0.69 | F | 7 | 0.05 | 0.08 | 0.69 | F | 7 | 0.15 | 0.06 | 0.73 |
A | 8 | 0.06 | 0.06 | 0.70 | F | 8 | 0.10 | 0.06 | 0.73 | F | 8 | 0.05 | 0.06 | 0.72 | F | 8 | 0.15 | 0.05 | 0.79 |
A | 9 | 0.06 | 0.06 | 0.73 | F | 9 | 0.10 | 0.05 | 0.72 | F | 9 | 0.05 | 0.06 | 0.72 | F | 9 | 0.15 | 0.04 | 0.75 |
A | 10 | 0.05 | 0.05 | 0.75 | F | 10 | 0.10 | 0.04 | 0.73 | F | 10 | 0.05 | 0.05 | 0.75 | F | 10 | 0.15 | 0.04 | 0.70 |
Case Study
![Figure 4](extracted/5661293/caseStudy.png)
Taking the human aspect accuracy in the OpenAI 1 (Stiennon et al. 2020; Völske et al. 2017) dataset as an example, CASF obtains an accurate inter-system ranking, as shown in Figure 4. The three runs of random sampling obtain different inter-system rankings, and the rank of the first system fluctuates between first and fourth, showing great volatility. This confirms the risk of random sampling that we raised, which makes evaluation unreliable. CASF selects the same subset across repeated runs, so the variance of the inter-system ranking accuracy over multiple runs is 0 (Learner Selection section of the Appendix). Since CASF selects representative samples, it obtains more accurate inter-system rankings, making evaluation more reliable.
Automatic Metric for Preliminary Phase
Metric | SUM | MT | DGen | SGen | MMGen | Avg |
---|---|---|---|---|---|---|
BERT-S | 0.74 | 0.58 | 0.67 | 1.00 | 1.00 | 0.73 |
MOVER-S | 0.84 | 0.58 | 0.89 | 1.00 | 1.00 | 0.83 |
ROUGE-1 | 0.73 | 0.57 | 0.67 | 0.30 | 1.00 | 0.70 |
ROUGE-2 | 0.73 | 0.55 | 0.56 | 1.00 | 0.80 | 0.70 |
ROUGE-L | 0.72 | 0.52 | 0.89 | 1.00 | 1.00 | 0.75 |
BART-S | 0.60 | 0.44 | 0.89 | 0.90 | 0.80 | 0.64 |
BLEU | 0.72 | 0.37 | 0.56 | 1.00 | 0.80 | 0.67 |
METEOR | 0.78 | 0.54 | 0.89 | 1.00 | 1.00 | 0.79 |
We choose automatic metrics commonly used in NLG as our automatic metric set, including BERT-SCORE (2019), MOVER-SCORE (2019), ROUGE-1 (2004), ROUGE-2, ROUGE-L, BART-SCORE (2021), BLEU (2002) and METEOR (2005). We apply each metric to calculate sample quality in the preliminary sampling phase of CASF, with results shown in Table 4. Results show that sample quality calculated with MOVER-SCORE yields the most accurate ranking. This demonstrates the ability of the contextual-embedding-based metric MOVER-SCORE to estimate sample quality. The traditional metric METEOR ranks second. Full results are in the Appendix.
Phases and Associated Sampling Ratios
We conduct experiments to explore the influence of the number of phases and the sampling ratio of each phase on CASF. Results at a sampling rate of 50% on 16 datasets are shown in Table 3. In average mode (A), all phases are sampled in equal proportions. In preliminary-fixed mode (F), we fix the preliminary sampling ratio, and the remaining sampling ratio is divided equally among the batch phases according to the number of iteration phases and the total sampling ratio. Results show that performance is better when the number of iteration phases is 5 in most cases. It is simple and effective to sample each phase according to the total sampling rate and the number of phases.
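The two ratio schedules reduce to simple arithmetic, sketched below. The function name `phase_ratios` is hypothetical, and the preliminary-fixed branch assumes at least two phases.

```python
def phase_ratios(total_ratio, n_phases, preliminary_ratio=None):
    """Per-phase sampling ratios, with the preliminary phase first."""
    if preliminary_ratio is None:
        # Average mode: every phase samples the same share of the data.
        return [total_ratio / n_phases] * n_phases
    # Preliminary-fixed mode: the rest is split evenly over the batch phases.
    batch_ratio = (total_ratio - preliminary_ratio) / (n_phases - 1)
    return [preliminary_ratio] + [batch_ratio] * (n_phases - 1)


average_mode = phase_ratios(0.5, 5)          # five phases of 10% each
fixed_mode = phase_ratios(0.5, 3, 0.10)      # 10% preliminary, then 20% per batch
```

At a 50% total rate, five equal phases give 10% per phase, which matches the setting used in the main experiments.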
Significant Information Retention Accuracy
Previous work (2022) focused on identifying top-ranked systems. We further explore giving more accurate overall inter-system rankings and test the significant information retention accuracy on sample subsets, that is, whether a subset preserves the significance of the ranking among systems. Results show that CASF outperforms Random and Heuristic. Details are in the Appendix.
Related Work
Previous works (2014; 2015; 2014; 2016) adopt TrueSkill (2006) to rank NLG methods with pairwise human evaluation. Sakaguchi and Van Durme (2018) introduce a method for system quality estimation from pairwise annotation by human judgment. Hashimoto et al. (2019) propose an evaluation mechanism to calculate a model's sampling probabilities. Chaganty, Mussman, and Liang (2018) utilize control variates to obtain an unbiased estimator with lower cost than only using human evaluation. Mendonça et al. (2021) adopt online learning to find the best systems for machine translation. Wei et al. (2022) study the statistical power of pairwise direct assessment comparisons. A recent work (2022) introduces Active Evaluation to identify the top-ranked system with fewer pairwise human annotations. There is still a gap in research on deriving a complete inter-system ranking from the results of direct human scoring for general NLG tasks. Yates (1948) proposed Systematic Sampling. ILDAE (2022) calculates the difficulty score of each sample and uses a simple sampling method for Natural Language Inference. However, ILDAE is not suitable for NLG since there is no direct confidence value in NLG methods. To the best of our knowledge, this paper is the first work to extensively study sampling methods for direct scoring to obtain the whole inter-system ranking in NLG human evaluation.
Conclusion
In this paper, we focus on giving a more correct inter-system ranking for reliable human evaluation with limited time and cost. We propose CASF and show that it improves the overall inter-system Kendall correlation by 41%, to 0.83, compared to the widely used random sampling on 44 human evaluation metrics across 16 datasets in 5 NLG tasks. CASF ranks first or second among all compared methods on up to 90.91% of the human metrics. We release a tool, and we strongly recommend using CASF to obtain a more reliable inter-system ranking in human evaluation.
Acknowledgements
This work was supported by the National Key R&D Program of China (2021YFF0901502), the National Science Foundation of China (No. 62161160339), the State Key Laboratory of Media Convergence Production Technology and Systems, and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.
References
- Abdi et al. (2007) Abdi, H.; et al. 2007. The method of least squares. Encyclopedia of measurement and statistics, 1: 530–532.
- Banerjee and Lavie (2005) Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65–72.
- Begg and Mazumdar (1994) Begg, C. B.; and Mazumdar, M. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics, 1088–1101.
- Bethard (2022) Bethard, S. 2022. We need to talk about random seeds. arXiv preprint arXiv:2210.13393.
- Bhandari et al. (2020) Bhandari, M.; Gour, P.; Ashfaq, A.; Liu, P.; and Neubig, G. 2020. Re-evaluating evaluation in text summarization. arXiv preprint arXiv:2010.07100.
- Bhatnagar, Ganesh, and Kann (2022) Bhatnagar, R.; Ganesh, A.; and Kann, K. 2022. CHIA: CHoosing Instances to Annotate for Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2022, 7299–7315.
- Bojar et al. (2014) Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, 12–58.
- Bojar et al. (2015) Bojar, O.; Chatterjee, R.; Federmann, C.; Haddow, B.; Huck, M.; Hokamp, C.; Koehn, P.; Logacheva, V.; Monz, C.; Negri, M.; Post, M.; Scarton, C.; Specia, L.; and Turchi, M. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 1–46. Lisbon, Portugal: Association for Computational Linguistics.
- Breiman (1996) Breiman, L. 1996. Bagging predictors. Machine learning, 24(2): 123–140.
- Breiman (2001) Breiman, L. 2001. Random forests. Machine learning, 45(1): 5–32.
- Card et al. (2020) Card, D.; Henderson, P.; Khandelwal, U.; Jia, R.; Mahowald, K.; and Jurafsky, D. 2020. With Little Power Comes Great Responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9263–9274.
- Celikyilmaz, Clark, and Gao (2020) Celikyilmaz, A.; Clark, E.; and Gao, J. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.
- Chaganty, Mussman, and Liang (2018) Chaganty, A. T.; Mussman, S.; and Liang, P. 2018. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202.
- Cover and Hart (1967) Cover, T.; and Hart, P. 1967. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1): 21–27.
- Demšar (2006) Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research, 7: 1–30.
- Dušek, Novikova, and Rieser (2018) Dušek, O.; Novikova, J.; and Rieser, V. 2018. Findings of the E2E NLG Challenge. In Proceedings of the 11th International Conference on Natural Language Generation, 322–328. Tilburg University, The Netherlands: Association for Computational Linguistics.
- Fabbri et al. (2021) Fabbri, A. R.; Kryściński, W.; McCann, B.; Xiong, C.; Socher, R.; and Radev, D. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9: 391–409.
- Freitag et al. (2021) Freitag, M.; Foster, G.; Grangier, D.; Ratnakar, V.; Tan, Q.; and Macherey, W. 2021. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. arXiv:2104.14478.
- Freund, Schapire et al. (1996) Freund, Y.; Schapire, R. E.; et al. 1996. Experiments with a new boosting algorithm. In ICML, volume 96, 148–156. Citeseer.
- Friedman (2001) Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232.
- Gao and Wan (2022) Gao, M.; and Wan, X. 2022. DialSummEval: Revisiting Summarization Evaluation for Dialogues. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5693–5709. Seattle, United States: Association for Computational Linguistics.
- Gatt and Krahmer (2018) Gatt, A.; and Krahmer, E. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61: 65–170.
- Geurts, Ernst, and Wehenkel (2006) Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely randomized trees. Machine learning, 63(1): 3–42.
- Gkatzia and Mahamood (2015) Gkatzia, D.; and Mahamood, S. 2015. A snapshot of NLG evaluation practices 2005-2014. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), 57–60.
- Grusky, Naaman, and Artzi (2018) Grusky, M.; Naaman, M.; and Artzi, Y. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.
- Guan et al. (2021) Guan, J.; Zhang, Z.; Feng, Z.; Liu, Z.; Ding, W.; Mao, X.; Fan, C.; and Huang, M. 2021. OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics. arXiv:2105.08920.
- Hashimoto, Zhang, and Liang (2019) Hashimoto, T. B.; Zhang, H.; and Liang, P. 2019. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792.
- Hearst et al. (1998) Hearst, M. A.; Dumais, S. T.; Osuna, E.; Platt, J.; and Scholkopf, B. 1998. Support vector machines. IEEE Intelligent Systems and their applications, 13(4): 18–28.
- Herbrich, Minka, and Graepel (2006) Herbrich, R.; Minka, T.; and Graepel, T. 2006. TrueSkill™: a Bayesian skill rating system. Advances in neural information processing systems, 19.
- Howcroft et al. (2020) Howcroft, D. M.; Belz, A.; Clinciu, M.-A.; Gkatzia, D.; Hasan, S. A.; Mahamood, S.; Mille, S.; Van Miltenburg, E.; Santhanam, S.; and Rieser, V. 2020. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, 169–182.
- Hu et al. (2019) Hu, J. E.; Rudinger, R.; Post, M.; and Van Durme, B. 2019. Parabank: Monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6521–6528.
- Kasai et al. (2022) Kasai, J.; Sakaguchi, K.; Dunagan, L.; Morrison, J.; Bras, R. L.; Choi, Y.; and Smith, N. A. 2022. Transparent Human Evaluation for Image Captioning. In Proc. of NAACL.
- Kendall (1938) Kendall, M. G. 1938. A new measure of rank correlation. Biometrika, 30(1/2): 81–93.
- Kendall (1945) Kendall, M. G. 1945. The treatment of ties in ranking problems. Biometrika, 33(3): 239–251.
- Kondrak (2005) Kondrak, G. 2005. N-gram similarity and distance. In International symposium on string processing and information retrieval, 115–126. Springer.
- Lin (2004) Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74–81.
- Mehri and Eskenazi (2020) Mehri, S.; and Eskenazi, M. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456.
- Mendonça et al. (2021) Mendonça, V.; Rei, R.; Coheur, L.; Sardinha, A.; and Santos, A. L. 2021. Online learning meets machine translation evaluation: Finding the best systems with the least human effort. arXiv preprint arXiv:2105.13385.
- Mohankumar and Khapra (2022) Mohankumar, A. K.; and Khapra, M. M. 2022. Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8761–8781.
- Novikova et al. (2017) Novikova, J.; Dušek, O.; Curry, A. C.; and Rieser, V. 2017. Why we need new evaluation metrics for NLG. arXiv preprint arXiv:1707.06875.
- Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
- Peyrard (2019) Peyrard, M. 2019. A Simple Theoretical Model of Importance for Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1059–1073.
- Quinlan (1986) Quinlan, J. R. 1986. Induction of decision trees. Machine learning, 1(1): 81–106.
- Reiter and Belz (2009) Reiter, E.; and Belz, A. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4): 529–558.
- Sakaguchi et al. (2016) Sakaguchi, K.; Napoles, C.; Post, M.; and Tetreault, J. 2016. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4: 169–182.
- Sakaguchi, Post, and Van Durme (2014) Sakaguchi, K.; Post, M.; and Van Durme, B. 2014. Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 1–11.
- Sakaguchi and Van Durme (2018) Sakaguchi, K.; and Van Durme, B. 2018. Efficient online scalar annotation with bounded support. arXiv preprint arXiv:1806.01170.
- Seber and Lee (2012) Seber, G. A.; and Lee, A. J. 2012. Linear regression analysis. John Wiley & Sons.
- Shi et al. (2022) Shi, Y.; Yang, X.; Xu, H.; Yuan, C.; Li, B.; Hu, W.; and Zha, Z. 2022. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022.
- Stiennon et al. (2020) Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; and Christiano, P. F. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33: 3008–3021.
- Varshney, Mishra, and Baral (2022) Varshney, N.; Mishra, S.; and Baral, C. 2022. ILDAE: Instance-Level Difficulty Analysis of Evaluation Data. arXiv preprint arXiv:2203.03073.
- Völske et al. (2017) Völske, M.; Potthast, M.; Syed, S.; and Stein, B. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, 59–63.
- Wan and Xiao (2008) Wan, X.; and Xiao, J. 2008. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 969–976.
- Wan and Yang (2007) Wan, X.; and Yang, J. 2007. CollabSum: exploiting multiple document clustering for collaborative single document summarizations. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 143–150.
- Wei, Kocmi, and Federmann (2022) Wei, J. T.-Z.; Kocmi, T.; and Federmann, C. 2022. Searching for a higher power in the human evaluation of MT. arXiv preprint arXiv:2210.11612.
- Yates (1948) Yates, F. 1948. Systematic sampling. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 241(834): 345–377.
- Yuan, Neubig, and Liu (2021) Yuan, W.; Neubig, G.; and Liu, P. 2021. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34: 27263–27277.
- Zhang et al. (2019) Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
- Zhao et al. (2019) Zhao, W.; Peyrard, M.; Liu, F.; Gao, Y.; Meyer, C. M.; and Eger, S. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622.
- Zhou et al. (2022) Zhou, K.; Blodgett, S. L.; Trischler, A.; Daumé III, H.; Suleman, K.; and Olteanu, A. 2022. Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications. arXiv preprint arXiv:2205.06828.
Appendix
Survey
We survey papers that report human evaluation to better understand current sampling practice in manual evaluation. First, we randomly selected 1404 papers published at ACL, EMNLP and COLING in the last two years, drawn from paperswithcode.com and the lists of accepted papers published by the conferences. We then browsed each paper by searching for the keywords ‘human’, ‘manual’ and ‘annotate’ and reading the context around each match. We find that 270 papers selected a subset of the test dataset for manual evaluation to save labor and cost. For these papers, we searched for the keyword ‘sample’ to identify the sampling method used for human evaluation. Random sampling is the most common sampling method, accounting for 60.7% of these papers, while the other 39.3% do not mention the sampling method they used. The number of papers using random sampling and unknown sampling methods in each conference is shown in Figure 5.
![Refer to caption](extracted/5661293/survey1.png)
Of these 270 papers with human evaluation, 179 are on NLG tasks, with text summarization being the most frequent at 22%. The proportion of these NLG papers on different tasks is shown in Figure 6. The lists of the 1404 papers surveyed and the 270 papers that selected a subset of the test dataset for human evaluation will be released.
![Refer to caption](extracted/5661293/survey2.png)
The survey reveals two major problems in how samples are currently selected for human evaluation. First, using random sampling to select samples can lead to unreliable human evaluation results, because different sampled subsets are likely to lead to different inter-system rankings. In addition, not disclosing the list of randomly sampled items leads to poor reproducibility of evaluation results. Second, up to 39.3% of the papers provide no information on how samples were selected for human evaluation, which lowers both the reliability and the reproducibility of their evaluation results. To standardize the human evaluation process, we recommend that future work report sampling information, including the sampling method and the list of evaluated samples. We also strongly recommend using our proposed Constrained Active Sampling Framework to sample evaluation subsets, which makes human evaluation more reliable and allows excellent systems to be retained.
Tasks and Datasets
Test Set
In Table 5, we report information on the NLG tasks and related datasets we used as the test set, including the number of human evaluation aspects, the number of NLG systems involved in each dataset, the sample size of each dataset, and the specific human evaluation aspects. The order of the human evaluation metrics in the experiments in this paper follows the order of the human metrics shown in Table 5. As Likert-scale comparisons are the most commonly reported type of evaluation (Card et al. 2020), we focus on Likert-scale datasets. For data preprocessing, we first discard samples that lack any of the system output, the human evaluation score, or the reference text. We then compute the automatic metric scores of the remaining samples.
Validation Set
We select the automatic metric for the preliminary phase, the regressor for the Learner, the number of phases and the associated sampling ratios on the validation set shown in Table 6. The validation set contains nine datasets and 41 NLG systems from two traditional NLG tasks, namely Data to Text and Paraphrase Generation. Since not all NLG systems on the E2E (Dušek, Novikova, and Rieser 2018) and ParaBank (Hu et al. 2019) datasets have human evaluation scores on the same samples, we divide the ParaBank datasets into subsets in which every sample has human scores for all systems. To select the subsets, we first enumerate combinations of systems as candidate system subsets and, for each candidate, count the samples that have human evaluation scores for every system in it. We then select the system subsets with more such samples. The final system IDs of the selected subsets are shown in Table 6.
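The subset selection described above can be sketched as follows. This is an illustrative reading rather than the authors' script: the name `best_system_subset`, the input layout, and the choice to fix the subset size in advance are assumptions.

```python
from itertools import combinations


def best_system_subset(scored, subset_size):
    """Pick the system subset with the most commonly scored samples.

    scored: dict mapping system ID -> set of sample IDs that have a
            human evaluation score for that system.
    Returns (systems, shared_samples), or None if no subset of the
    requested size exists.
    """
    best = None
    for combo in combinations(sorted(scored), subset_size):
        # Samples usable for this subset must be scored for every system in it.
        shared = set.intersection(*(scored[s] for s in combo))
        if best is None or len(shared) > len(best[1]):
            best = (combo, shared)
    return best


# Toy data with hypothetical system and sample IDs.
toy = {"0": {1, 2, 3, 4}, "2": {2, 3, 4, 5}, "30": {3, 4, 9}, "35": {7, 8}}
systems, samples = best_system_subset(toy, 2)
```

Exhaustive enumeration grows combinatorially with the number of systems, so for larger system pools a greedy or size-bounded search may be needed.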
Tasks | Datasets | # HE Metrics | # Systems | # Samples | HE Metrics |
---|---|---|---|---|---|
Summarization | SummEval (Fabbri et al. 2021) | 4 | 16 | 100 | coherence, consistency, fluency, relevance |
Summarization | REALSumm (Bhandari et al. 2020) | 1 | 24 | 100 | litepyramid-recall |
Summarization | NeR18 (Grusky, Naaman, and Artzi 2018) | 4 | 7 | 60 | coherence, fluency, informativeness, relevance |
Summarization | DialSummEval (Gao and Wan 2022) | 4 | 13 | 100 | consistency, relevance, fluency, coherence |
Summarization | OpenAI-axis1 (Stiennon et al. 2020; Völske et al. 2017) | 4 | 5 | 439 | accuracy, coherence, coverage, overall |
Summarization | OpenAI-axis2 (Stiennon et al. 2020; Völske et al. 2017) | 4 | 7 | 636 | accuracy, coherence, coverage, overall |
Summarization | OpenAI-CNN/DM1 (Stiennon et al. 2020; Völske et al. 2017) | 4 | 10 | 206 | accuracy, coherence, coverage, overall |
Summarization | OpenAI-CNN/DM3 (Stiennon et al. 2020; Völske et al. 2017) | 4 | 3 | 206 | accuracy, coherence, coverage, overall |
Machine Translation | newstest2020 en-de (Freitag et al. 2021) | 2 | 7 | 1066 | MQM, pSQM |
Machine Translation | newstest2020 cn-en (Freitag et al. 2021) | 2 | 8 | 1641 | MQM, pSQM |
Machine Translation | newstest2021 cn-en (Freitag et al. 2021) | 1 | 13 | 147 | MQM |
Dialogue Generation | Persona Chat (Mehri and Eskenazi 2020) | 6 | 3 | 60 | Understandable, Natural, Maintains Context, Interesting, Uses Knowledge, Overall Quality |
Story Generation | MANS-ROC (Guan et al. 2021) | 1 | 5 | 200 | overall |
Story Generation | MANS-WP (Guan et al. 2021) | 1 | 5 | 200 | overall |
Multi-Modal Generation | THUMB-MSCOCO (Kasai et al. 2022) | 1 | 5 | 500 | overall |
Multi-Modal Generation | VATEX-EVAL (Shi et al. 2022) | 1 | 6 | 3000 | consistency |
Overall | 16 | 44 | 137 | 8661 | - |
Tasks | Datasets | # HE Metrics | # Systems | # Samples | HE Metrics | System ID |
---|---|---|---|---|---|---|
Data to Text | E2E (Dušek, Novikova, and Rieser 2018) | 1 | 3 | 31 | naturalness | zhang, gong, tnt2 |
Paraphrase Generation | ParaBank1 (Hu et al. 2019) | 1 | 4 | 69 | overall | 0, 2, 30, 35 |
Paraphrase Generation | ParaBank2 (Hu et al. 2019) | 1 | 4 | 62 | overall | 0, 3, 24, 31 |
Paraphrase Generation | ParaBank3 (Hu et al. 2019) | 1 | 5 | 77 | overall | 4, 0, 6, 30, 35 |
Paraphrase Generation | ParaBank4 (Hu et al. 2019) | 1 | 5 | 84 | overall | 5, 0, 13, 29, 35 |
Paraphrase Generation | ParaBank5 (Hu et al. 2019) | 1 | 5 | 90 | overall | 6, 0, 20, 30, 35 |
Paraphrase Generation | ParaBank6 (Hu et al. 2019) | 1 | 5 | 82 | overall | 7, 0, 6, 29, 35 |
Paraphrase Generation | ParaBank7 (Hu et al. 2019) | 1 | 5 | 77 | overall | 9, 0, 13, 32, 35 |
Paraphrase Generation | ParaBank8 (Hu et al. 2019) | 1 | 5 | 64 | overall | 10, 0, 3, 27, 35 |
Overall | 9 | 9 | 41 | 636 | - | - |
Learner Selection
Practical Recommendation
We explore the learning stability and accuracy of nine popular statistical machine learning algorithms as the regressor of the Learner. We replace the core algorithm of the Learner in CASF and carry out experiments on 16 datasets under five NLG tasks, involving 137 NLG systems and 44 human evaluation metrics. The sampling rate is 50%. The experimental results are shown in Table 7. Each experiment is run three times, and we record the average inter-system ranking accuracy and its variance over the three runs for each regressor. The stability of a regressor can be judged from the recorded variance.
The nine popular statistical machine learning regressors are Linear Regressor (Seber and Lee 2012), AdaBoost (Freund, Schapire et al. 1996), Bootstrap aggregating (Bagging) (Breiman 1996), Decision Tree (Quinlan 1986), Extremely Randomized Trees (ExtRaTree) (Geurts, Ernst, and Wehenkel 2006), K-Nearest Neighbor (KNN) (Cover and Hart 1967), Random Forest (Breiman 2001), support vector machine (SVM) (Hearst et al. 1998) and Gradient Boosting Decision Tree (GBDT) (Friedman 2001). We use the implementation of the corresponding statistical machine learning regressors in the sklearn library.
Regarding stability, we do not want the proposed sampling method to produce different results each time it is run, as random sampling does. Therefore, we should choose a stable regressor as the core algorithm of the Learner. The results in Table 7 show that linear regression, KNN, SVM and GBDT are stable: the variance of their inter-system ranking accuracy over the three runs is 0. In terms of inter-system ranking accuracy, GBDT is the best, reaching a Kendall’s correlation of 0.83. Based on these results, we recommend GBDT as the core regression algorithm of the Learner in the proposed CASF.
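The accuracy and stability figures discussed above can be reproduced in spirit with a minimal sketch. It assumes a plain pair-counting Kendall correlation without tie correction and a population standard deviation across runs; the names `kendall_tau` and `summarize_runs` are hypothetical, and the paper's exact implementation may differ.

```python
from itertools import combinations
from statistics import mean, pstdev


def kendall_tau(full_scores, subset_scores):
    """Kendall correlation between two system rankings.

    Both arguments map system name -> mean human score, computed on the
    full dataset and on the sampled subset respectively.
    Pairs tied in either ranking count as neither concordant nor discordant.
    """
    systems = sorted(full_scores)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        sign = (full_scores[a] - full_scores[b]) * (subset_scores[a] - subset_scores[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def summarize_runs(taus):
    """Mean and population standard deviation of the accuracy over runs."""
    return mean(taus), pstdev(taus)


# Three hypothetical runs on a three-system dataset.
run_mean, run_std = summarize_runs([1.0, 1.0, 1 / 3])
```

For three systems, runs with correlations of 1, 1 and 1/3 give a mean of about 0.778 and a population standard deviation of about 0.314, the same pair of values that appears in several three-system rows of Table 7. This is consistent with, though it does not confirm, the Std columns being population standard deviations. A deterministic regressor yields identical runs and therefore a spread of exactly 0.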
Learner Selection in This Paper
To select the regressor for the Learner in this paper, we conduct similar experiments on the validation set. The experimental results in Table 8 show that GBDT obtains the highest inter-system ranking accuracy and stability, reaching a Kendall’s correlation of 1.000 with zero fluctuation. Based on these results and analysis, we choose GBDT as the core regression algorithm of the Learner in the proposed CASF in this paper.
Task | Dataset | HE Metric | Linear Mean | Linear Std | AdaBoost Mean | AdaBoost Std | Bagging Mean | Bagging Std | DecisionTree Mean | DecisionTree Std | ExtRaTree Mean | ExtRaTree Std | KNN Mean | KNN Std | Random Forest Mean | Random Forest Std | SVM Mean | SVM Std | GBDT Mean | GBDT Std |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SUM | SummEval | coherence | 0.617 | 0.000 | 0.628 | 0.150 | 0.583 | 0.191 | 0.622 | 0.136 | 0.672 | 0.198 | 0.650 | 0.000 | 0.522 | 0.142 | 0.717 | 0.000 | 0.950 | 0.000 |
SUM | SummEval | consistency | 0.600 | 0.000 | 0.567 | 0.054 | 0.478 | 0.021 | 0.494 | 0.093 | 0.494 | 0.087 | 0.567 | 0.000 | 0.472 | 0.123 | 0.117 | 0.000 | 0.533 | 0.000 |
SUM | SummEval | fluency | 0.467 | 0.000 | 0.406 | 0.034 | 0.300 | 0.072 | 0.311 | 0.034 | 0.256 | 0.165 | 0.200 | 0.000 | 0.378 | 0.021 | 0.367 | 0.000 | 0.333 | 0.000 |
SUM | SummEval | relevance | 0.750 | 0.000 | 0.472 | 0.122 | 0.450 | 0.072 | 0.611 | 0.244 | 0.461 | 0.162 | 0.817 | 0.000 | 0.567 | 0.167 | 0.383 | 0.000 | 0.817 | 0.000 |
SUM | REALSumm | litepyramid | 0.399 | 0.000 | 0.430 | 0.109 | 0.394 | 0.021 | 0.333 | 0.119 | 0.403 | 0.054 | 0.601 | 0.000 | 0.529 | 0.041 | 0.464 | 0.000 | 0.543 | 0.000 |
SUM | NeR18 | coherence | 1.000 | 0.000 | 0.841 | 0.224 | 1.000 | 0.000 | 1.000 | 0.000 | 0.968 | 0.045 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | NeR18 | fluency | 1.000 | 0.000 | 0.841 | 0.224 | 0.968 | 0.045 | 0.968 | 0.045 | 0.968 | 0.045 | 1.000 | 0.000 | 0.619 | 0.206 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | NeR18 | informativeness | 0.714 | 0.000 | 1.000 | 0.000 | 0.905 | 0.135 | 1.000 | 0.000 | 0.873 | 0.119 | 1.000 | 0.000 | 0.873 | 0.119 | 0.714 | 0.000 | 1.000 | 0.000 |
SUM | NeR18 | relevance | 0.905 | 0.000 | 0.968 | 0.045 | 0.905 | 0.000 | 0.651 | 0.250 | 1.000 | 0.000 | 0.905 | 0.000 | 0.778 | 0.250 | 0.619 | 0.000 | 1.000 | 0.000 |
SUM | DialSummEval | consistency | 0.513 | 0.000 | 0.632 | 0.074 | 0.726 | 0.032 | 0.752 | 0.024 | 0.675 | 0.151 | 0.872 | 0.000 | 0.547 | 0.044 | 0.615 | 0.000 | 0.769 | 0.000 |
SUM | DialSummEval | relevance | 0.564 | 0.000 | 0.624 | 0.212 | 0.744 | 0.126 | 0.607 | 0.099 | 0.667 | 0.126 | 0.564 | 0.000 | 0.496 | 0.024 | 0.538 | 0.000 | 0.718 | 0.000 |
SUM | DialSummEval | fluency | 0.897 | 0.000 | 0.598 | 0.157 | 0.538 | 0.117 | 0.735 | 0.128 | 0.504 | 0.281 | 0.385 | 0.000 | 0.658 | 0.103 | 0.744 | 0.000 | 0.615 | 0.000 |
SUM | DialSummEval | coherence | 0.795 | 0.000 | 0.684 | 0.067 | 0.590 | 0.091 | 0.786 | 0.067 | 0.632 | 0.154 | 0.641 | 0.000 | 0.556 | 0.053 | 0.846 | 0.000 | 0.897 | 0.000 |
SUM | OpenAI-axis1 | accuracy | 0.000 | 0.000 | 0.400 | 0.432 | 0.400 | 0.283 | 0.267 | 0.377 | 0.333 | 0.340 | 0.000 | 0.000 | 0.400 | 0.432 | 0.200 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-axis1 | coherence | 0.400 | 0.000 | 0.467 | 0.249 | 0.400 | 0.283 | 0.467 | 0.249 | 0.400 | 0.283 | 0.000 | 0.000 | 0.667 | 0.340 | 0.800 | 0.000 | 0.800 | 0.000 |
SUM | OpenAI-axis1 | coverage | 1.000 | 0.000 | 0.933 | 0.094 | 1.000 | 0.000 | 1.000 | 0.000 | 0.933 | 0.094 | 1.000 | 0.000 | 1.000 | 0.000 | 0.800 | 0.000 | 0.800 | 0.000 |
SUM | OpenAI-axis1 | overall | 1.000 | 0.000 | 0.933 | 0.094 | 0.867 | 0.094 | 1.000 | 0.000 | 0.933 | 0.094 | 1.000 | 0.000 | 0.933 | 0.094 | 0.800 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-axis2 | accuracy | 0.714 | 0.000 | 0.873 | 0.119 | 0.397 | 0.196 | 0.810 | 0.269 | 0.810 | 0.206 | 0.619 | 0.000 | 0.556 | 0.119 | 0.714 | 0.000 | 0.905 | 0.000 |
SUM | OpenAI-axis2 | coherence | 0.905 | 0.000 | 0.270 | 0.119 | 0.524 | 0.156 | 0.587 | 0.314 | 0.492 | 0.273 | 0.238 | 0.000 | 0.397 | 0.119 | 0.429 | 0.000 | 0.429 | 0.000 |
SUM | OpenAI-axis2 | coverage | 1.000 | 0.000 | 0.968 | 0.045 | 0.905 | 0.135 | 1.000 | 0.000 | 0.968 | 0.045 | 0.619 | 0.000 | 0.968 | 0.045 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-axis2 | overall | 0.905 | 0.000 | 0.968 | 0.045 | 0.841 | 0.162 | 0.873 | 0.119 | 0.873 | 0.180 | 0.714 | 0.000 | 0.841 | 0.162 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-CNN/DM1 | accuracy | 0.956 | 0.000 | 0.822 | 0.131 | 0.896 | 0.147 | 0.896 | 0.147 | 0.807 | 0.042 | 0.644 | 0.000 | 0.837 | 0.091 | 0.956 | 0.000 | 0.867 | 0.000 |
SUM | OpenAI-CNN/DM1 | coherence | 0.822 | 0.000 | 0.556 | 0.181 | 0.407 | 0.137 | 0.837 | 0.171 | 0.407 | 0.302 | 0.600 | 0.000 | 0.333 | 0.063 | 0.244 | 0.000 | 0.600 | 0.000 |
SUM | OpenAI-CNN/DM1 | coverage | 1.000 | 0.000 | 0.837 | 0.230 | 0.911 | 0.063 | 0.956 | 0.063 | 0.911 | 0.063 | 0.867 | 0.000 | 0.748 | 0.267 | 1.000 | 0.000 | 0.867 | 0.000 |
SUM | OpenAI-CNN/DM1 | overall | 1.000 | 0.000 | 0.837 | 0.230 | 0.630 | 0.168 | 0.630 | 0.267 | 0.822 | 0.063 | 0.511 | 0.000 | 0.837 | 0.230 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-CNN/DM3 | accuracy | 1.000 | 0.000 | 1.000 | 0.000 | 0.556 | 0.314 | 0.778 | 0.314 | 0.556 | 0.314 | 1.000 | 0.000 | 0.778 | 0.314 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-CNN/DM3 | coherence | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-CNN/DM3 | coverage | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 0.778 | 0.314 | 1.000 | 0.000 | 1.000 | 0.000 |
SUM | OpenAI-CNN/DM3 | overall | 1.000 | 0.000 | 0.556 | 0.314 | 1.000 | 0.000 | 0.778 | 0.314 | 0.778 | 0.314 | 1.000 | 0.000 | 0.556 | 0.314 | 1.000 | 0.000 | 1.000 | 0.000 |
MT | newstest2020 en-de | MQM | 0.714 | 0.000 | 0.556 | 0.324 | 0.524 | 0.339 | 0.429 | 0.467 | 0.746 | 0.359 | 0.333 | 0.000 | 0.492 | 0.367 | 1.000 | 0.000 | 0.143 | 0.000 |
MT | newstest2020 en-de | pSQM | 1.000 | 0.000 | 0.968 | 0.045 | 0.937 | 0.045 | 0.683 | 0.384 | 1.000 | 0.000 | 1.000 | 0.000 | 0.937 | 0.045 | 0.905 | 0.000 | 1.000 | 0.000 |
MT | newstest2020 cn-en | MQM | 1.000 | 0.000 | 0.619 | 0.243 | 0.714 | 0.404 | 0.667 | 0.269 | 0.476 | 0.332 | 0.286 | 0.000 | 0.452 | 0.388 | 0.214 | 0.000 | 0.929 | 0.000 |
MT | newstest2020 cn-en | pSQM | 0.786 | 0.000 | 0.262 | 0.410 | 0.690 | 0.243 | 0.476 | 0.221 | 0.548 | 0.321 | 0.929 | 0.000 | 0.524 | 0.337 | 0.071 | 0.000 | 0.786 | 0.000 |
MT | newstest2021 cn-en | MQM | 0.026 | 0.000 | 0.368 | 0.119 | 0.376 | 0.094 | 0.120 | 0.067 | 0.017 | 0.250 | 0.000 | 0.000 | -0.060 | 0.169 | -0.077 | 0.000 | 0.026 | 0.000 |
DialoGen | Persona Chat | Understandable | 1.000 | 0.000 | 0.778 | 0.314 | 0.556 | 0.314 | 0.778 | 0.314 | 0.333 | 0.943 | 1.000 | 0.000 | 1.000 | 0.000 | 0.333 | 0.000 | 0.333 | 0.000 |
DialoGen | Persona Chat | Natural | 1.000 | 0.000 | 0.111 | 0.629 | 0.333 | 0.544 | 0.111 | 0.831 | 1.000 | 0.000 | 1.000 | 0.000 | 0.556 | 0.314 | 0.333 | 0.000 | 1.000 | 0.000 |
DialoGen | Persona Chat | Maintains Context | 1.000 | 0.000 | 0.333 | 0.943 | 1.000 | 0.000 | 0.333 | 0.943 | 1.000 | 0.000 | 1.000 | 0.000 | 0.333 | 0.943 | 1.000 | 0.000 | 1.000 | 0.000 |
DialoGen | Persona Chat | Interesting | 1.000 | 0.000 | 1.000 | 0.000 | 0.778 | 0.314 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
DialoGen | Persona Chat | Uses Knowledge | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
DialoGen | Persona Chat | Overall Quality | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
StoryGen | MANS-ROC | overall | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
StoryGen | MANS-WP | overall | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
MMGen | THUMB-MSCOCO | overall | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
MMGen | VATEX-EVAL | consistency | 1.000 | 0.000 | 0.867 | 0.189 | 0.867 | 0.189 | 0.867 | 0.189 | 0.733 | 0.189 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
Overall Performance | - | - | 0.828 | 0.000 | 0.727 | 0.158 | 0.729 | 0.126 | 0.732 | 0.171 | 0.738 | 0.150 | 0.740 | 0.000 | 0.701 | 0.154 | 0.724 | 0.000 | 0.833 | 0.000 |
Task | Dataset | HE Metric | Linear Mean | Linear Std | AdaBoost Mean | AdaBoost Std | Bagging Mean | Bagging Std | DecisionTree Mean | DecisionTree Std | ExtRaTree Mean | ExtRaTree Std | KNN Mean | KNN Std | Random Forest Mean | Random Forest Std | SVM Mean | SVM Std | GBDT Mean | GBDT Std |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Data to Text | E2E | naturalness | 1.000 | 0.000 | 0.556 | 0.629 | 0.556 | 0.629 | -0.111 | 0.314 | 0.333 | 0.544 | 1.000 | 0.000 | 0.111 | 0.629 | 1.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank1 | overall | 0.667 | 0.000 | 0.333 | 0.471 | 0.333 | 0.272 | 0.333 | 0.471 | 0.444 | 0.416 | 0.667 | 0.000 | 0.889 | 0.157 | 0.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank2 | overall | 1.000 | 0.000 | 1.000 | 0.000 | 0.667 | 0.471 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank3 | overall | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 0.667 | 0.471 | 0.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank4 | overall | 0.667 | 0.000 | 0.556 | 0.416 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 0.889 | 0.157 | 1.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank5 | overall | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank6 | overall | 1.000 | 0.000 | 1.000 | 0.000 | 0.889 | 0.157 | 0.889 | 0.157 | 1.000 | 0.000 | 1.000 | 0.000 | 0.667 | 0.471 | 1.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank7 | overall | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
Paraphrase Generation | ParaBank8 | overall | 0.000 | 0.000 | 0.667 | 0.471 | 0.889 | 0.157 | 0.667 | 0.471 | 0.556 | 0.314 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
Overall Performance | - | - | 0.815 | 0.000 | 0.790 | 0.221 | 0.815 | 0.187 | 0.753 | 0.157 | 0.815 | 0.142 | 0.963 | 0.000 | 0.802 | 0.210 | 0.778 | 0.000 | 1.000 | 0.000 |
Automatic Metrics for Preliminary Sampling Phase
Practical Recommendation
To determine which automatic metric is best suited to measuring sample quality in the preliminary sampling phase, we replace the automatic metric used by CASF in that phase and compare the results. We compute each metric in the selected NLG automatic metric set using its officially provided code. Table 10 shows the full results of the proposed Constrained Active Sampling Framework on 44 human evaluation metrics from 5 NLG tasks when pre-ranking with different automatic metrics. According to Table 10, MOVER-SCORE (Zhao et al. 2019) ranks first in inter-system ranking accuracy on 64% of the human evaluation metrics. MOVER-SCORE also ranks first in overall inter-system ranking accuracy across the 16 datasets, so we recommend using MOVER-SCORE to measure sample quality in the preliminary phase.
The top-ranked system recognition accuracy in Table 9 shows that MOVER-SCORE has the best recognition performance on the summarization, dialogue generation, story generation and multi-modal generation tasks, while its recognition accuracy on the machine translation task is the second-best among the 8 automatic metrics. MOVER-SCORE has an average top-ranked system recognition accuracy of 93.18% across all 44 human evaluation metrics on 16 datasets, involving 137 NLG systems. These results further indicate that MOVER-SCORE is the more suitable measure of sample quality in the preliminary sampling phase of CASF.
Automatic Metric | SUM | MT | DialoGen | StoryGen | MMGen | Overall |
---|---|---|---|---|---|---|
BERT-SCORE | 0.8276 | 0.8000 | 0.8333 | 1.0000 | 1.0000 | 0.8409 |
MOVER-SCORE | 0.9310 | 0.8000 | 1.0000 | 1.0000 | 1.0000 | 0.9318 |
ROUGE-1 | 0.7586 | 1.0000 | 0.8333 | 1.0000 | 1.0000 | 0.8182 |
ROUGE-2 | 0.7931 | 1.0000 | 0.8333 | 1.0000 | 1.0000 | 0.8409 |
ROUGE-L | 0.7241 | 1.0000 | 0.8333 | 1.0000 | 1.0000 | 0.7955 |
BART-SCORE | 0.8621 | 1.0000 | 1.0000 | 0.5000 | 1.0000 | 0.8864 |
BLEU | 0.8621 | 1.0000 | 0.8333 | 1.0000 | 1.0000 | 0.8864 |
METEOR | 0.7931 | 1.0000 | 0.8333 | 1.0000 | 1.0000 | 0.8409 |
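The relation between the per-task columns and the Overall column can be checked with a short calculation. The sketch below assumes that the overall accuracy is the average of the per-task accuracies weighted by the number of human evaluation metrics in each task. The per-task counts (29, 5, 6, 2, 2) are tallied from the rows of the per-metric results table and are an assumption, not figures stated in the text; all names are illustrative.

```python
# Per-task top-ranked system recognition accuracy for MOVER-SCORE,
# copied from the MOVER-SCORE row of the table above.
mover_accuracy = {
    "SUM": 0.9310,
    "MT": 0.8000,
    "DialoGen": 1.0000,
    "StoryGen": 1.0000,
    "MMGen": 1.0000,
}

# Assumed number of human evaluation metrics per task (tallied from the
# per-metric results table; not stated explicitly in the text).
metrics_per_task = {"SUM": 29, "MT": 5, "DialoGen": 6, "StoryGen": 2, "MMGen": 2}


def overall_accuracy(accuracy, counts):
    """Average of per-task accuracies, weighted by metrics per task."""
    total = sum(counts.values())
    return sum(accuracy[task] * counts[task] for task in counts) / total


overall = overall_accuracy(mover_accuracy, metrics_per_task)
print(f"{overall:.4f}")  # prints 0.9318
```

Under this assumption the result agrees with the reported 93.18%, which corresponds to the top-ranked system being recognized on 41 of the 44 human evaluation metrics.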
Task | Dataset | HE Metric | BERT-SCORE | MOVER-SCORE | ROUGE-1 | ROUGE-2 | ROUGE-L | BART-SCORE | BLEU | METEOR |
Summarization | SummEval | coherence | 0.5333 | 0.9500 | 0.7333 | 0.7000 | 0.6333 | 0.1667 | 0.8167 | 0.5500 |
consistency | -0.0167 | 0.5333 | 0.5833 | 0.3500 | 0.6500 | 0.4833 | 0.2833 | 0.5667 | ||
fluency | 0.2500 | 0.3333 | 0.5833 | 0.2667 | 0.3500 | 0.2500 | 0.0500 | 0.1667 | ||
relevance | 0.6167 | 0.8167 | 0.7000 | 0.3833 | 0.5333 | 0.3833 | 0.6833 | 0.6500 | ||
REALSumm | litepyramid | 0.4928 | 0.5435 | 0.3333 | 0.4493 | 0.5507 | 0.3841 | 0.5217 | 0.4493 | |
NeR18 | coherence | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | |
fluency | 1.0000 | 1.0000 | 1.0000 | 0.5238 | 1.0000 | 0.9048 | 1.0000 | 1.0000 | ||
informativeness | 0.7143 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
relevance | 1.0000 | 1.0000 | 0.9048 | 0.9048 | 0.9048 | 0.4286 | 1.0000 | 1.0000 | ||
DialSummEval | consistency | 0.7179 | 0.7692 | 0.5128 | 0.6667 | 0.4615 | 0.6923 | 0.6923 | 0.8974 | |
relevance | 0.5385 | 0.7179 | 0.4359 | 0.6923 | 0.6154 | 0.8718 | 0.4872 | 0.6923 | ||
fluency | 0.6410 | 0.6154 | 0.8718 | 0.5641 | 0.3846 | 0.5385 | 0.5897 | 0.4359 | ||
coherence | 0.6154 | 0.8974 | 0.5897 | 0.5128 | 0.6923 | 0.5128 | 0.3846 | 0.5897 | ||
OpenAI-axis1 | accuracy | 0.0000 | 1.0000 | 0.2000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000 | |
coherence | 0.8000 | 0.8000 | 0.4000 | 0.2000 | 0.0000 | 0.0000 | 0.2000 | 1.0000 | ||
coverage | 1.0000 | 0.8000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.8000 | ||
overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.8000 | ||
OpenAI-axis2 | accuracy | 0.6190 | 0.9048 | 0.6190 | 1.0000 | 0.4286 | 0.6190 | 0.7143 | 0.7143 | |
coherence | 0.7143 | 0.4286 | 0.3333 | 0.6190 | 0.7143 | 0.0476 | 1.0000 | 0.2381 | ||
coverage | 1.0000 | 1.0000 | 1.0000 | 0.9048 | 0.9048 | 0.7143 | 1.0000 | 0.6190 | ||
overall | 1.0000 | 1.0000 | 0.9048 | 1.0000 | 1.0000 | 0.7143 | 1.0000 | 1.0000 | ||
OpenAI-CNN/DM1 | accuracy | 0.9556 | 0.8667 | 0.7778 | 1.0000 | 0.7778 | 0.6889 | 0.7778 | 1.0000 | |
coherence | 0.9556 | 0.6000 | 0.5111 | 1.0000 | 0.6000 | 0.0667 | 0.2889 | 0.5111 | ||
coverage | 0.8667 | 0.8667 | 1.0000 | 1.0000 | 1.0000 | 0.6444 | 0.8667 | 1.0000 | ||
overall | 1.0000 | 1.0000 | 0.8667 | 1.0000 | 1.0000 | 1.0000 | 0.5111 | 1.0000 | ||
OpenAI-CNN/DM3 | accuracy | 0.3333 | 1.0000 | 1.0000 | 0.3333 | 0.3333 | 0.3333 | 1.0000 | 1.0000 | |
coherence | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
coverage | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
overall | 1.0000 | 1.0000 | 0.3333 | 1.0000 | 0.3333 | 1.0000 | 1.0000 | 1.0000 | ||
MT | newstest2020 en-de | MQM | 1.0000 | 0.1429 | 1.0000 | 0.1429 | 0.1429 | 0.3333 | 0.3333 | 0.3333 |
pSQM | 0.9048 | 1.0000 | 0.9048 | 1.0000 | 0.9048 | 0.9048 | 1.0000 | 0.9048 | ||
newstest2020 cn-en | MQM | 0.7857 | 0.9286 | 0.2143 | 0.7143 | 0.6429 | 0.1429 | 0.2143 | 0.7143 | |
pSQM | 0.2857 | 0.7857 | 0.5000 | 0.7857 | 0.7857 | 0.2143 | 0.2857 | 0.2143 | ||
newstest2021 cn-en | MQM | -0.0769 | 0.0256 | 0.2308 | 0.1026 | 0.1282 | 0.5897 | 0.0256 | 0.5128 | |
Dialogue Generation | Persona Chat | Understandable | 1.0000 | 0.3333 | -1.0000 | 0.3333 | 1.0000 | 0.3333 | 0.3333 | 1.0000 |
Natural | -0.3333 | 1.0000 | 1.0000 | -0.3333 | 0.3333 | 1.0000 | -1.0000 | 0.3333 | ||
Maintains Context | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
Interesting | 0.3333 | 1.0000 | 1.0000 | 0.3333 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
Uses Knowledge | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
Overall Quality | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
Story Generation | MANS-ROC | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
MANS-WP | overall | 1.0000 | 1.0000 | -0.4000 | 1.0000 | 1.0000 | 0.8000 | 1.0000 | 1.0000 | |
Multi-Modal Generation | THUMB-MSCOCO | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
VATEX-EVAL | overall | 1.0000 | 1.0000 | 1.0000 | 0.6000 | 1.0000 | 0.6000 | 0.6000 | 1.0000 | |
Overall | 0.7329 | 0.8332 | 0.6965 | 0.6989 | 0.7456 | 0.6446 | 0.6741 | 0.7885 |
Task | Dataset | HE Metric | BERT-SCORE | MOVER-SCORE | ROUGE-1 | ROUGE-2 | ROUGE-L | BART-SCORE | BLEU | METEOR |
Data to Text | E2E | naturalness | 0.3333 | 1.0000 | 0.3333 | 1.0000 | 1.0000 | -0.3333 | 1.0000 | -0.3333 |
Paraphrase Generation | ParaBank1 | overall | 0.0000 | 1.0000 | 1.0000 | 0.6667 | 1.0000 | 1.0000 | 0.0000 | 1.0000 |
ParaBank2 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | |
ParaBank3 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | |
ParaBank4 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.6667 | 1.0000 | 0.6667 | 0.6667 | |
ParaBank5 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | |
ParaBank6 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | |
ParaBank7 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 1.0000 | |
ParaBank8 | overall | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.6667 | 0.0000 | 1.0000 | 1.0000 | |
Overall | 0.8148 | 1.0000 | 0.9259 | 0.9630 | 0.8148 | 0.7407 | 0.8519 | 0.8148 |
Automatic Metric Selection in This Paper
We conduct a similar experiment on the validation set to select the automatic metric for the preliminary phase of CASF in this paper. The experimental results in Table 11 show that MOVER-SCORE is capable of measuring sample quality in the preliminary phase, so we select MOVER-SCORE as the automatic metric for the preliminary phase of CASF.
Different Sampling Ratio
Table 12 shows the inter-system ranking accuracy under different sampling ratios. The full results for sampling half of the dataset are in Table 1. CASF achieves the best overall inter-system ranking accuracy of the three sampling methods at every sampling ratio, exceeding random sampling by 0.1133 Kendall correlation on average, while also avoiding the problems of clustered selection and data manipulation in human evaluation. Interestingly, the sampling ratio and inter-system ranking accuracy are sometimes negatively correlated (for example, between the 70% and 80% sampling ratios): accuracy decreases as the sampling ratio increases. This phenomenon may occur because some samples contribute nothing to the overall effect or contribute negatively; it is sometimes regarded as a sign of publication bias (Begg and Mazumdar 1994; Card et al. 2020). Overall, however, inter-system ranking accuracy increases with the sampling ratio.
Task | Dataset | Method | 90% | 80% | 70% | 60% | 40% | 30% | 20% | 10% |
---|---|---|---|---|---|---|---|---|---|---|
SUM | SummEval | Random | 0.6167 | 0.6236 | 0.5847 | 0.4625 | 0.4306 | 0.3306 | 0.3403 | 0.1097 |
Heuristic | 0.6403 | 0.5792 | 0.5403 | 0.5306 | 0.4042 | 0.3611 | 0.3069 | 0.0625 | ||
CASF (ours) | 0.5833 | 0.6417 | 0.5875 | 0.8167 | 0.5000 | 0.4917 | 0.4833 | -0.0708 | ||
REALSumm | Random | 0.6739 | 0.5580 | 0.4928 | 0.5338 | 0.2826 | 0.2874 | 0.3696 | 0.0411 | |
Heuristic | 0.7657 | 0.6715 | 0.4517 | 0.4324 | 0.2874 | 0.3382 | 0.2923 | 0.0242 | ||
CASF (ours) | 0.9565 | 0.7391 | 0.5797 | 0.5870 | 0.3116 | 0.4275 | 0.4565 | 0.1739 | ||
NeR18 | Random | 0.9762 | 1.0000 | 0.9286 | 0.9524 | 0.8492 | 0.6667 | 0.5714 | 0.3810 | |
Heuristic | 0.9524 | 0.9444 | 0.9762 | 0.8810 | 0.6746 | 0.5635 | 0.3016 | 0.1667 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9762 | 0.9524 | 0.2619 | 0.8095 | ||
DialSummEval | Random | 0.7350 | 0.6966 | 0.6880 | 0.5940 | 0.5919 | 0.4679 | 0.4038 | 0.4423 | |
Heuristic | 0.7714 | 0.6688 | 0.5641 | 0.6688 | 0.6261 | 0.5278 | 0.4103 | 0.4637 | ||
CASF (ours) | 0.8654 | 0.7308 | 0.6154 | 0.7115 | 0.7372 | 0.7564 | 0.4103 | 0.4872 | ||
OpenAI-axis1 | Random | 0.6000 | 0.6833 | 0.7000 | 0.7167 | 0.5500 | 0.7500 | 0.7500 | 0.6000 | |
Heuristic | 0.7333 | 0.7500 | 0.6500 | 0.7667 | 0.6333 | 0.5500 | 0.4667 | 0.6333 | ||
CASF (ours) | 0.7500 | 0.7500 | 1.0000 | 0.9000 | 0.9500 | 0.9500 | 0.9500 | 0.6500 | ||
OpenAI-axis2 | Random | 0.7540 | 0.7540 | 0.7698 | 0.5794 | 0.6270 | 0.5952 | 0.5397 | 0.3492 | |
Heuristic | 0.8095 | 0.6746 | 0.6746 | 0.6746 | 0.6349 | 0.6905 | 0.4127 | 0.3968 | ||
CASF (ours) | 0.8095 | 0.8571 | 0.9524 | 0.7381 | 0.7381 | 0.9524 | 0.9286 | 0.5000 | ||
OpenAI-CNN/DM1 | Random | 0.9667 | 0.8444 | 0.8111 | 0.8259 | 0.5926 | 0.6889 | 0.4741 | 0.6259 | |
Heuristic | 0.8519 | 0.7593 | 0.7111 | 0.7185 | 0.8185 | 0.6741 | 0.5667 | 0.4444 | ||
CASF (ours) | 0.8778 | 0.8778 | 0.7667 | 0.8444 | 0.8222 | 0.8222 | 0.7556 | 0.7333 | ||
OpenAI-CNN/DM3 | Random | 1.0000 | 0.9444 | 0.9444 | 0.9444 | 0.8333 | 0.6667 | 0.5556 | 0.4444 | |
Heuristic | 0.9444 | 0.8889 | 0.9444 | 0.8889 | 0.8333 | 0.6667 | 0.5000 | 0.2222 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.6667 | 0.6667 | 0.5000 | ||
MT | newstest2020 en-de | Random | 0.8571 | 0.8889 | 0.6984 | 0.7460 | 0.6349 | 0.6032 | 0.5714 | 0.0000 |
Heuristic | 0.7302 | 0.7778 | 0.7937 | 0.6032 | 0.4921 | 0.1587 | 0.4921 | 0.1587 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 0.6190 | 1.0000 | 0.2381 | 0.6190 | 0.1429 | ||
newstest2020 cn-en | Random | 0.2500 | 0.5952 | 0.3452 | 0.3333 | 0.4167 | 0.1667 | 0.3810 | 0.0000 | |
Heuristic | 0.7024 | 0.4643 | 0.5000 | 0.4405 | 0.3929 | -0.0595 | 0.2500 | -0.1190 | ||
CASF (ours) | 0.8929 | 0.8929 | 0.8571 | 0.6071 | -0.0855 | 0.2857 | 0.5000 | 0.5000 | ||
newstest2021 cn-en | Random | 0.4103 | 0.2051 | 0.2222 | 0.2564 | -0.0342 | -0.1624 | -0.1880 | -0.1966 | |
Heuristic | 0.0855 | 0.1282 | 0.1282 | 0.1966 | -0.0342 | -0.1197 | -0.0085 | -0.1624 | ||
CASF (ours) | 0.1026 | 0.0769 | 0.1282 | 0.4615 | 0.0769 | -0.0513 | -0.0513 | -0.1026 | ||
DialoGen | Persona Chat | Random | 0.9259 | 0.7037 | 0.7037 | 0.8148 | 0.4444 | 0.1852 | 0.0741 | -0.0370 |
Heuristic | 0.8519 | 0.7778 | 0.8148 | 0.6667 | 0.3333 | 0.4444 | 0.1481 | -0.2593 | ||
CASF (ours) | 1.0000 | 0.8889 | 1.0000 | 0.8889 | 0.6667 | 0.6667 | 0.8889 | 0.4444 | ||
StoryGen | MANS-ROC | Random | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9333 | 1.0000 | 0.6000 | -0.5333 |
Heuristic | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.4000 | -0.3333 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.8000 | -0.6000 | ||
MANS-WP | Random | 1.0000 | 1.0000 | 1.0000 | 0.9333 | 0.9333 | 1.0000 | 0.5333 | -0.2667 | |
Heuristic | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.3333 | 1.0000 | -0.3333 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.0000 | ||
MMGen | THUMB-MSCOCO | Random | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9333 | 1.0000 | 1.0000 | 0.9333 |
Heuristic | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.8000 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
VATEX-EVAL | Random | 1.0000 | 1.0000 | 0.8667 | 0.8667 | 0.7333 | 0.5111 | 0.6000 | 0.6444 | |
Heuristic | 1.0000 | 1.0000 | 0.8667 | 0.8667 | 0.7333 | 0.4667 | 0.5111 | 0.2889 | ||
CASF (ours) | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.4667 | 0.6000 | 1.0000 | 1.0000 | ||
Overall Performance | Random | 0.7979 | 0.7811 | 0.7347 | 0.7225 | 0.6095 | 0.5473 | 0.4735 | 0.2211 | |
Heuristic | 0.8024 | 0.7553 | 0.7260 | 0.7084 | 0.6144 | 0.4747 | 0.4406 | 0.1534 | ||
CASF (ours) | 0.8649 | 0.8409 | 0.8429 | 0.8234 | 0.6975 | 0.6724 | 0.6668 | 0.3855 |
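The average gap to random sampling and the non-monotonic behaviour discussed in this section can both be read off the Overall Performance rows of the table above. The following sketch uses only those two rows; the 50% ratio is not included because its results are reported separately, and the variable names are illustrative.

```python
# Overall Performance rows from the sampling-ratio table, ordered from the
# largest to the smallest sampling ratio (the 50% column is reported elsewhere).
ratios = ["90%", "80%", "70%", "60%", "40%", "30%", "20%", "10%"]
casf = [0.8649, 0.8409, 0.8429, 0.8234, 0.6975, 0.6724, 0.6668, 0.3855]
random_sampling = [0.7979, 0.7811, 0.7347, 0.7225, 0.6095, 0.5473, 0.4735, 0.2211]

# Average Kendall-correlation gap between CASF and random sampling.
gaps = [c - r for c, r in zip(casf, random_sampling)]
mean_gap = sum(gaps) / len(gaps)
print(f"{mean_gap:.4f}")  # prints 0.1133

# Adjacent ratio pairs where the larger ratio gives LOWER accuracy for CASF.
inversions = [
    (ratios[i], ratios[i + 1])
    for i in range(len(ratios) - 1)
    if casf[i] < casf[i + 1]
]
print(inversions)  # prints [('80%', '70%')]
```

On these overall figures, the only inversion for CASF is between the 80% and 70% sampling ratios; at every other step accuracy falls as the sampling ratio falls.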
Different Sampling Size
We also treat the sample size as an independent variable and run additional experiments. Table 13 shows the results for different sample sizes, with inter-system ranking accuracy measured by Kendall’s Tau. Random and Heuristic were each run 100 times, and their average inter-system ranking accuracy is reported as Random Mean and Heuristic Mean in Table 13. We also display three randomly selected runs of Random and of Heuristic in Table 13. The results show that different runs of random or heuristic sampling can yield different inter-system ranking accuracy, and that CASF outperforms the popular NLG human evaluation sampling methods Random and Heuristic at typical sample sizes. We conduct the experiments on datasets whose population size is larger than the sample size; the number of tasks (# Task), datasets (# Dataset), human evaluation aspects (# HE Metric), and systems (# System) involved for each sample size is shown in Table 14.
Sample Size | 50 | 100 | 150 | 200 | 250 | 300 |
---|---|---|---|---|---|---|
Random 1 | 0.6847 | 0.7478 | 0.5938 | 0.7758 | 0.5595 | 0.6639 |
Random 2 | 0.5838 | 0.6648 | 0.7547 | 0.6905 | 0.7105 | 0.6755 |
Random 3 | 0.6012 | 0.7346 | 0.7258 | 0.8062 | 0.6537 | 0.5058 |
Random Mean | 0.6496 | 0.7478 | 0.7167 | 0.7806 | 0.6596 | 0.6736 |
Heuristic 1 | 0.6460 | 0.7192 | 0.6210 | 0.7768 | 0.5432 | 0.6935 |
Heuristic 2 | 0.5434 | 0.6716 | 0.7542 | 0.7821 | 0.7575 | 0.6112 |
Heuristic 3 | 0.6599 | 0.7490 | 0.7058 | 0.6401 | 0.6435 | 0.5432 |
Heuristic Mean | 0.6476 | 0.7497 | 0.7137 | 0.7712 | 0.6412 | 0.6725 |
CASF (Ours) | 0.7156 | 0.7514 | 0.7757 | 0.8264 | 0.7706 | 0.7010 |
Sample Size | 50 | 100 | 150 | 200 | 250 | 300 |
---|---|---|---|---|---|---|
# Task | 5 | 4 | 4 | 4 | 3 | 3 |
# Dataset | 16 | 14 | 10 | 10 | 6 | 6 |
# HE Metric | 44 | 34 | 24 | 24 | 14 | 14 |
# System | 137 | 127 | 61 | 61 | 38 | 38 |
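To illustrate why separate runs of random sampling can give different inter-system ranking accuracy, the sketch below ranks systems by their mean human score on a random subset and compares that ranking with the full-data ranking using Kendall's tau. It is a minimal illustration under stated assumptions, not the evaluation code used for the tables: the human scores are synthetic, the system quality values are hypothetical, and the tau-a variant (no tie correction) is assumed.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two score lists (tied pairs count as zero)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def system_means(scores, indices):
    """Mean human score of each system over the given sample indices."""
    return [sum(row[i] for i in indices) / len(indices) for row in scores]


def random_sampling_tau(scores, sample_size, rng):
    """Tau between the full-data ranking and a random-subset ranking."""
    n = len(scores[0])
    full = system_means(scores, range(n))
    subset = rng.sample(range(n), sample_size)
    return kendall_tau(full, system_means(scores, subset))


# Synthetic data: 4 hypothetical systems, 500 samples each.
rng = random.Random(0)
system_quality = [0.50, 0.55, 0.60, 0.65]  # hypothetical true quality
scores = [[rng.gauss(q, 0.2) for _ in range(500)] for q in system_quality]

# Three independent random-sampling runs at sample size 50.
for run in range(1, 4):
    print(f"Random {run}: tau = {random_sampling_tau(scores, 50, rng):.4f}")
```

Each run draws a different subset, so the subset-based ranking, and hence tau, can change from run to run, which is why the table reports both individual runs and the mean over repeated runs.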
Significant Information Retention Accuracy
We used the common Wilcoxon signed-rank test (Demšar 2006) to evaluate how well each method identifies statistically significant differences between systems on the test set. When sampling 50% of the dataset, the overall significant information retention accuracy of CASF, Random Sampling and Heuristic Sampling (both iterated 10000 times) over 44 aspects was 0.6030, 0.5992 and 0.5976 at the level, and 0.4344, 0.4156 and 0.4138 at level. These results show that CASF outperforms the popular Random Sampling and Heuristic Sampling methods.
Limitations and Future Work
Accurate and reliable evaluation of models is an important aspect of NLG research and practical applications. We make human evaluation more reliable with limited annotation cost and labor. However, some limitations remain. On the one hand, the quality of samples is predicted by the Learner from features based on automatic metrics, which are easy to calculate in practice. The information in automatic metrics may not be comprehensive enough to represent the quality of a sample. Future work could introduce characteristics of the samples themselves, such as the length of the generated text and its lexical complexity, to make the estimate of sample quality more comprehensive. Similarly, future work could take more information about redundancy into account. On the other hand, since reliable human evaluation is important for NLP tasks that lack reliable automatic metrics, we focus on the problem of reliable human evaluation in NLG tasks. However, CASF may be applicable to other NLP tasks, and we would like to extend it to more NLP tasks in future work. Finally, because training the Learner requires a certain number of samples, our approach may not be applicable when the sample size is extremely small, such as fewer than 50 samples. When the sample size is that small, evaluation costs are typically lower, and a full-scale assessment could be considered instead.