Evaluate What You Can’t Evaluate: Unassessable Quality for Generated Response

Yongkang Liu^1,2,3, Shi Feng¹, Daling Wang¹, Yifei Zhang¹, Hinrich Schütze^2,3
¹ Northeastern University, China
² Center for Information and Language Processing, LMU Munich
³ Munich Center for Machine Learning (MCML), LMU Munich
[email protected], {fengshi,wangdaling,zhangyifei}@cse.neu.edu.cn

Abstract

LLMs (large language models) like ChatGPT have demonstrated exceptional language comprehension and generation abilities. While reference-free evaluators grounded in LLMs exhibit superior human alignment compared to traditional reference-based evaluators, the utilization of such evaluators poses several challenges. Reference-free evaluators are better suited for open-ended examples with different possible responses, but not all examples are open-ended. For closed-ended examples with unique correct semantic response, reference-free evaluators may still consider it high quality, even if the given response contradicts the facts and semantics of dialogue history. To provide a comprehensive assessment of the reliability of evaluators based on LLMs, we have created two adversarial meta-evaluation dialogue generation datasets: KdConv-ADV, derived from KdConv, and DSTC7-ADV, derived from DSTC7-AVSD. Compared to previous meta-evaluation benchmarks, both KdConv-ADV and DSTC7-ADV present greater challenges since they contain lots of closed-ended examples and adversarial instances derived from references. Experimental results reveal that reference-free evaluators based on LLMs are a reliable alternative to reference-based evaluators on tasks that do not involve external knowledge. Reference-free evaluators tend to overestimate the quality of the text and are still deficient in distinguishing text quality.

1 Introduction

The evaluation of generated response quality using reference-based metrics has faced criticism from researchers Liu et al. (2016). The primary reason behind this criticism stems from the fact that reference-based evaluation metrics, such as BLEU Papineni et al. (2002), ROUGE Lin (2004), and METEOR Banerjee and Lavie (2005) consider candidates with high similarity with reference responses as indication of high quality, which contradicts the semantic and expression diversity present in the responses. Therefore, reference-based metrics fail to fairly evaluate different reasonable responses, leading to a low correlation with human judgments Liu et al. (2016); Sedoc et al. (2019); Liu et al. (2023).

Refer to caption — Figure 1: Evaluation examples of ChatGPT. The correct response semantic for this example is unique. The reference response is I checked and his hometown should be Düsseldorf, Germany.

Given the remarkable language understanding and generation capabilities demonstrated by LLMs Kocoń et al. (2023); Frieder et al. (2023); Huang et al. (2023); Qin et al. (2023); Rao et al. (2023) like ChatGPT Ouyang et al. (2022), LLaMA Touvron et al. (2023), and GPT-4 OpenAI (2023), recent studies have suggested leveraging these models as reference-free evaluators for assessing the quality of generated text Fu et al. (2023); Wang et al. (2023a); Liu et al. (2023). Different from reference-based evaluators, reference-free evaluators employ LLMs to score the generated responses according to different instructions without any reference target, which can address the problem of reference-based evaluators using the reference as the sole criterion.

Although researches Fu et al. (2023); Wang et al. (2023a); Liu et al. (2023) show that reference-free evaluators demonstrate better human agreement, their reliability is questionable. The primary reason is that previous studies Fu et al. (2023); Wang et al. (2023a); Liu et al. (2023) have not conducted a comprehensive evaluation of reference-free evaluators. As shown in Table 5, the benchmark datasets Topical-Chat Gopalakrishnan et al. (2020) and Persona-Chat Zhang et al. (2018a) utilized in existing works predominantly consist of open-ended examples with different semantic responses, lacking evaluations for closed-ended examples with unique correct semantic responses. The conversational context of open-ended examples is unrestricted, and the semantics of the corresponding responses are broad, or even arbitrary. The results only on open-ended examples do not truly reflect the accuracy and objectivity of evaluators. The broad semantic space of responses results in many candidates being equally reasonable. It is unexplored whether evaluators have the ability to distinguish nuances in the quality of responses. As long as the evaluator gives high scores or similar scores in most cases, it will be considered reliable. Closed-ended examples differ from open-ended examples in that their conversation context is semantically restricted, either derived from external knowledge or the dialogue history. The limitation makes the semantics of responses unique, which makes evaluators able to provide reasonable judgments only when it correctly understands the underlying limitation. Therefore, closed-ended examples can better reflect the quality of evaluators than open-ended examples.

A closed-ended example is provided in Figure 1, where "the hometown of Wim Wenders is Düsseldorf, Germany" represents the sole accurate candidate semantic for this closed-ended instance. Despite two unreasonable responses (i.e., "… Munich, Germany" and "… Stuttgart, Germany") that are inconsistent with the fact (i.e., the hometown of Wim Wenders is Düsseldorf, Germany), ChatGPT still gave a high score in terms of consistency dimension (i.e., 0.9 and 0.85). When evaluating solely on open-ended examples, evaluators with significant high scoring biases may be erroneously perceived as exhibiting stronger agreement with humans, owing to the expansive semantic possibilities within the candidate space of open-ended examples.

To address these challenges, we build two adversarial meta-evaluation dialogue generation datasets KdConv-ADV and DSTC7-ADV based on KdConv Zhou et al. (2020) and DSTC7-AVSD Alamri et al. (2019), respectively. Meta-evaluation is a process that assesses the quality of evaluation methods. In contrast to prior meta-evaluation dialogue datasets Mehri and Eskenazi (2020), both KdConv-ADV and DSTC7-ADV encompass not only open-ended examples but also lots of closed-ended examples. Specifically, the KdConv-ADV consists of equal numbers of closed-ended and open-ended examples. We ask human annotators to generate three new candidate responses with low lexical overlap with the reference response for each example. The generated candidates demonstrate both reasonability and high quality for open-ended examples, while the generated candidates tend to be inconsistent with the provided facts, and even include fictitious information (i.e., adversarial examples) for closed-ended instances. The DSTC7-ADV is completely composed of closed-ended examples. We generate adversarial examples by rewriting the reference responses to ensure that their semantics are inconsistent with the provided facts. Candidate responses that have a low lexical overlap with the reference in KdConv-ADV may be of high-quality, whereas candidates that have a high overlap with the reference in DSTC7-ADV may be of low-quality, making reference-based metrics almost useless, which are also extremely challenging for reference-free evaluators.

We evaluate ChatGPT and multiple open source LLMs, such as Vicuna Chiang et al. (2023) and ChatGLM Du et al. (2022). Experimental results on KdConv-ADV and DSTC7-ADV show that reference-free evaluators based on LLMs have the following disadvantages: i) insufficient knowledge; ii) insufficient ability to identify unreasonable responses; iii) insufficient differentiation of scores. To summarize, we make the following contributions:

•

We construct two adversarial meta-evaluation dialogue datasets KdConv-ADV and DSTC7-ADV based on KdConv and DSTC7-AVSD to comprehensively evaluate the reliability of dialogue generation metrics.
•

We propose new challenges for dialogue generation metric evaluation, requiring evaluators to be able to evaluate generated text at a semantic level rather than lexical matching.
•

We evaluate and analyze the performance of reference-based and reference-free evaluators on KdConv-ADV and DSTC7-ADV. Experimental results show that LLM-based reference-free evaluators demonstrate promising performance as alternatives to reference-based methods, particularly for tasks not requiring external knowledge.

2 RELATED WORK

2.1 Reference-based Evaluators

Ngram-based Metrics

Ngram-based metrics evaluate the dialogue models by measuring the lexical overlap between a generated response and a reference text. BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005), ROUGE Lin (2004) are widely used metrics for dialogue generation evaluation Bao et al. (2020); Liu et al. (2022b, a). Most of these metrics are based on n-gram overlap between a generated candidate and reference response. They fail to measure the content quality of generated candidates and therefore do not evaluate the dialogue generation systems accurately. Honovich et al. (2021) proposes to use a question answering system for fact consistency evaluation. This method relies on high-quality external knowledge and question answering system. Dziri et al. (2022b) introduces a new benchmark to evaluate the reliability of reference-based metrics.

Embedding-based Metrics

Embedding-based Metrics evaluate the dialogue generation systems by measuring the semantic similarity between the generated candidate and the reference response. Embedding Average is a metric that measures the distance between two texts by averaging the vector representations of their constituent words, which is widely used in textual similarity tasks Wieting et al. (2015); Liu et al. (2016, 2022b). BERTScore Zhang et al. (2019) employs the contextualized representation from BERT to measure the similarity between generated candidate and reference response. MoverScore Zhao et al. (2019) adds soft alignments based on BERTScore to obtain a more robust similarity measurement. Ghazarian et al. (2020) trains a classification task on pooled vectors to evaluate the engagement of responses.

These methods pay more attention to semantic similarity than ngram-based metrics but still fail to make a fair assessment for multiple reasonable responses because embedding-based metrics still consider candidates with high overlap with the reference to be of high quality.

2.2 Reference-free Evaluators

Reference-free evaluation refers to methods of judging the quality of generated text according to the degree of correlation between dialogue history and generated candidates in multiple aspects. Existing works usually trained specific models as reference-free evaluators before LLMs. Mehri and Eskenazi (2020) proposes an unsupervised reference-free metric by training models on downstream tasks to evaluate open-ended examples. Pang et al. (2020) uses data augmentation methods to train a more robust evaluators. Yeh et al. (2021) shows that metrics that rely on a specific data set lack the ability to generalize. Dziri et al. (2022a) create FAITHDIAL datasets based on the Wizard of Wikipedia to reduce factual errors in the training corpus. Khalid and Lee (2022) proposes an adversarial test-suite to evaluate the bias of metrics based on trained models. However, reference-free evaluation by humans is still a must in almost all dialogue generation tasks Zhang et al. (2018b); Zhou et al. (2020); Bao et al. (2020); Liu et al. (2022b). Human evaluation is expensive and only evaluates a small number of examples that are selected. Reference-free evaluators based on LLMs offer hope for solving this problem.

Wang et al. (2023a) believes that ChatGPT achieves competitive correlation with golden human judgments through preliminary meta-evaluation. Liu et al. (2023) proposes the probability of each score calculated by the ChatGPT or GPT-4 OpenAI (2023) as the weight for the corresponding score to improve the alignment with human judgment. GPTScore Fu et al. (2023) takes the sum of the logarithms of the decoding probabilities of the evaluated text as the final score. According to existing experimental conclusions, evaluators based on LLMs have a tendency to replace human evaluaiton. However, these studies lack evaluation on closed-ended examples as well as stability of evaluators testing. Therefore, we construct two adversarial meta-evaluation dialogue datasets to test the reliability of evaluators on closed-ended and adversarial examples. Different from the traditional works (Honovich et al., 2021; Dziri et al., 2022b; Khalid and Lee, 2022) of evaluation reference-based and model-based metrics, we pay more attention to the reliability of reference-free evaluators based on LLMs.

3 Dataset Construction

For the existing meta-evaluation datasets (i.e., Topical-Chat and Persona-Chat) constructed by Mehri and Eskenazi (2020), we manually annotate the types (i.e., open-ended and closed-ended) of examples. The standard for annotation is to mark as closed-ended if the response to be generated does not have semantic diversity according to the dialogue history and facts provided, otherwise it is open-ended. The labels are annotated based on the dialogue history when no facts. We construct two new adversarial meta-evaluation datasets KdConv-ADV and DSTC7-ADV that include lots of closed-ended examples. The statistical results of the datasets are shown in Table 5 (Appendix A.1). We can observe that the datasets we built contains a large number of closed-ended instances with adversarial examples, which can test the reliability of different evaluators on closed-ended examples and the stability on adversarial examples. Datasets KdConv-ADV and DSTC7-ADV are derived from KdConv and DSTC7-AVSD, respectively.

3.1 KdConv-ADV

KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset Zhou et al. (2020), which provides a reference response for each example. We select 91 examples with unique response semantics from KdConv as closed-ended examples. The characteristic of these examples is to use unique information such as location or time as the response. For example, "Tokyo is the capital of Japan." We pick an equal number of instances with multiple response semantics from KdConv as open-ended examples. The questions in these examples are open-ended, and the semantics of the responses are not unique, such as "What do you think of Tokyo?" In order to effectively evaluate whether evaluators have the ability to identify unreasonable responses under low lexical overlap, we ask annotators to generate three new candidate responses for each example.

For closed-ended examples, the generated candidate responses are inconsistent with dialogue histories and are even false information. To achieve this goal, we utilize GPT-4 to generate five candidates according dialogue history. The specific prompt is "Dialogue history: $content$. Please generate five different responses", where $content$ represents the content of the conversation history. For the five generated responses, we intentionally alter key candidate information to be irrelevant or even false. Finally, we select three candidates with the lowest BLEU scores compared to the reference response as test examples. As shown in Table 1, the location information (i.e., "Taipei, Taiwan") provided by the first candidate is inconsistent with fact, and the locations in the second and third candidates are completely fictitious (i.e, "Yamaguchi Prefecture, in Chang’an Kyushu" and Matsuyama City, Ehime Prefecture, Tokyo).

For open-ended examples, we expect the generated candidates to be reasonable. Similarly, we utilize GPT-4 to generate five candidates according dialogue history. Then we then manually perform information correction on the generated candidates based on the original knowledge base provided by KdConv. We also select the three candidates with the lowest BLEU scores compared to the reference response as test examples. As shown in Table 1, the generated candidates are of high quality and reasonable.

Using the responses provided by KdConv as references to score candidates based on different metrics. As shown in Figure 2 (left), we can observe that the BLEU-1 score of the closed-ended examples is 14%, and that of the open-ended examples is 16%, which means the lexical overlap is low between generated candidate and reference responses in KdConv-ADV. For open-ended examples, although there is low lexical overlap between candidate and reference responses, these candidates are high-quality and reasonable responses. It can be found that these reference-based metrics give almost similar scores to closed-ended and open-ended examples, which indicates that reference-based metrics does not have the ability to identify unreasonable responses when the reference and candidate responses have low lexical overlap.

Notably, we refrain from leveraging the knowledge provided by the corpus during response evaluation. This approach stems from our desire to assess the agreement between various evaluators and human judgment without relying on external knowledge, which is made due to the challenge of furnishing accurate knowledge bases for each example in most cases.

3.2 DSTC7-ADV

DSTC7-AVSD is a knowledge-grounded response generation dataset with textual knowledge that is video’s caption and summary Alamri et al. (2019). DSTC7-AVSD provides six reference candidates with similar semantics and different expressions. We consider the first one as the reference response, while the remaining ones are regarded as candidate responses. We select 342 examples with unique response semantics from DSTC7-AVSD as closed-ended examples. In order to effectively evaluate whether evaluators are able to identify unreasonable responses based on semantics rather than matching, we reverse the semantics of responses to obtain the same amount of adversarial examples by negation transformations, such as "can" $\rightarrow$ "can not", "is" $\rightarrow$ "is not", "only" $\rightarrow$ "not only". Therefore, candidate response semantics of adversarial examples are contradictory to the facts provided (i.e., video descriptions). As shown in Table 1, the responses of the adversarial examples (i.e., "…not only one…", "…not only a single…","…two persons…" and "…seven persons…") are contradictory and inconsistent with the facts (i.e., "…one person…"). We analyze the characteristics of DSTC7-ADV by calculating the scores of reference-based evaluators. As shown in Figure 2 (right), we can observe that the BLEU-1 score of the closed-ended examples is 32%, and that of the corresponding adversarial examples (i.e., closed-ADV) is 28%, which means the lexical overlap is higher between generated candidate and reference responses compared to KdConv-ADV. Similar phenomena can also be observed from the results of other metrics. For adversarial examples, although there is high lexical overlap between candidate and reference responses, these candidates are unreasonable responses. Besides, we can also find that these metrics give almost similar scores to closed-ended and adversarial examples, which shows that these metrics do not have the ability to identify unreasonable responses based on semantics.

Different from KdConv-ADV, we employ facts provided when evaluating responses. We follow previous studies Bao et al. (2020); Liu et al. (2022a) to concatenate video descriptions to conversation history beginnings. The primary motivation is to assess the proficiency of various evaluators in comprehending and utilizing knowledge effectively.

Evaluation Dimensions

Based on previous studies Zhang et al. (2018b); Bao et al. (2020); Xu et al. (2022); Liu et al. (2022a), we divide the reference-free evaluation dimensions into two categories: independent and correlated dimensions. Independent dimensions are evaluated solely based on the generated candidates, without considering any other factors or references, mainly including fluency, naturalness and engagingness. The correlated dimensions refers to the evaluation not only based on candidates but also referring to the relationship between candidates and dialogue history even facts, mainly including coherence, relevance, consistency and groundedness.

Existing studies Mehri and Eskenazi (2020); Liu et al. (2023) based on Topical-Chat and Persona-Chat have extensively studied evaluators on four dimensions: naturalness, coherence, engagingness and groundedness. We select coherence, relevance, consistency and fluency not tested before as evaluation dimensions for DSTC7-ADV and KdConv-ADV, and find that evaluators based on LLMs are more likely to make mistakes in correlated evaluation dimensions after preliminary experimental analysis. The definition of each evaluation dimension is defined as follows:

•

Fluency refers to the fluency and grammatical correctness of responses.
•

Coherence refers to the logical and semantic coherence between responses and previous context.
•

Relevance refers to the degree to response is connected or relevant to a particular topic, question, or situation of previous context.
•

Consistency refers to the logical and factual consistency between responses and previous context, facts also include external commonsense knowledge.

[Uncaptioned image] — Table 1: Examples of KDConv-ADV and DSTC7-ADV. ADV indicates that the corresponding candidate is an adversarial example. The score corresponding to green indicates that the evaluation is reasonable, pink indicates a slightly higher score, red indicates that the evaluation is unreasonable, and yellow indicates unreasonable fluctuations in the scores of adversarial examples.

Prompt for Evaluation

Note that the reference-free evaluator is a prompt-based evaluation process. We find that the graded scoring mechanism may lead to the low variance of the scores and the low correlation with human judgments Liu et al. (2023). Another fact is that the ranking mechanism (i.e., ordering of candidates during scoring) will get different results due to the different positions of multiple candidates Wang et al. (2023b). In order to compare with the with reference-based evaluators, we divide each dimension into 10 levels (i.e., L1:(0-0.1), L2(0.1-0.2),…,L10(0.9-1)) (Appendix A.4). We find that the output may focus on one value when asking LLMs to output a level number, such as 10, making it impossible to calculate Spearman Zar (2005) and Spearman coefficientscitep Mukaka (2012). In order to avoid this problem, we require LLMs to output a value of 0-1, which is mapped to the corresponding level finally. We follow previous studies Huang et al. (2023); Liu et al. (2023) to design the prompts, as shown in Figure 4 (Appendix A.2). Note that different LLMs correspond to different delimiters. If there is no fact, the content of the corresponding position is empty. In this manner, the dialogue history, response, corresponding fact and the definition of evaluation dimension are given to LLMs. Next, LLMs will give its judgment (e.g., "The response is consistent with the information provided in the input. Therefore, the score is 1."). Finally, the numerical scores could be easily extracted via heuristic rules. To evaluate the consistency between LLMs and human judgement, we performed human annotation. There are three annotators for human annotation, the average of the three points is used as the final score. The final score will be mapped to the corresponding level. The Fleiss’ Kappa (Moons and Vandervieren, 2023) is 0.766, which indicates better annotation agreement. Please refer to the appendix for details (i.e., Appendix A.4).

4 Experiments

The introduction of the baselines A.5 and detailed experimental setup A.6 are in the Appendix.

4.1 Results

To test the agreement between different evaluators and human on the dialogue response generation task, we compute turn-level Pearson and Spearman correlation on Topical-Chat, Persona-Chat, KdConv-ADV and DSTC7-ADV. Table 6 (Appendix A.1) reports the results of different evaluators on Topical-Chat and Persona-Chat. We can observe that reference-free evaluators have better human agreement compared to reference-based evaluators. Specifically, the spearman correlation of UNIEVAL evaluator on Topical-Chat is 53.3%, and pearson correlation is 57.7%. The ChatGPT evaluator achieve similar results to UNIEVAL on Topical-Chat. ChatGPT evaluator’s spearman is 45% and pearson is 39% on Persona-Chat. On KdConv-ADV and DSTC7-ADV datasets, we can draw the same conclusion from Table 3 that reference-free evaluators have better human agreement.

Table 2 reports the results of different evaluators on KdConv-ADV and DSTC7-ADV. Traditional reference-based evaluators have better human agreement compared to reference-free evaluators based on LLMs on KdConv-ADV. The results of BLEU-3, BLEU-4 and METEOR outperform evaluator based on ChatGPT by an average of 12.5%/14.8% (pearson/spearman), 12.2%/15.2% and 11.6%/16.6% respectively. However, we observe the opposite result where reference-free evaluators based on LLMs outperform reference-based evaluators on DSTC7-ADV. Evaluator based on ChatGPT outperforms BLEU-4 by an average of 27.7% and 30.2%. The disparate phenomena observed in the two datasets suggest that reference-free evaluators encounter reliability issues. The reasons for this phenomenon are complex. We will conduct an in-depth analysis from the perspective of datasets and evaluators.

4.2 Reliability of Reference-based Evaluators

To further analyze the performance of different evaluators on different data types, we report results of evaluators on different data types separately. Table 3 and Table 4 report the performance of different evaluators on open-ended and closed-ended examples (i.e., adversarial examples) on KdConv-ADV and DSTC7-ADV. An interesting phenomenon is that reference-based evaluators show better alignment with humans in KdConv-ADV’s closed-ended examples, which leads to reference-based evaluators having better alignment on KdConv-ADV. On DSTC7-ADV dataset, the results are completely opposite. The main reason is that the candidates and references of KdConv-ADV’s closed-ended examples have low overlap (i.e., Figure 2), which causes reference-based evaluators that judge text quality based on lexical matching to tend to give low scores. And the candidates of KdConv-ADV’s closed-ended examples that are inconsistent with history are of low quality. The tendency to give low scores and the consistency of low-quality responses are important reasons for high alignment between reference-based evaluators and humans. However, a high overlap between candidates and responses does not mean that the candidates are of high quality on DSTC7-ADV, which is the causes the reference-based evaluators to fail. We believe that using reference-based evaluators may result in unfair evaluation for tasks that generate text with high diversity, such as dialogue generation tasks. But for tasks with low diversity, such as translation tasks, extractive generation tasks, etc., reference-based evaluators are still a more credible choice.

4.3 Reliability of Reference-free Evaluators

Different from reference-based evaluators, reference-free evaluators use LLMs to score the generated responses without any reference target. As we observed, the reference-free evaluators have better alignment with humans on open-ended examples (i.e., Table 3 and Table 6). For evaluation tasks involving knowledge, the reference-free evaluators will give unfair judgement without sufficient external knowledge support. According to Table 2, reference-free evaluators have poorer alignment with humans compared to reference-based evaluators on KdConv-ADV. We also observe that the reference-free evaluators achieves better human alignment on DSTC7-ADV when external knowledge is provided. We consider reference-free evaluators to be a reliable alternative to reference-based evaluators on tasks that do not involve external knowledge.

Most tasks involve external knowledge, which requires LLMs to be a knowledgeable evaluators to make a reasonable judgment. However, LLMs cannot update its knowledge in real time. Therefore, how to make full use of external knowledge bases and improve the reliability of reference-free evaluators for knowledge-based task evaluation is challenging.

Discrimination Ability

An effective evaluator should demonstrate the ability to identify unreasonable responses and distinguish responses of varying qualities. To reveal whether LLMs have the ability to distinguish responses of different quality, we take ChatGPT as an example to report its score distribution on KdConv-ADV and DSTC7-ADV, as shown in Figure 3. The scores of KdConv-ADV are mostly concentrated between 0.8 and 1.0 and the scores of DSTC7-ADV are mostly concentrated between 0.5 and 1.0. Although the score distribution of ChatGPT in DSTC7-ADV is more discriminative than that in KdConv-ADV, the scores of ChatGPT still have the tendency of overestimation on DSTC7-ADV (i.e., most of the scores are above 0.5). While it is evident that the responses from the adversarial example of DSTC7-ADV contradict the factual information, it is surprising that lots of adversarial examples achieve a high score of 0.9 on the consistency dimension, which is extremely unreasonable.

There are still some deficiencies in reference-free evaluators based on LLMs. First, LLMs have inherent limitations in their knowledge. Second, the scores of LLMs have a large room for improvement in distinguishing responses of different qualities. The ratings of LLMs exhibit a tendency to cluster within a narrow range, displaying low variance, and sometimes even assigning high ratings to unreasonable responses.

5 Conclusion

We construct two adversarial meta-evaluation dialogue datasets KdConv-ADV and DSTC7-ADV. Based on KdConv-ADV and DSTC7-ADV, we analyze the performance and reliability of reference-free and reference-based evaluators. We think that reference-based evaluators are still reliable for tasks with low diversity, and reference-free evaluators are a reliable alternative to reference-based evaluators on tasks that do not involve external knowledge. Reference-free evaluators may provide unreasonable evaluations for tasks involving knowledge when external knowledge is absent. Besides, reference-free evaluators tend to overestimate the quality of the text and are still deficient in distinguishing text quality.

Limitations

While we analyze the challenges and possibilities and of LLMs as text generation evaluators by constructed benchmarks, the utilization of LLMs as evaluators for text generation is in the exploratory phase. There are limitations that provide avenues for future work: i) the performance of LLMs as an NLG metric is related to prompts, how to reduce the sensitivity of LLMs to prompt and improve the reproducibility of results is an important issue. ii) our work pays more attention to dialogue tasks with high diversity, and lacks the analysis of other generative tasks.

References

Alamri et al. (2019) Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K Marks, Chiori Hori, Peter Anderson, et al. 2019. Audio visual scene-aware dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7558–7567.
Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
Bao et al. (2020) Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng Wang. 2020. Plato: Pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 85–96.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
Ding et al. (2022) Bosheng Ding, Chengwei Qin, Linlin Liu, Lidong Bing, Shafiq Joty, and Boyang Li. 2022. Is gpt-3 a good data annotator? arXiv preprint arXiv:2212.10450.
Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
Dziri et al. (2022a) Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo M Ponti, and Siva Reddy. 2022a. Faithdial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490.
Dziri et al. (2022b) Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter. 2022b. Evaluating attribution in dialogue systems: The begin benchmark. Transactions of the Association for Computational Linguistics, 10:1066–1083.
Frieder et al. (2023) Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. 2023. Mathematical capabilities of chatgpt. arXiv preprint arXiv:2301.13867.
Fu et al. (2023) **lan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
Ghazarian et al. (2020) Sarik Ghazarian, Ralph Weischedel, Aram Galstyan, and Nanyun Peng. 2020. Predictive engagement: An efficient metric for automatic evaluation of open-domain dialogue systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7789–7796.
Gopalakrishnan et al. (2020) Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anushree Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2020. Topical-chat: Towards knowledge-grounded open-domain conversations. pages 7098–7108.
Honovich et al. (2021) Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q2:: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7856–7870.
Huang et al. (2023) Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. arXiv preprint arXiv:2302.07736.
Khalid and Lee (2022) Baber Khalid and Sung** Lee. 2022. Explaining dialogue evaluation metrics using adversarial behavioral analysis. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5871–5883.
Kocoń et al. (2023) Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, et al. 2023. Chatgpt: Jack of all trades, master of none. arXiv preprint arXiv:2302.10724.
Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.
Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634.
Liu et al. (2022a) Yongkang Liu, Shi Feng, Daling Wang, Hinrich Schütze, and Yifei Zhang. 2022a. Pvgru: Generating diverse and relevant dialogue responses via pseudo-variational mechanism. arXiv preprint arXiv:2212.09086.
Liu et al. (2022b) Yongkang Liu, Shi Feng, Daling Wang, and Yifei Zhang. 2022b. Mulzdg: Multilingual code-switching framework for zero-shot dialogue generation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 648–659.
Mehri and Eskenazi (2020) Shikib Mehri and Maxine Eskenazi. 2020. Usr: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.
Moons and Vandervieren (2023) Filip Moons and Ellen Vandervieren. 2023. Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. a generalisation of fleiss’ kappa. arXiv preprint arXiv:2303.12502.
Mukaka (2012) Mavuto M Mukaka. 2012. A guide to appropriate use of correlation coefficient in medical research. Malawi medical journal, 24(3):69–71.
OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. ArXiv.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, pages 27730–27744.
Pang et al. (2020) Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-**g Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Popović (2017) Maja Popović. 2017. chrf++: words hel** character n-grams. In Proceedings of the second conference on machine translation, pages 612–618.
Qin et al. (2023) Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
Rao et al. (2023) Haocong Rao, Cyril Leung, and Chunyan Miao. 2023. Can chatgpt assess human personalities? a general evaluation framework. arXiv preprint arXiv:2303.01248.
Sedoc et al. (2019) Joao Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. 2019. Chateval: A tool for chatbot evaluation. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations), pages 60–65.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Wang et al. (2023a) Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, **an Xu, Jianfeng Qu, and Jie Zhou. 2023a. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048.
Wang et al. (2023b) Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023b. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
Wieting et al. (2015) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings.
Xu et al. (2022) Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. 2022. Long time no see! open-domain conversation with long-term persona memory. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2639–2650.
Yeh et al. (2021) Yi-Ting Yeh, Maxine Eskenazi, and Shikib Mehri. 2021. A comprehensive assessment of dialog evaluation metrics. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 15–33.
Zar (2005) Jerrold H Zar. 2005. Spearman rank correlation. Encyclopedia of biostatistics, 7.
Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
Zhang et al. (2021) Chen Zhang, João Sedoc, Luis Fernando D’Haro, Rafael Banchs, and Alexander Rudnicky. 2021. Automatic evaluation and moderation of open-domain dialogue systems. arXiv preprint arXiv:2111.02110.
Zhang et al. (2018a) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.
Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert.
Zhang et al. (2018b) Weinan Zhang, Yiming Cui, Yifa Wang, Qingfu Zhu, Lingzhi Li, Lianqiang Zhou, and Ting Liu. 2018b. Context-sensitive generation of open-domain conversational responses. In Proceedings of the 27th international conference on computational linguistics, pages 2437–2447.
Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.
Zhou et al. (2020) Hao Zhou, Chujie Zheng, Kaili Huang, Minlie Huang, and Xiaoyan Zhu. 2020. Kdconv: A chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7098–7108.

Appendix A Appendix

A.1 Tables

A.2 Prompt Template

A.3 Case Study

In order to analyze the existing problems more intuitively, we present detailed cases (i.e., Table 1). It can be clearly seen that ChatGPT gives high scores for adversarial examples on KDConv-ADV and DSTC7-ADV. In the example of KdConv-ADV, ChatGPT fails to recognize fictional locations (i.e., "Yamaguchi Prefecture in Chang’an Kyushu" and "Matsuyama City, Ehime Prefecture, Tokyo") and gives the highest rating on the consistency dimension. In the example of DSTC7-ADV, ChatGPT cannot identify candidates with semantic inconsistencies with given fact caused by slight perturbations. As mentioned in previous subsection, considerable text comprehension abilities is the premise and basis for using LLMs as evaluators. However, we can conclude that the robustness of LLMs has a lot of room for improvement, and LLMs cannot correctly understand subtle semantic perturbations.

A.4 Evaluation Dimensions

We select coherence, relevance, consistency and fluency as evaluation dimensions. The level definitions for different dimensions are as follows.

Fluency:

•

L1: almost incomprehensible, heavily grammatical errors, poor coherence.
•

L2: There are many grammatical errors, and it is difficult to understand.
•

L3: Many grammatical errors, unclear expressions, require effort to understand.
•

L4: There are some grammatical errors, and the expression is acceptable.
•

L5: There are some grammatical errors, and the expression is generally clear.
•

L6: The grammar is basically correct, and the expression is coherent, but there are some minor errors.
•

L7: The grammar is correct, the expression is fluent and coherent, and there are only a few minor errors.
•

L8: The grammar is almost completely correct, and the expression is fluent and natural, with some minor errors.
•

L9: The grammar is almost perfect, the expression is very fluent, and there are few errors.
•

L10: Whether it is grammar, vocabulary, or expression, it is perfect, with almost no errors.

Consistency:

•

L1: lack logical structure and coherence, contain internal inconsistencies and fake information, making it difficult to follow or understand.
•

L2: contain internal inconsistencies, where statements within contradict each other.
•

L3: contradict the established context, either within the conversation history.
•

L4: contain factual inaccuracies or incorrect information that can be easily identified based on available knowledge.
•

L5: align with the conversation history but may deviate from established facts or external knowledge.
•

L6: demonstrate basic logical consistency within the conversation history, but there may be minor inconsistencies.
•

L7: align well with the conversation history, with some errors or inconsistencies.
•

L8: demonstrate logical coherence, and statements align with each other and the conversation history.
•

L9: not only logically consistent but also reflect factual information, aligning well with the conversation history.
•

L10: not only internally consistent and factually accurate but also align with external commonsense knowledge.

Relevance:

•

L1: no connection to the topic of conversation.
•

L2: touch upon the topic but lack a substantial connection.
•

L3: contain some relevant elements but miss key points.
•

L4: contain most relevant elements but lack a comprehensive or coherent connection to the entire context.
•

L5: show a moderate degree of relevance, involving some aspects of conversation topic.
•

L6: generally relevant to the topic or question, providing a basic understanding for conversation.
•

L7: demonstrate a good level of relevance, involving most aspects of conversation topic.
•

L8: highly relevant, involving the topic and providing a detailed content connection to the context.
•

L9: exhibit great level of relevance, thoroughly involving the conversation topic.
•

L10: not only completely relevant but also involving the topic with clarity and insight.

Coherence:

•

L1: lack both logical and semantic coherence, making it challenging to understand.
•

L2: have semantic gaps or disjointed elements, making it difficult to establish a connection to previous context.
•

L3: lack logical structure, leading to difficulties in understanding the conversation.
•

L4: exhibit partial coherence but still contain logical or semantic gaps.
•

L5: have a basic logical relation, but there are gaps or inconsistencies in semantic coherence.
•

L6: demonstrate a basic level of both logical and semantic coherence, providing a generally understandable ideas.
•

L7: exhibit a good level of logical and semantic coherence, making it easy to understand and follow.
•

L8: display a better level of both logical and semantic coherence, ensuring a smooth and connection to previous context.
•

L9: show an high level of both logical and semantic coherence, with a seamless and clear connection to previous context.
•

L10: achieve flawless logical and semantic coherence, presenting information in a way that is not only easy to follow but also understanding.

Note that these are also reference standards for human annotation. We have three annotators for each dimension, and the average of the three points is used as the final score, and the final score is the result after retaining two decimal places. All annotators are graduate students engaged in NLP research. The fees incurred for labeling are supported by the corresponding funds.

A.5 Baselines

The reference-based evaluators used are as follows:

•

BLEU-1, BLEU-2, BLEU-3, BLEU-4 Papineni et al. (2002), ROUGE-1, ROUGE-2 and ROUGE-L Lin (2004) measure the lexical overlap between the generated text and the candidate text.
•

METEOR Banerjee and Lavie (2005) calculates the similarity between the candidate text and the reference text based on word-level precision and recall, as well as penalties for word order.
•

ChrF++ Popović (2017) uses the F-score statistic for character n-gram matches to judge the similarity between the candidate text and the generated text.
•

BERTScore Zhang et al. (2019) evaluates the semantic similarity via pre-trained BERT model.

The reference-free evaluators used are as follows:

•

ChatGPT is an advanced AI language model developed by OpenAI, trained on a vast amount of text data, capable of understanding and generating human-like text across a wide range of topics.
•

Vicuna-13B is an open-source chatbot trained by fine-tuning LLaMA-13B on user-shared conversations collected from ShareGPT.
•

ChatGLM-6B is an open-source dialogue language model based on General Language Model Du et al. (2022) that supports both Chinese and English.
•

StableLM-13B is an open-source dialogue language model based on vicuna fine-tuned by RLHF Ouyang et al. (2022).

A.6 Experimental Setup

For ChatGPT (i.e., GPT-3.5), we obtain the result by calling the API interface of OpenAI¹¹1https://chat.openai.com. We set parameters temperature to 0.7, the presence penalty to 0, the frequency penalty to 0.2 and the maximum sentence length to 1024. Codes for other LLMs are available online²²2https://github.com/lm-sys/FastChat. The maximum decoding length is set to 512 for Vicuna-13B Chiang et al. (2023), ChatGLM-6B Du et al. (2022); Zeng et al. (2022), StableLM-13B and Dolly-12B. The temperature is set to 0.8 for Vicuna-13B and others are set to 0.7. Except for ChatGPT, the weights of other LLMs can be downloaded from the hugging face³³3https://huggingface.co/models. We employ the hugging face evaluation library ⁴⁴4https://huggingface.co/evaluate-metric to calculate the results for reference-based metrics. The default delimiter is "###". For ChatGPT and Vicuna we use space as delimiter. We use RTX A6000 (48G) for inference of open source LLMs. Note that we cost about $20 to call the ChatGPT API interface.

A.7 Metrics

We employ Spearman correlation Zar (2005) ( $\rho$ ) and Pearson correlation Mukaka (2012) ( $\gamma$ ) to evaluate different metrics correlate with human judgment.