License: arXiv.org perpetual non-exclusive license
arXiv:2402.14690v1 [cs.CL] 22 Feb 2024

UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Zhaoheng Huang$^{1}$, Zhicheng Dou$^{1*}$, Yutao Zhu$^{1}$, Ji-rong Wen$^{1}$
$^{1}$Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
{huangzh,dou,ytzhu,jrwen}@ruc.edu.cn
$^{*}$Corresponding author.
Abstract

Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or hallucination. Existing research on evaluating the factuality of LLMs involves extracting fact claims using an LLM and verifying them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks remains under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks containing six representative datasets. Then, we propose UFO, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at https://github.com/WaldenRUC/UFO.

1 Introduction

The advancement of large language models (LLMs) has facilitated the development of generative artificial intelligence Zhao et al. (2023). Many LLM-based applications have been released, such as ChatGPT (https://chat.openai.com/chat) and Bing Chat (https://www.bing.com/new, also known as Bing Copilot), which are gradually changing people’s working habits. However, LLMs tend to generate factually inaccurate text that lacks consistency with human knowledge, which degrades the usability of model-generated text. This shortcoming of LLMs is widely known as hallucination Bang et al. (2023); Ji et al. (2023). The quality of datasets and the training paradigms are considered potential causes of hallucination in LLMs Li et al. (2022). How to detect and measure hallucinations in model-generated text has received increasing attention.

Figure 1: Our proposed factuality evaluation pipeline UFO. We integrate four fact sources within various evaluation scenarios to assess the factuality score.

Current automatic evaluation metrics employ a specific fact source to evaluate the factuality of LLMs for certain tasks. However, there is still a lack of analysis on the applicability of different fact sources in various tasks. When a new task is established, the fact sources relied upon by previous evaluation methods may not be applicable, so it is important to consider whether alternative fact sources can be used. For example, when a new QA task arises, collecting human-written evidence can be extremely costly. In such cases, whether results from a search engine can substitute for human-written evidence as a fact source remains unexplored.

To address this issue, we propose UFO, a Unified and Flexible framework for factuality evaluatiOn, which allows for: (a) flexible integration of various fact sources; (b) a unified verification method that enables switching fact sources in specific tasks; and (c) the combination of different fact sources to enhance factuality evaluation. In our framework, shown in Figure 1, we first extract fact units from the text, including verifiable question-answer pairs and keywords of the model-generated text. Then, for each fact unit, we verify it against the sequence of fact sources until a matching answer is found. Finally, we assign a binary matching score to each fact unit.

With the support of this evaluation framework, we can systematically analyze the evaluation capabilities of different fact sources across various scenarios in existing evaluation tasks. Specifically, we consider four different fact sources: (1) Human-written evidence. This corresponds to text generation tasks with labeled data. For example, expert-validated QA tasks often provide human-written answers for evaluation. (2) Reference documents. Many recent studies, e.g., WebBrain Qian et al. (2023), WebGPT Nakano et al. (2022), GopherCite Menick et al. (2022), WebCPM Qin et al. (2023), WebGLM Liu et al. (2023), ALCE Gao et al. (2023), and Bing Chat, have reported that leveraging reference documents helps LLMs generate more factual text. Therefore, such reference documents can also serve as a fact source for factuality evaluation. (3) Search engine results. When humans are asked to check the factuality of a text, they usually turn to search engines to make their judgments. (4) LLM knowledge. Existing studies Fu et al. (2023) suggest that advanced LLMs (such as GPT-4) can serve as a fact source for verification.

We design five evaluation scenarios in which different fact sources and their combinations are used, summarized in Table 1, to demonstrate the flexibility of UFO. In each evaluation scenario, we compute the discriminative power Sakai (2006) of our proposed framework and compare it with eight baseline metrics. We experiment with these evaluation scenarios over five text generation tasks, including Open-domain QA, Web Retrieval-based QA, Expert-Validated QA, News Fact Generation, and Retrieval-Augmented QA, to investigate the importance of data sources in different task scenarios. The experimental results demonstrate that in most QA tasks, obtaining human-written evidence and reference documents enhances the discriminative power of the evaluation pipeline. In the news fact generation task, we only require the search engine results and LLM knowledge to verify facts. In the retrieval-augmented QA task, the positive effects of human-written evidence and reference documents are comparable, so the two can be substituted for each other. Although not the main focus of this paper, we evaluate six existing LLMs: Bing Chat in “precise” generation mode, ChatGPT, LLaMA-7b, LLaMA-13b, Vicuna-7b, and Vicuna-13b. We find that the factuality score of Bing Chat in precise mode is lower than that of ChatGPT, yet comparable to that of Vicuna-13b. Among open-source LLMs, increasing the number of model parameters can enhance factual accuracy.

Our contributions can be summarized as follows:

• We propose UFO, a pipeline integrating flexible plug-and-play fact sources with unified verification methods for evaluating the factuality of LLMs.

• We conduct a systematic analysis of the evaluation capabilities of four fact sources across five factuality evaluation scenarios and five tasks.

• We reveal that human-written evidence and reference documents are essential in QA tasks, while search engine results and LLM knowledge are crucial in news fact generation tasks.

Fact Sources

(1) Human-written evidence ($S_{\text{he}}$); (2) Reference documents ($S_{\text{rd}}$); (3) Search engine results ($S_{\text{se}}$); (4) LLM knowledge ($S_{\text{lk}}$).

Evaluation Scenarios

(1) $\langle S_{\text{se}}, S_{\text{lk}} \rangle$; (2) $\langle S_{\text{he}}, S_{\text{se}}, S_{\text{lk}} \rangle$; (3) $\langle S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$; (4) $\langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$; (5) $\langle S_{\text{rd}}, S_{\text{he}}, S_{\text{se}}, S_{\text{lk}} \rangle$.

Tasks

(1) Open-domain QA; (2) Web Retrieval-based QA; (3) Expert-Validated QA; (4) News Fact Generation; and (5) Retrieval-Augmented QA.

Table 1: The fact sources, evaluation scenarios, and tasks we study in the paper.

2 Related Work

2.1 Text Generation and Hallucination

The advancement of text generation has been propelled by pre-trained language models (PLMs) like BART Lewis et al. (2020), T5 Raffel et al. (2020), and GPT-2 Radford et al. (2019), utilizing structures that range from encoder-decoder to decoder-only configurations. The emergence of LLMs such as GPT-3 Brown et al. (2020), characterized by their vast parameter counts and extensive training data, marked a significant evolution. These LLMs exhibit “Emergent Abilities” Wei et al. (2022a) like In-Context Learning Dong et al. (2023) and Chain-of-Thought Reasoning Wei et al. (2022b). Despite these advancements, a challenge is the generation of text that deviates from human knowledge, known as hallucination Bang et al. (2023); Li et al. (2022). Even the latest LLMs, such as GPT-4 OpenAI (2023), still suffer from hallucinations, which greatly damages the factuality of the generated text.

In this paper, we propose a unified and flexible pipeline UFO to evaluate the factuality of the generated texts, which can detect hallucinations in various text generation tasks.

2.2 Factuality Evaluation

Factuality evaluation methods have evolved from traditional n-gram-based metrics to more sophisticated approaches leveraging PLMs and LLMs Li et al. (2022). Initially, metrics such as BLEU Papineni et al. (2002), ROUGE Lin and Och (2004), and METEOR Banerjee and Lavie (2005) assumed factual accuracy correlated with n-gram overlaps. Later, metrics like BERTScore Zhang et al. (2019) utilizing contextual embeddings, and BARTScore Yuan et al. (2021) employing generative scoring, captured deep semantic information between texts for evaluating factuality consistency. QAGS Wang et al. (2020) further innovates by combining entity extraction with PLM-based question generation and answering, while $Q^{2}$ Honovich et al. (2021) leverages natural language inference (NLI) for entailment analysis. More recently, LLM-based metrics such as FactScore Min et al. (2023) and FacTool Chern et al. (2023) utilize LLM’s reasoning ability, extracting and verifying facts against sources like Wikipedia dumps.

Different from previous studies, our proposed pipeline UFO integrates human-written evidence, reference documents, search engine results, and LLM knowledge for factuality evaluation.

3 Methodology

3.1 Problem Statement

Given a query $q_{D}$ sourced from a dataset $D$, an evaluated LLM $M$ generates a text passage $T_{M}(q_{D})$. We define a collection of fact sources, denoted as $S$. The objective is to assign a factuality score $s \in [0, 1]$ to the model-generated text $T_{M}(q_{D})$. A higher score denotes a greater consistency between the text $T_{M}(q_{D})$ and the fact sources $S$, indicating higher factual accuracy of the LLM $M$.

3.2 Fact Sources

Based on the origin of fact sources, we categorize them into four types: human-written evidence ($S_{\text{he}}$), reference documents ($S_{\text{rd}}$), search engine results ($S_{\text{se}}$), and LLM knowledge ($S_{\text{lk}}$). Each type of fact source contains a series of text passages $\{P^{1}, P^{2}, \cdots\}$. The first two types of fact sources ($S_{\text{he}}$ and $S_{\text{rd}}$) are provided by established datasets and are costly to collect; they include responses and evidence written by users and the reference documents users select while browsing web pages. The latter two ($S_{\text{se}}$ and $S_{\text{lk}}$) are fact sources relevant to specifically generated questions, independent of any particular dataset. These include text snippets retrieved from the web corpus and answers from the parameterized knowledge within LLMs.

For a given question, it might not be possible to obtain an answer from a certain fact source. Therefore, in an evaluation scenario, we predefine a sequence of fact sources $S = \langle S^{1}, S^{2}, \cdots \rangle$, and systematically verify each until a matched answer is extracted.

3.3 UFO Evaluation Framework

Our evaluation pipeline includes three LLM-based modules: Fact Unit Extraction, Fact Source Verification, and Fact Consistency Discrimination. We employ OpenAI’s ChatGPT API (gpt-3.5-turbo-1106) for these modules.

Figure 2: A case of evaluating Vicuna-generated text within the retrieval-augmented QA task where $S = \langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$. Details of the generated text are omitted for clarity. The extracted answers are highlighted.

3.3.1 Fact Unit Extraction

LLMs can generate a text with several sentences for a given input, but not all the generated sentences are fact-related. Therefore, our first problem is to determine the smallest unit for factuality evaluation. We start by analyzing the process of factuality evaluation performed by humans. When faced with a text, humans will first focus on entities and their relevant descriptions that may cause factual errors. Then, they will ask a series of questions about the factuality of these descriptions. For example, when a text describes the date of birth $D$ of a famous person $X$, a common question is “when was $X$ born?”. Finally, by comparing the golden answer $D^{\prime}$ (from their knowledge or the Internet) with $D$, the factuality of the description can be evaluated.

Based on this analysis, we consider that an entity-centric question $q_{k}$ and its corresponding answer $e_{k}$ can be used as a basic fact unit $f_{k} = \langle q_{k}, e_{k} \rangle$. However, a generated question may have only a weak relationship with its context. Therefore, we also generate concise keywords for the model-generated text. For instance, suppose a model-generated text concerns the demand for vinyl records in the Oxfam Charity Shop. The generated question, “What has led to a rise in demand for vinyl records?”, yields search engine results that discuss recent global music trends. Thus, it is necessary to generate concise keywords $t$ for the model-generated text so that the search engine retrieves content specifically relevant to the Oxfam Charity Shop.

Benefiting from the potent language comprehension capabilities of LLMs, we introduce an LLM-based Fact Unit Extraction (FUE) method to extract the fact units (Prompt A) and generate the keywords (Prompt A) from the model-generated text. These prompts are provided in Appendix A.

$$\{t, \langle q_{1}, e_{1} \rangle, \cdots, \langle q_{p}, e_{p} \rangle\} = \text{FUE}(T_{M_{i}}(q_{D})).$$

Next, we will utilize the fact source sequence $S$ in different scenarios to evaluate the factual accuracy of these fact units.
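The output of FUE can be pictured as a small record type. The sketch below is illustrative only: the names `FactUnit`, `ExtractionResult`, and `parse_fue_output`, as well as the JSON layout being parsed, are assumptions made for this example and are not the output format of the prompts in Appendix A.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FactUnit:
    """One fact unit f_k = <q_k, e_k>."""
    question: str  # entity-centric question q_k
    answer: str    # answer e_k as stated in the model-generated text


@dataclass(frozen=True)
class ExtractionResult:
    """Keywords t plus the fact units extracted from one generated text."""
    keywords: str
    fact_units: List[FactUnit]


def parse_fue_output(raw: str) -> ExtractionResult:
    """Parse a reply from the extraction LLM (hypothetical JSON layout)."""
    data = json.loads(raw)
    units = [FactUnit(u["question"], u["answer"]) for u in data["fact_units"]]
    return ExtractionResult(keywords=data["keywords"], fact_units=units)
```

Keeping the keywords next to the fact units makes it straightforward to pass both to the later verification step.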

3.3.2 Fact Source Verification

To verify the accuracy of a given fact unit $\langle q_{i}, e_{i} \rangle$ against the keywords $t$, our target is to identify the correct answer $a_{i}$ to the question $q_{i}$ using a specific text passage $P^{k}_{j}$ from a fact source $S^{k}$. However, not all text passages in the fact source are relevant to the question. To enhance the relevance of extracted answers, we employ the advanced language comprehension abilities of LLMs. We instruct the LLM-based Answer Extraction (AE) module to pinpoint the most relevant answers within the text, generating a “[NOANS]” text if no answer is found. This method involves directly prompting an LLM to retrieve answers from human-written evidence $S_{\text{he}}$ and reference documents $S_{\text{rd}}$ (Prompt A), reducing inaccuracies during fact verification Huang et al. (2023). When dealing with search engine results $S_{\text{se}}$ and LLM knowledge $S_{\text{lk}}$, we prompt the model to first check if an answer is available in the search results before resorting to its internal knowledge (Prompt A). Answers are sequentially sought in each text passage of the fact source until a suitable answer is found. If no text passage yields an answer, it indicates a mismatch with the fact source, leading to a transition to the next fact source $S^{k+1} \in S$ for verification.

Concretely, for a fact unit $\langle q_{i}, e_{i} \rangle$ and keywords $t$, we obtain the answer $a_{i}$ using passage $P^{k}_{j}$ from fact source $S^{k}$ as follows:

\begin{align}
a_{i} &= \text{AE}(t, P^{k}_{j}, q_{i}), \tag{1} \\
\text{AE}(t, P^{k}_{m}, q_{i}) &= \text{[NOANS]}, \tag{2} \\
m &\in \{1, \cdots, j-1\}. \tag{3}
\end{align}
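The sequential search in Eqs. (1)-(3) can be sketched as a nested loop over fact sources and their passages. This is a minimal illustration under stated assumptions: `verify_against_sources` and `toy_extractor` are hypothetical names, the extractor is a string-matching stand-in for the LLM-based AE module, every fact source is treated as a plain list of passages (the combined prompt for search results and LLM knowledge is not modeled), and returning `None` when no source yields an answer is a choice made for this example.

```python
from typing import Callable, Optional, Sequence

NOANS = "[NOANS]"

# An extractor maps (keywords, passage, question) to an answer string,
# or to NOANS when the passage does not contain an answer.
AnswerExtractor = Callable[[str, str, str], str]


def verify_against_sources(
    keywords: str,
    question: str,
    sources: Sequence[Sequence[str]],
    extract_answer: AnswerExtractor,
) -> Optional[str]:
    """Return the first answer found, scanning sources and passages in order."""
    for passages in sources:        # S^1, S^2, ... in the predefined order
        for passage in passages:    # P^k_1, P^k_2, ...
            answer = extract_answer(keywords, passage, question)
            if answer != NOANS:
                return answer       # first passage that yields an answer
        # Every passage of this source returned NOANS: move on to S^{k+1}.
    return None


def toy_extractor(keywords: str, passage: str, question: str) -> str:
    """Stand-in for the LLM call: answers only if the passage says 'born in'."""
    marker = "born in "
    if marker in passage:
        return passage.split(marker, 1)[1].split()[0].rstrip(".")
    return NOANS
```

For example, with `sources = [["No relevant text."], ["Irrelevant.", "She was born in 1912."]]`, the first source is exhausted without an answer and the second passage of the second source supplies it.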

3.3.3 Fact Consistency Discrimination

Given the answer $e_{i}$ extracted from the model-generated text and the answer $a_{i}$ extracted from fact sources, our objective is to determine whether the two answers are factually consistent. To achieve this, we employ an LLM-based fact consistency discrimination (FCD) module (Prompt A), assigning a score of 0 or 1 to each fact unit $\langle q_{i}, e_{i} \rangle$. Subsequently, we calculate the average score of all fact units as the factuality score of the model-generated text:

\begin{align}
s_{i} &= \text{FCD}(e_{i}, a_{i}) \in \{0, 1\}, \tag{4} \\
s &= \frac{1}{N} \sum_{i=1}^{N} s_{i}. \tag{5}
\end{align}
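Eqs. (4) and (5) amount to averaging binary consistency judgments. The sketch below assumes a hypothetical `factuality_score` helper and replaces the LLM-based FCD module with a case-insensitive string match, purely so the example runs on its own; raising an error for an empty list of fact units is also an assumption.

```python
from typing import Callable, Sequence, Tuple


def factuality_score(
    pairs: Sequence[Tuple[str, str]],
    is_consistent: Callable[[str, str], bool],
) -> float:
    """Average the binary scores s_i over all (e_i, a_i) pairs."""
    if not pairs:
        raise ValueError("at least one fact unit is required")
    scores = [1 if is_consistent(e, a) else 0 for e, a in pairs]  # Eq. (4)
    return sum(scores) / len(scores)                               # Eq. (5)


def toy_discriminator(extracted: str, reference: str) -> bool:
    """Stand-in for the FCD module: case-insensitive exact match."""
    return extracted.strip().lower() == reference.strip().lower()
```

With four fact units of which two are judged consistent, the score is 0.5.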

3.4 Evaluation Criteria

We measure the discriminative power (DP) of the evaluation metric, as described by Sakai (2006). Given the collection of evaluated LLMs $M$ and all pairs $(M_{i}, M_{j}) \subset M$, we bootstrap sample the evaluation score on $M_{i}$ and $M_{j}$. Then, given a threshold value $f$, we obtain minority rate (MR) and proportion of ties (PT) values. The MR represents the failure rate of distinguishing the evaluation score differences between a pair of LLMs within the threshold. The PT indicates the percentage of cases where the pair of LLMs cannot be distinguished within the threshold. The smaller the values of MR and PT, the stronger the discriminative power of the evaluation metric. Finally, we have the MR-PT curve as the discriminative power of the evaluation metric. The details of the pseudocode of DP measurement are given in Appendix B.
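One common bootstrap formulation of MR and PT can be sketched as follows. This is not the pseudocode of Appendix B, which may differ in detail. The function name `minority_rate_and_ties`, the relative margin `fuzziness * max(mean_x, mean_y)`, the use of `<=` for ties, and the requirement that all models are scored on the same queries are assumptions of this example.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def minority_rate_and_ties(
    scores: Dict[str, List[float]],
    fuzziness: float,
    num_bootstrap: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Estimate (MR, PT) from per-query scores of each evaluated model."""
    rng = random.Random(seed)
    names = sorted(scores)
    n = len(scores[names[0]])
    pairs = list(combinations(names, 2))
    minority_total = 0
    tie_total = 0
    for x, y in pairs:
        gt_xy = gt_yx = ties = 0
        for _ in range(num_bootstrap):
            # Resample query indices with replacement, shared by both models.
            idx = [rng.randrange(n) for _ in range(n)]
            mean_x = sum(scores[x][i] for i in idx) / n
            mean_y = sum(scores[y][i] for i in idx) / n
            margin = fuzziness * max(mean_x, mean_y)
            if abs(mean_x - mean_y) <= margin:
                ties += 1          # difference falls within the threshold
            elif mean_x > mean_y:
                gt_xy += 1
            else:
                gt_yx += 1
        minority_total += min(gt_xy, gt_yx)  # the less frequent ordering
        tie_total += ties
    denom = num_bootstrap * len(pairs)
    return minority_total / denom, tie_total / denom
```

Sweeping `fuzziness` over a range of values and plotting the resulting (PT, MR) points would give an MR-PT curve of the kind described above.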

3.5 Evaluation Scenarios

To assess the importance of each fact source across various tasks, we introduce five evaluation scenarios, each represented by an ordered list of fact sources $S$. (1) $S = \langle S_{\text{se}}, S_{\text{lk}} \rangle$. (2) $S = \langle S_{\text{he}}, S_{\text{se}}, S_{\text{lk}} \rangle$. (3) $S = \langle S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$. (4) $S = \langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}} \rangle$. (5) $S = \langle S_{\text{rd}}, S_{\text{he}}, S_{\text{se}}, S_{\text{lk}} \rangle$. To thoroughly verify facts in the text and mitigate the hallucination of LLM knowledge, we retain and fix the verification order of $S_{\text{se}}$ and $S_{\text{lk}}$. By comparing the DP in scenarios (1), (2), and (3), we can infer the impact of fact sources. Comparing the DP in scenarios (4) and (5) reveals the effects of changing the verification order of fact sources.

Moreover, LLMs incorporating web search modules, such as Bing Chat, have been able to generate text while providing retrieved reference documents. In Section 5.2, we will discuss the impact of using these referenced documents as the supplementary fact source $S_{\text{rd}}$ in evaluation scenarios.

4 Experiments

4.1 Datasets

Based on the availability of human-written evidence and reference documents, we categorize the tasks as presented in Table 1. We run our evaluation pipeline on six datasets: NQ Lee et al. (2019), HotpotQA Yang et al. (2018), TruthfulQA Lin et al. (2022), CNN/DM Hermann et al. (2015), Multi-News Fabbri et al. (2019), and MS MARCO Bajaj et al. (2016). We collect 200 samples from each dataset and prompt the evaluated LLMs to generate verifiable facts in sufficient detail (Prompt A).

To compare with reference-based metrics, we construct a golden answer $G$ containing more facts for each task. (1) Open-domain QA: In the NQ dataset, we concatenated the provided short answers to form $G$. (2) Web Retrieval-based QA: In the HotpotQA dataset, we combined the short answer and the reference documents as the golden answer $G = [a; S_{\text{rd}}]$. (3) Expert-validated QA: In the TruthfulQA dataset, all provided human-written correct answers and best answers were considered as the fact source $S_{\text{he}}$, forming the golden answer $G$. (4) News Fact Generation: For the CNN/DM and Multi-News datasets, we first prompted ChatGPT (Prompt A) to generate a title of the given summary (considered as $S_{\text{he}}$). Then, we used the generated title to prompt the evaluated LLMs (Prompt A) to generate an introduction centered around the facts. (5) Retrieval-Augmented QA: In the MS MARCO dataset, the answer $a$ was regarded as $S_{\text{he}}$, and all user-clicked documents were considered as $S_{\text{rd}}$. The answer and the selected documents were concatenated to form $G$.

4.2 Baselines

$S_{\text{he}}$
$S_{\text{rd}}$
Dataset       NQ      HQA     TQA     C/D     M-N     MS
Avg. # of Tokens
Bing Chat     136.96  87.99   196.02  223.64  248.66  287.93
ChatGPT       398.36  336.26  393.41  534.74  531.17  561.91
llama-7b      455.14  427.46  453.58  433.66  436.47  459.84
llama-13b     431.99  406.42  432.24  422.61  424.83  455.64
vicuna-7b     353.72  332.16  387.88  398.84  401.92  413.28
vicuna-13b    341.00  327.38  346.32  366.60  369.31  398.76
Avg. # of Sentences
Bing Chat     5.15    3.42    7.50    8.00    8.20    10.54
ChatGPT       12.46   10.05   12.62   15.47   15.57   18.66
llama-7b      17.34   15.80   18.41   15.73   16.21   18.85
llama-13b     15.88   14.54   16.70   14.64   14.86   17.19
vicuna-7b     12.25   11.39   14.50   13.03   13.17   15.98
vicuna-13b    11.72   10.85   12.31   12.16   12.10   14.77
Avg. # of Facts Extracted Using ChatGPT
Bing Chat     4.12    3.08    4.78    5.32    5.64    5.85
ChatGPT       5.60    5.17    5.41    5.94    5.92    6.04
llama-7b      4.86    4.97    5.06    5.15    5.18    5.30
llama-13b     5.14    5.11    5.00    4.96    5.28    5.11
vicuna-7b     5.76    5.66    5.34    6.04    6.18    5.46
vicuna-13b    5.51    5.56    5.26    5.90    5.66    5.62
Table 2: Statistics of model-generated text on six datasets. “HQA”, “TQA”, “C/D”, “M-N”, and “MS” are abbreviations of “HotpotQA”, “TruthfulQA”, “CNN/DM”, “Multi-News” and “MS MARCO”.

Evaluated Models We evaluate six existing LLMs with varying parameter scales in our experiments: (1) Bing Chat is a GPT-4-based model specifically tailored for web searches (https://blogs.bing.com/search/march_2023/Confirmed-the-new-Bing-runs-on-OpenAI%E2%80%99s-GPT-4). For this model, we choose the “Precise” generation mode to test the factuality when the model is expected to generate the most accurate and detailed fact units. For each URL that Bing Chat provides, we extract the text of all <p> tags on the corresponding web page. Subsequently, we divide the text into multiple passages, each containing no more than 1024 tokens. (2) ChatGPT: we use OpenAI’s ChatGPT API (gpt-3.5-turbo-1106) for text generation (https://platform.openai.com/docs/api-reference/chat). (3) LLaMA Touvron et al. (2023): we select two fine-tuned LLaMA-series models for text generation (LLaMA-2-7b-chat and LLaMA-2-13b-chat). (4) Vicuna Chiang et al. (2023): Vicuna is a chat assistant developed by fine-tuning the LLaMA-2 foundation model on user-shared conversations collected from ShareGPT (https://sharegpt.com/). We select Vicuna-7b-v1.5 and Vicuna-13b-v1.5 to generate text. The statistics of the text generated by these LLMs are given in Table 2.
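The passage construction described for Bing Chat's reference pages can be sketched with the Python standard library. The names `ParagraphExtractor` and `split_into_passages` are hypothetical, and counting tokens by whitespace splitting is an assumption of this example, since the tokenizer behind the 1024-token limit is not specified here.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self._buffer: List[str] = []
        self.paragraphs: List[str] = []

    def _end_paragraph(self) -> None:
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._end_paragraph()  # an unclosed <p> ends at the next <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._end_paragraph()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._end_paragraph()  # keep text of a trailing unclosed <p>


def split_into_passages(html: str, max_tokens: int = 1024) -> List[str]:
    """Join all <p> text and cut it into chunks of at most max_tokens tokens."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    tokens = " ".join(parser.paragraphs).split()  # whitespace tokens (assumed)
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Text outside `<p>` elements, such as navigation menus inside `<div>` tags, is ignored, while inline markup inside a paragraph is flattened into plain text.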

Baseline Evaluation Metrics We compare our proposed pipeline with both reference-based and reference-free metrics.

(1) Reference-based metrics. Such metrics require a golden answer $G$ and calculate the consistency with the model-generated text. BLEU Papineni et al. (2002) and ROUGE Lin and Och (2004) are used to measure the token-level term overlap. BERTScore Zhang et al. (2019) and BARTScore Yuan et al. (2021) are model-based metrics to evaluate passage-level similarity. QAGS Wang et al. (2020) and $Q^{2}$ Honovich et al. (2021) are the most relevant PLM-based and NLI-based metrics to evaluate factuality.

(2) Reference-free metrics. FactScore Min et al. (2023) first breaks down the model-generated text into several claims. Subsequently, these claims are verified against Wikipedia dumps. In this study, we use the set of all golden answers as the corpus for FactScore verification. FacTool Chern et al. (2023) verifies each claim by employing a search engine and derives factuality scores at the claim level.
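To make the notion of token-level term overlap concrete, the sketch below computes a unigram-overlap F1 between a generated text and a golden answer. It is a simplified stand-in in the spirit of such metrics, not an implementation of BLEU or ROUGE, and the function name `unigram_overlap_f1` is hypothetical.

```python
from collections import Counter


def unigram_overlap_f1(candidate, reference):
    """F1 over clipped unigram matches between two whitespace-tokenised texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


golden = "Paris is the capital of France"
print(unigram_overlap_f1("Lyon is the capital of France", golden))
```

In this toy pair, five of six tokens match although the single differing token makes the statement false, which illustrates why overlap alone is a weak signal of factuality.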

Figure 3: MR-PT discriminative power curve of evaluation metrics on datasets. The closer the curve is to the bottom-left corner, the better the evaluation metric is. Our proposed model curve is represented by a solid line, while the baseline model curve is depicted using a dashed line.
Figure 4: MR-PT discriminative power curve of evaluation metrics on datasets. The closer the curve is to the bottom-left corner, the better the evaluation metric is. We incorporate reference documents retrieved by Bing Chat as part of the fact source $S_{\text{rd}}$. Non-LLM-based methods are omitted for clarity.
| Scenarios | Pearson ↑ (FacTool) | Pearson ↑ (FactScore) | Spearman ↑ (FacTool) | Spearman ↑ (FactScore) | #Avg. Tokens ↓ (FacTool) | #Avg. Tokens ↓ (FactScore) | #Avg. Time ↓ (FacTool) | #Avg. Time ↓ (FactScore) |
|---|---|---|---|---|---|---|---|---|
| $\langle S_{\text{se}}, S_{\text{lk}}\rangle$ | 0.296 | 0.272 | 0.308 | 0.283 | +1.20% | -6.71% | -3.44% | -22.31% |
| $\langle S_{\text{he}}, S_{\text{se}}, S_{\text{lk}}\rangle$ | 0.275 | 0.269 | 0.276 | 0.270 | -3.79% | -12.21% | -2.03% | -19.26% |
| $\langle S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}}\rangle$ | 0.279 | 0.278 | 0.283 | 0.290 | +12.33% | -9.97% | -2.31% | -16.02% |
| $\langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}}\rangle$ | 0.263 | 0.269 | 0.271 | 0.281 | +6.25% | -11.03% | +2.43% | -18.75% |
| $\langle S_{\text{rd}}, S_{\text{he}}, S_{\text{se}}, S_{\text{lk}}\rangle$ | 0.267 | 0.283 | 0.271 | 0.295 | +11.20% | -10.35% | +7.13% | -19.63% |

Table 3: Pearson’s and Spearman’s correlation coefficients for UFO and the LLM-based baseline evaluation metrics under five evaluation scenarios. All $p$-values are less than 0.01. Additionally, we compare the usage of ChatGPT API tokens and the average time required for evaluating samples. The best results are marked bold.

5 Results and Analysis

5.1 Discriminative Power Results

Our goal is to evaluate the discriminative power Sakai (2006) of the proposed evaluation pipeline UFO in each scenario. For simplicity, in a scenario such as $S=\langle S_{\text{se}}, S_{\text{lk}}\rangle$, we name our framework “ufo(se+lk)”. The experimental results are shown in Figure 3. Each point on the curve represents the values of MR and PT calculated at a given threshold $f$. We have the following findings:

(1) Among baseline metrics, BARTScore demonstrates minimal variance with a notably low MR value, while QA-based metrics like QAGS and $Q^{2}$ show suboptimal discriminative power across all tasks. This indicates that PLM-based methods are particularly reliant on the quality of the golden answer, especially its entities and relationships. FactScore Min et al. (2023) verifies each extracted claim against a predefined fact source and enhances the LLM-based method’s capability through In-Context Learning with demonstrations. It thus shows relatively high discriminative power, at the price of additional time and token usage. We discuss the API and time usage in Section 5.4.

(2) The HotpotQA and TruthfulQA datasets provide $S_{\text{rd}}$ and $S_{\text{he}}$, respectively. However, UFO in the scenario $\langle S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}}\rangle$ yields weaker DP on HotpotQA. This suggests that the quality of $S_{\text{rd}}$ is inferior to that of search engine results $S_{\text{se}}$ and LLM knowledge $S_{\text{lk}}$: search engines can retrieve more relevant details for the entities involved in multi-hop reasoning. TruthfulQA contains a significant share of easily confused questions, hence the fact source $S_{\text{he}}$ has higher quality. UFO in the scenario $\langle S_{\text{he}}, S_{\text{se}}, S_{\text{lk}}\rangle$ shows a substantial increase in DP compared to the scenario without $S_{\text{he}}$. This also indicates the necessity of incorporating human-written evidence in expert-validated QA tasks.

(3) CNN/DM, Multi-News, and MS MARCO provide both $S_{\text{he}}$ and $S_{\text{rd}}$. However, in the news fact generation task, UFO in the scenario $S=\langle S_{\text{se}}, S_{\text{lk}}\rangle$ achieves the highest DP, indicating that both $S_{\text{he}}$ and $S_{\text{rd}}$ have a negative impact on DP. In fact, the factual details within the provided documents and summaries are considerably limited and not specific enough. Thus, employing search engines enables precise retrieval of factual details based on the specific extracted questions. In the retrieval-augmented QA task, $S_{\text{he}}$ and $S_{\text{rd}}$ significantly enhance DP. Moreover, changing the order of verification between $S_{\text{he}}$ and $S_{\text{rd}}$ does not significantly affect the DP value. This reflects a high degree of consistency between user-clicked reference documents and human-written evidence. It suggests that, for this task, reference documents and human-written evidence can substitute for each other, and we can rely solely on user-clicked reference documents without collecting human-written evidence.
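The role played by the order of fact sources in a scenario such as $\langle S_{\text{he}}, S_{\text{se}}, S_{\text{lk}}\rangle$ can be illustrated with a small sketch. It assumes that sources are consulted in the listed order, that the first source returning an answer other than NOANS decides the verdict, and that the factuality score is the fraction of verified fact units; these are assumptions made for illustration, and the toy dictionary sources and string comparison replace the LLM-based answer extraction and consistency judgment. All names (`lookup_source`, `verify_fact`, `factuality_score`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

NOANS = "NOANS"  # sentinel returned when a source holds no relevant evidence

# Hypothetical interface: a fact source maps a question to an extracted answer.
FactSource = Callable[[str], str]


def lookup_source(table: Dict[str, str]) -> FactSource:
    """Toy fact source backed by a dict; stands in for LLM-based extraction."""
    return lambda question: table.get(question, NOANS)


def same_answer(evidence_answer: str, claimed_answer: str) -> bool:
    """Toy consistency check; an LLM judgment would be used in practice."""
    return evidence_answer.strip().lower() == claimed_answer.strip().lower()


def verify_fact(question: str, claimed_answer: str,
                sources: Sequence[FactSource]) -> bool:
    for source in sources:
        evidence_answer = source(question)
        if evidence_answer != NOANS:
            # The first source that answers decides the verdict.
            return same_answer(evidence_answer, claimed_answer)
    # Assumption: a fact that no source can answer counts as unsupported.
    return False


def factuality_score(facts: List[Tuple[str, str]],
                     sources: Sequence[FactSource]) -> float:
    if not facts:
        return 0.0
    return sum(verify_fact(q, a, sources) for q, a in facts) / len(facts)


s_he = lookup_source({"Capital of France?": "Paris"})
s_se = lookup_source({"Capital of France?": "Lyon",
                      "Author of Hamlet?": "Shakespeare"})
facts = [("Capital of France?", "Paris"), ("Author of Hamlet?", "Marlowe")]

print(factuality_score(facts, [s_he, s_se]))  # S_he consulted first
print(factuality_score(facts, [s_se, s_he]))  # S_se consulted first
```

When two sources agree on the answers they provide, as reported for $S_{\text{he}}$ and $S_{\text{rd}}$ on MS MARCO, swapping their order leaves the score unchanged; when they disagree, as in this toy data, the earlier source dominates.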

| Model | NQ (UFO) | NQ (FT) | HotpotQA (UFO) | HotpotQA (FT) | TruthfulQA (UFO) | TruthfulQA (FT) | CNN/DM (UFO) | CNN/DM (FT) | Multi-News (UFO) | Multi-News (FT) | MS MARCO (UFO) | MS MARCO (FT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bing Chat | 0.752 | 0.615 | 0.630 | 0.709 | 0.649 | 0.594 | 0.628 | 0.742 | 0.685 | 0.745 | 0.725 | 0.787 |
| ChatGPT | 0.762 | 0.776 | 0.635 | 0.725 | 0.662 | 0.700 | 0.669 | 0.806 | 0.708 | 0.806 | 0.765 | 0.845 |
| llama-7b | 0.610 | 0.480 | 0.465 | 0.453 | 0.630 | 0.560 | 0.607 | 0.673 | 0.589 | 0.689 | 0.711 | 0.757 |
| llama-13b | 0.674 | 0.596 | 0.537 | 0.537 | 0.597 | 0.564 | 0.573 | 0.688 | 0.661 | 0.731 | 0.735 | 0.731 |
| vicuna-7b | 0.670 | 0.631 | 0.515 | 0.562 | 0.692 | 0.610 | 0.603 | 0.714 | 0.593 | 0.688 | 0.662 | 0.755 |
| vicuna-13b | 0.676 | 0.658 | 0.514 | 0.570 | 0.717 | 0.664 | 0.652 | 0.740 | 0.646 | 0.731 | 0.739 | 0.778 |

Table 4: Factuality scores of our proposed evaluation framework UFO in the scenario of $S=\langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}}\rangle$ and FacTool (abbreviated to “FT”) on six datasets. Within the group of evaluated LLMs, the highest factuality score is bold, and the second highest score is underlined.

5.2 Effect of Model-Retrieved Documents

Some existing LLMs provide retrieved reference documents during text generation. We incorporate these as part of $S_{\text{rd}}$ to evaluate the corresponding LLM (i.e., Bing Chat in our experiments). The discriminative power of LLM-based evaluation metrics is shown in Figure 4. We have the following findings:

(1) In the NQ, HotpotQA, and TruthfulQA datasets, incorporating retrieved reference documents in the evaluation scenarios enhances the DP. Specifically on the TruthfulQA dataset, we further observe that UFO in the scenario $\langle S_{\text{rd}}, S_{\text{he}}, S_{\text{se}}, S_{\text{lk}}\rangle$ significantly boosts discriminative power and surpasses FactScore. This implies that the model accurately retrieves relevant reference documents based on easily confused facts during text generation.

(2) In the CNN/DM dataset, UFO in the scenario $\langle S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}}\rangle$ shows a slight improvement when incorporating model-retrieved documents compared to using only search engines and LLM knowledge $\langle S_{\text{se}}, S_{\text{lk}}\rangle$. This suggests that retrieved reference documents serve as a beneficial complement to search engine results in this task.

(3) In the MS MARCO dataset, the enhancement of DP brought by incorporating retrieved reference documents is minimal, indicating a factual consistency between human-written evidence, clicked reference documents, and retrieved reference documents.

5.3 Factuality Scores of LLMs

In addition to evaluating discriminative power, we also calculate the factuality scores of the evaluated LLMs on various datasets. Under the evaluation scenario $S=\langle S_{\text{he}}, S_{\text{rd}}, S_{\text{se}}, S_{\text{lk}}\rangle$, the comparative experimental results between our proposed framework UFO and FacTool are presented in Table 4. Both evaluation methods show that in most datasets, the factuality score of Bing Chat in “precise” mode is slightly lower than that of ChatGPT, but close to the score of Vicuna-13b. This implies that hallucinations occur during the retrieval-augmented generation process, thereby reducing the factual accuracy of the generated text. We also observe that increasing the parameter scale of open-source LLMs (LLaMA and Vicuna) can enhance factual accuracy in most datasets.

5.4 Cost of LLM-based Metrics

Our proposed pipeline sequentially extracts answers from listed text passages in the fact source. To assess the efficiency of our proposed evaluation pipeline, we compare the average evaluation time, API token costs, and the correlation coefficient with existing LLM-based evaluation metrics. Experimental results (shown in Table 3) demonstrate that our proposed pipeline achieves the highest correlation coefficient with FacTool and FactScore in the scenarios where the fact sources are set to $\langle S_{\text{se}}, S_{\text{lk}}\rangle$ and $\langle S_{\text{rd}}, S_{\text{he}}, S_{\text{se}}, S_{\text{lk}}\rangle$, respectively.
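For readers who want to reproduce this kind of comparison, the following sketch computes Pearson's and Spearman's coefficients between two lists of per-sample scores using only the standard library. The score lists are made-up toy values, not results from the paper, and the helper names are hypothetical; Spearman's coefficient is obtained as Pearson's coefficient on average ranks, which handles ties.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-sample scores from two hypothetical metrics (illustrative only).
metric_a = [0.2, 0.4, 0.4, 0.9, 0.7]
metric_b = [0.1, 0.5, 0.3, 0.8, 0.9]
print(round(pearson(metric_a, metric_b), 3), round(spearman(metric_a, metric_b), 3))
```

Significance testing (the $p$-values reported in Table 3) is not covered by this sketch.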

Besides, in comparison to FactScore, our proposed evaluation pipeline reduces token consumption by about 10% and time cost by about 20%, while maintaining comparable discriminative power. Compared to FacTool, in most of the five evaluation scenarios, we achieve greater discriminative power with an affordable additional cost of about 10% in all tasks. This implies that the incorporation of fact sources can enhance the discriminative power of evaluation metrics.
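The approximate percentages quoted here can be checked against the per-scenario values in Table 3. The short calculation below averages the relative changes over the five scenarios; taking an unweighted mean is an assumption about how the figures were summarized, and the variable names are illustrative.

```python
# Relative change of UFO versus each baseline, copied from Table 3 (percent),
# in the row order of the table.
tokens_vs_factool = [1.20, -3.79, 12.33, 6.25, 11.20]
tokens_vs_factscore = [-6.71, -12.21, -9.97, -11.03, -10.35]
time_vs_factool = [-3.44, -2.03, -2.31, 2.43, 7.13]
time_vs_factscore = [-22.31, -19.26, -16.02, -18.75, -19.63]


def mean(values):
    return sum(values) / len(values)


for label, column in [
    ("tokens vs FactScore", tokens_vs_factscore),
    ("time vs FactScore", time_vs_factscore),
    ("tokens vs FacTool", tokens_vs_factool),
    ("time vs FacTool", time_vs_factool),
]:
    print(f"{label}: {mean(column):+.2f}%")
# tokens vs FactScore: -10.05%
# time vs FactScore: -19.19%
# tokens vs FacTool: +5.44%
# time vs FacTool: +0.36%
```

Under this reading, the savings relative to FactScore match the stated figures closely, while the token overhead relative to FacTool varies by scenario and peaks at +12.33%.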

6 Conclusion

In this paper, we propose UFO, a factuality evaluation pipeline incorporating flexible plug-and-play fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge with unified verification methods. Experimental results on five evaluation scenarios show that open-domain QA, web retrieval-based QA, and expert-validated QA tasks require high-quality human-written evidence and model-retrieved reference documents, and retrieval-augmented QA needs either human-written evidence or user-clicked reference documents, while news fact generation tasks rely on search engine results and LLM knowledge.

Limitations

Though our proposed pipeline analyzes different fact sources, there are still several limitations: (1) We utilize ChatGPT (gpt-3.5-turbo-1106) in the modules of our proposed evaluation framework; therefore, updates to this LLM will affect both the cost and the effectiveness of our evaluation. (2) Currently, our approach has not yet been integrated with the training process of LLMs. In future work, we will consider incorporating the factuality score evaluated by our framework into the training process of LLMs through reinforcement learning methods.

Ethics Statement

The datasets used in this paper are available publicly online. Particularly, any data involving sensitive information has been anonymized, ensuring that it cannot be traced back to individuals.

References

Appendix A Prompt

Your task is to segment a given document into several atomic claims. For each claim, you need to generate several questions related to it and extract an answer for each question from that claim. Your output is a JSON list. Each element includes the question, the answer, and the sentence from the document containing the atomic claims. You MUST only respond in the JSON List format as described below. DO NOT RESPOND WITH ANYTHING ELSE. ADDING ANY OTHER EXTRA NOTES THAT VIOLATE THE RESPONSE FORMAT IS BANNED. START YOUR RESPONSE WITH ’[’. [Response Format] ["question": "Informative question", "answer": "A concise phrase under 10 words", "sentence": "Sentence containing the answer.", …] document: {document}
Generate keywords for the following document. Do not provide any explanations. document: {document} keywords:

Prompt 3: LLM-based answer extraction with $S_{\text{he}}$ and $S_{\text{rd}}$

You are an answer-extraction expert. Your task is to extract a short answer from the evidence to the question. Directly answer without any explanations. If the evidence is irrelevant to the question, respond ONLY with "NOANS". keywords: {keywords} evidence: {evidence} question: {question} your answer:

Prompt 4: LLM-based answer extraction with $S_{\text{se}}$ and $S_{\text{lk}}$

You are a question-answering expert. You are given a question, keywords, and some snippets. Your task is to output a short answer to the question based on the snippets or the knowledge you possess, while your answer is factually consistent with the given keywords. If your answer is based on the snippets, you should provide the indices of the snippets. If there is no relevant snippet, you should answer with the knowledge you possess, and the output index is [-1]. If you are uncertain about the correctness and timeliness of your answer, your answer should be formed as [NOANS] instead. An example output format: [<your answer>]; [<index1>, <index2>, …]. Your output MUST begin with ‘[‘. DO NOT GIVE ANY EXPLANATIONS. question: {question} keywords: {keywords} snippets: {snippets}
Your task is to judge whether the following two answers are factually consistent. Directly answer yes or no. Answer 1: {$e_i$} Answer 2: {$a_i$}
You have been presented with the following title. Your task is to provide a comprehensive introduction to the query topic with sufficient verifiable facts based on the knowledge you possess. Your output must be in English. Title: {title} Introduction:
Generate a summarized title for the following document. Do not provide any explanations. document: {document} title:
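
The output format requested in Prompt 4 above, “[<your answer>]; [<index1>, <index2>, …]”, with [NOANS] for uncertain answers and index [-1] for answers drawn from the model's own knowledge, lends itself to a small parser. The sketch below is one possible way to read such responses; the function name `parse_prompt4_output` and the decision to map NOANS or malformed output to `None` are assumptions, not part of the released code.

```python
import re
from typing import List, Optional, Tuple

# "[answer]" optionally followed by "; [i, j, ...]".
_PATTERN = re.compile(
    r"^\s*\[(?P<answer>.*?)\]\s*(?:;\s*\[(?P<indices>[^\]]*)\])?\s*$",
    re.DOTALL,
)


def parse_prompt4_output(raw: str) -> Tuple[Optional[str], List[int]]:
    """Return (answer, snippet indices).

    The answer is None when the model replied [NOANS] or when the output
    does not follow the requested format. An index of -1 means the answer
    came from the model's own knowledge rather than from a snippet.
    """
    match = _PATTERN.match(raw)
    if match is None:
        return None, []
    answer = match.group("answer").strip()
    if not answer or answer == "NOANS":
        return None, []
    indices_text = match.group("indices") or ""
    indices = [int(tok) for tok in re.findall(r"-?\d+", indices_text)]
    return answer, indices


print(parse_prompt4_output("[Paris]; [0, 2]"))
print(parse_prompt4_output("[NOANS]"))
```

Treating malformed output like NOANS lets a caller fall through to the next fact source instead of failing on an unexpected response.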

Appendix B Pseudocode for DP Measurement

Algorithm 1 Discriminative Power Measurement with MR-PT Curve
for each $threshold \in \{0, 0.01, \cdots, 0.20\}$ do
    $f \leftarrow threshold$
    $B \leftarrow 1000$
    for each $(M_i, M_j) \in M$ do
        $EQ(i,j) \leftarrow 0$, $GT(i,j) \leftarrow 0$, $GT(j,i) \leftarrow 0$
        for $b = 1$ to $B$ do
            $Q_i = \text{mean}(\text{Bootstrap}(M_i))$
            $Q_j = \text{mean}(\text{Bootstrap}(M_j))$
            $m = f \cdot \max(Q_i, Q_j)$
            if $|Q_i - Q_j| < m$ then
                $EQ(i,j) \leftarrow EQ(i,j) + 1$
            else if $Q_i > Q_j$ then
                $GT(i,j) \leftarrow GT(i,j) + 1$
            else
                $GT(j,i) \leftarrow GT(j,i) + 1$
            end if
        end for
    end for
    $MR_f \leftarrow \frac{\sum_{M_i, M_j} \min(GT(i,j), GT(j,i))}{B \sum_{M_i, M_j} 1}$
    $PT_f \leftarrow \frac{\sum_{M_i, M_j} EQ(i,j)}{B \sum_{M_i, M_j} 1}$
end for
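
A runnable counterpart to Algorithm 1 is sketched below, assuming that each evaluated model is represented by a list of per-sample metric scores, that the counters are reset for every threshold, and that MR and PT are normalized by $B$ times the number of model pairs. The function name `mr_pt_curve`, its parameters, and the toy scores are illustrative and not taken from the released code.

```python
import random
from itertools import combinations


def mr_pt_curve(scores, thresholds=None, num_bootstrap=1000, seed=0):
    """Return a list of (f, MR_f, PT_f) tuples, one per threshold f.

    scores: dict mapping a model name to its list of per-sample scores.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(21)]  # 0, 0.01, ..., 0.20
    rng = random.Random(seed)
    pairs = list(combinations(scores, 2))
    curve = []
    for f in thresholds:
        minority_total = 0  # sum over pairs of min(GT(i,j), GT(j,i))
        tie_total = 0       # sum over pairs of EQ(i,j)
        for name_i, name_j in pairs:
            eq = gt_ij = gt_ji = 0
            s_i, s_j = scores[name_i], scores[name_j]
            for _ in range(num_bootstrap):
                # Resample each model's scores with replacement, then average.
                q_i = sum(rng.choices(s_i, k=len(s_i))) / len(s_i)
                q_j = sum(rng.choices(s_j, k=len(s_j))) / len(s_j)
                margin = f * max(q_i, q_j)
                if abs(q_i - q_j) < margin:
                    eq += 1
                elif q_i > q_j:
                    gt_ij += 1
                else:
                    gt_ji += 1
            minority_total += min(gt_ij, gt_ji)
            tie_total += eq
        denom = num_bootstrap * len(pairs)
        curve.append((f, minority_total / denom, tie_total / denom))
    return curve


# Toy scores for three hypothetical models (illustrative values only).
toy_scores = {
    "model_a": [0.9, 0.8, 0.7, 0.9, 0.6, 0.8],
    "model_b": [0.7, 0.8, 0.6, 0.7, 0.5, 0.9],
    "model_c": [0.3, 0.4, 0.2, 0.5, 0.3, 0.4],
}
for f, mr, pt in mr_pt_curve(toy_scores, num_bootstrap=200)[:3]:
    print(f"f={f:.2f}  MR={mr:.3f}  PT={pt:.3f}")
```

Plotting MR against PT over all thresholds gives a curve of the kind shown in Figures 3 and 4. The sketch follows the listing in drawing independent bootstrap samples for the two models of a pair.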