Molecular Facts: Desiderata for Decontextualization
in LLM Fact Verification
Abstract
Automatic factuality verification of large language model (LLM) generations is increasingly used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack context to interpret correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontextuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology (https://github.com/anisha2102/molecular_facts) for generating molecular facts automatically, aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.
Anisha Gunjal, Greg Durrett
The University of Texas at Austin
1 Introduction
Large language models (LLMs) have emerged as powerful tools for delivering knowledge to users, either via closed-book generation or retrieval-augmented systems. However, these systems may not always produce correct facts Liu et al. (2023a), an instance of the “hallucination” problem Zhang et al. (2024); Ji et al. (2022); Zhang et al. (2023). Recent research has shown the potential of LLMs to identify unfaithful content and enable automatic fact-checking and attribution against sources Falke et al. (2019); Goyal and Durrett (2021); Min et al. (2023); Wang et al. (2024); Chern et al. (2023); Wei et al. (2024); Chen et al. (2023a); Malaviya et al. (2024); Gao et al. (2023b); Tang et al. (2024).
A key step in this process is to break down generated content into individual atomic claims Fabbri et al. (2022); Chen et al. (2023b); Kamoi et al. (2023b); Min et al. (2023). This decomposition allows for retrieval of evidence focused on a particular part of the generated content Gao et al. (2023a); Wang et al. (2024); Chen et al. (2024) and also enables error localization by determining which parts of the content are supported or not. However, this step is not straightforward. Wanner et al. (2024) highlight that the effectiveness of automatic factuality verification is heavily dependent on the strategies employed for decomposing content into claims. In particular, LLMs have a propensity to incorrectly merge information about similarly named entities Lee et al. (2024), and current evaluation methods struggle to handle these ambiguities in atomic claims Chiang and yi Lee (2024). Figure 1 shows a possible issue: a fact that is “too atomic” can be validated against evidence that doesn’t actually support it.
In this work, we address the problem of how to find minimal yet still unambiguous facts for LLM fact verification. We frame this problem as one of decontextualization, adding context to a sentence to make it stand alone while retaining its original meaning Choi et al. (2021). This process draws on the idea of specificity from discourse Louis and Nenkova (2012), specifically whether sentences can express key information about the participants without ambiguity Li et al. (2016). However, making a claim unambiguous is not enough: when escalating from simple pronoun replacement in atomic facts to elaborations like a Swedish footballer in Figure 1, we must balance the specificity of the fact with how easy it will be to verify. It is not trivial to select the “right” information to elaborate on a claim without compromising the ease of verification.
We define two criteria needed in this fact-checking setting: decontextuality, where the claim should uniquely specify entities, events, and context, and minimality, maintained by avoiding excessive additional information that could complicate verification. We propose a notion of molecular facts, which balances these two criteria: molecular facts should be fully specific while compatible with the maximum number of possible evidence documents. We explore these criteria and our molecular facts in two settings. First, we address the question of how much non-minimality could be a problem for error localization with standard decontextualization techniques. We devise a synthetic fact-checking experiment where particular nuances of an output generation are unsupported and show that an average of 6% of claims may pose problems for error localization. In a setting with LLM responses of 5 sentences with 3 claims each, this would lead to localization errors in a large fraction of responses. We then evaluate the opposite problem, whether decontextualization is too minimal. We study a dataset of fact-checking with ambiguous entity names presented in Chiang and yi Lee (2024). We show that our method of molecular fact generation balances accuracy under ambiguous entity references with minimality of claims.
Our main contributions are: (1) We re-examine the decontextualization process for fact-checking and define molecular claims following the desiderata of decontextuality and minimality. (2) We investigate the loss of minimality due to claim decontextualization and its impacts on error localization. (3) We find that molecular claims are more performant and minimal for long-form generations than existing decontextualization methods.
2 Desiderata for Decontextualization
We propose desiderata to determine the optimal level of decontextualization required for atomic facts. An atomic fact is defined as a discrete unit of information, derived from a broader claim, and variously described in the literature as propositions, subclaims, summary content units, or atomic content units Nenkova and Passonneau (2004); Liu et al. (2023b); Zhang and Bansal (2021); Chen et al. (2023b); Min et al. (2023); Kamoi et al. (2023b).
Although an atomic fact theoretically represents a singular conceptual unit, recent NLP work using this notion typically does not define it rigorously from the standpoint of semantics. Wanner et al. (2024) demonstrate a high variation in the number of subclaims generated by different decomposition methods, with the macro-average of subclaims per biography ranging from 20.2 using the method by Kamoi et al. (2023b) to 32.9 with the approach by Chen et al. (2023b). Note that in Figure 1, She was a medallist at the European Athletics Championships in 1986 could be kept as one unit or broken into three facts evaluating her status as a medallist, the venue, and the date.
2.1 Desiderata
Preliminaries
We define $y$ as a response from a language model to an input prompt $x$, consisting of a series of claims $c_1, \ldots, c_n$ to be verified. Claims are extracted through an upstream process of decomposition and potentially filtering for “check-worthiness” (i.e., does the claim present factual content or does it present an opinion?). We describe the prompting in Appendix A.
We assume that in the context of $x$ and $y$, a claim $c_i$ can be fully interpreted with a truth-conditional meaning $M(c_i \mid x, y)$. In the terminology of Rashkin et al. (2021) and Choi et al. (2021), $M(c_i \mid x, y)$ represents $c_i$ interpreted in the linguistic context of $x$ and $y$.
We can construct a standalone proposition with truth-conditional meaning equivalent to $M(c_i \mid x, y)$ by being sufficiently specific. For example, the statement in Figure 1 could be completely specified as Ann Jansson, the Swedish footballer born on 6 May 1957 who played for Hammarby IF, won a medal at the European Athletics Championship, the biennial event organized by the European Athletics Association, in 1986.
Decontextualization
Our goal in this work is to produce rewritten molecular claims. Denote by $m_i$ the rewritten form of $c_i$, which should have semantics $M(c_i \mid x, y)$ when interpreted as a standalone proposition. As in Figure 1, this requires adding disambiguating information that could provide information needed to identify an entity (specifying that Jansson is a Swedish footballer), identify an event (specifying that the event happened in 1986), specify a qualification (in the field of biochemistry, …), or more.
Criterion 1 (Decontextuality)
When interpreted as a standalone statement, $m_i$ must have the truth-conditional meaning $M(c_i \mid x, y)$. That is, it should uniquely specify entities, events, and other context such that the claim is now interpretable.
This criterion is equivalent to Definition 1 from Choi et al. (2021). For the settings we consider, the level of added information needed to specify the meaning of a statement like that in Figure 1 may be higher than in past applications like Choi et al. (2021). It is not sufficient to replace the pronoun she with Ann Jansson; we need to specify Ann Jansson, the Swedish footballer. Similarly, the city George Town could refer to a city in the Cayman Islands or Malaysia, so it must be decontextualized appropriately with a descriptor like George Town, a city in the Cayman Islands.
Other work such as question answering frameworks based on clarifying questions can target this information Newman et al. (2023), but may fail to integrate the minimal new information needed, which we describe next.
Minimality
Adding too much information to a claim makes it less minimal. For instance, replacing “Ann Jansson” with “Ann Jansson, a Swedish footballer” requires verifying that a context referring to Ann Jansson is indeed talking about the Swedish footballer. Taken further, the reference “Ann Jansson, the Swedish footballer born on 6 May 1957 who played for Hammarby IF” is clearly suboptimal. It requires verifying Jansson’s birthdate as an additional detail, and crucially, this detail won’t be frequently reported in documents about Ann Jansson.
Define $D^{*}(m)$ as the set of evidence documents that support a statement $m$ given an oracle understanding of the entities involved. For instance, this would contain a document describing the correct Ann Jansson, even if it did not confirm all the details about her life. Define $D(m)$ to be the set of evidence documents that fully support the statement $m$. For instance, in the case of Ann Jansson above, the document would need to specify Jansson’s birthdate if this is contained in $m$.
Criterion 2 (Minimality)
Given a set of statements $m^{(1)}, \ldots, m^{(k)}$ that all decontextualize a claim $c_i$, we should select the $m^{(j)}$ that maximizes $|D(m^{(j)})|$, the size of the set of supporting evidence documents.
This criterion means that, when selecting distinguishing details for an entity, we should choose those that can typically be inferred from evidence. For instance, “Jason Martin” may be characterized either as a “rugby player” or specifically as a “former player for North Queensland Cowboys.” Since “rugby player” is a more enduring and widely recognized description, yet still specific enough to indicate Jason Martin, it is more likely to be supported by a larger number of documents.
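As a minimal sketch of Criterion 2, the selection can be written as an argmax over candidate rewrites. The names `select_most_minimal` and `toy_fully_supports` are hypothetical, the word-overlap check is only a stand-in for a real entailment model, and the three evidence strings are invented for illustration. The sketch assumes every candidate already satisfies Criterion 1.

```python
import re
from typing import Callable, Sequence


def select_most_minimal(
    candidates: Sequence[str],
    documents: Sequence[str],
    fully_supports: Callable[[str, str], bool],
) -> str:
    """Return the candidate that the largest number of documents fully support.

    Assumes all candidates are already decontextualized (Criterion 1).
    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")

    def support_count(candidate: str) -> int:
        return sum(1 for doc in documents if fully_supports(doc, candidate))

    return max(candidates, key=support_count)


def toy_fully_supports(document: str, statement: str) -> bool:
    """Toy stand-in for an entailment model: every word of the statement
    must appear somewhere in the document."""
    doc_words = set(re.findall(r"\w+", document.lower()))
    return all(w in doc_words for w in re.findall(r"\w+", statement.lower()))


# Invented evidence snippets, used only to exercise the function.
docs = [
    "Jason Martin is a rugby player",
    "Rugby player Jason Martin retired",
    "Jason Martin former player for North Queensland Cowboys rugby",
]
cands = [
    "Jason Martin former player for North Queensland Cowboys",
    "Jason Martin rugby player",
]
best = select_most_minimal(cands, docs, toy_fully_supports)
print(best)
```

In practice the document set is not known in advance, so this selection is an idealization: it states what a good rewrite should optimize rather than a procedure that can be run against the open web.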
Past work like Choi et al. (2021) instructs annotators to make minimal edits to statements. However, they do not provide guidance on what criteria should be used to choose from among multiple candidate edits.
Molecular facts
These two criteria suggest two things. First, atomic facts can be “too atomic”: they may need to be decontextualized. Second, it is still valuable to have a reasonably minimal fact so it can be supported by many possible evidence documents.
Molecular Fact
A molecular fact is a statement $m_i$ corresponding to claim $c_i$ that obeys Criteria 1 and 2: it should uniquely specify the interpretation of $c_i$ even when considered on its own, while adding as little information as possible to do so.
2.2 Task Definition: Fact-checking LLMs
Recall our setting where an LLM has generated a response $y$ to input prompt $x$, and $y$ has associated claims $c_1, \ldots, c_n$. For each $c_i$, we have a corresponding set of evidence documents, $E_i$, that are referenced to assess the accuracy of $c_i$. Furthermore, we have access to a gold standard of human-annotated labels for each atomic fact, represented as $l_1, \ldots, l_n$, where each $l_i$ can be either SUPPORTED or NOT_SUPPORTED. Our goal is to make judgments about the supportedness of the $c_i$, which requires appropriately decontextualizing each fact.
We augment each atomic claim $c_i$ to a corresponding molecular claim $m_i$ as described in Section 3, resulting in a set of facts $m_1, \ldots, m_n$. We represent the model’s factuality judgment prediction as a set of supported documents $\hat{E}_i \subseteq E_i$. In other words, the prediction of the verifier $f$ is accurate when it supports the molecular claim $m_i$ with the same evidence documents as humans.
3 Method: Producing Molecular Facts
We use a two-step process to refine an atomic fact into a molecular fact using gpt-4-turbo-2024-04-09 Achiam et al. (2023). Our methodology makes the assumption that the ambiguity is typically restricted to a single entity in the claim. This is the case for the datasets we study in this work, described in Section 4.5.
Stage 1: Identifying Ambiguity
We prompt the LLM to identify the primary subject of the claim and to assess potential ambiguities based on its parametric knowledge: does the model know of multiple entities with this name? This step identifies the main subject $s_i$ of the claim $c_i$ and provides a disambiguation criterion $a_i$ for the subject $s_i$. The disambiguation criterion can be ‘None’ when there is no ambiguity, or a type of criterion such as profession, birthyear, or location when disambiguation is required.
For example, if the claim is about ‘Charles Osgood’, with multiple possible referents, $s_i$ is ‘Charles Osgood’, while $a_i$ could be ‘profession’ or ‘birthyear’ to clarify which Charles Osgood is being referred to. Conversely, if the claim concerns the unambiguous ‘Julius Robert Oppenheimer’, $s_i$ is ‘Julius Robert Oppenheimer’, and $a_i$ is ‘None’.
Stage 2: Molecular Facts Generation
We then prompt the LLM to disambiguate the subject $s_i$ within the claim $c_i$, harnessing both the identified disambiguation criterion $a_i$ and the claim’s context $y$. The output of this stage is a molecular fact $m_i$ for the atomic claim $c_i$.
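The two stages can be sketched as two chained LLM calls. This is an illustrative outline only: the prompt wording, the `subject | criterion` reply format, and the names `identify_ambiguity`, `generate_molecular_fact`, and `fake_llm` are assumptions made for this sketch and are not the prompts used in the released code. The LLM is passed in as a plain callable so that the sketch runs without any API access.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: a function that maps a prompt to a completion.
LLM = Callable[[str], str]


def identify_ambiguity(claim: str, llm: LLM) -> Tuple[str, Optional[str]]:
    """Stage 1: return (subject, criterion); criterion is None if unambiguous."""
    prompt = (
        "Stage 1. Name the main subject of the claim. If you know of several "
        "entities with that name, give a criterion type that tells them apart "
        "(e.g. profession, birthyear, location); otherwise answer None.\n"
        "Answer as: <subject> | <criterion>\n"
        f"Claim: {claim}"
    )
    subject, _, criterion = llm(prompt).partition("|")
    criterion = criterion.strip()
    is_none = criterion.lower() in ("", "none")
    return subject.strip(), (None if is_none else criterion)


def generate_molecular_fact(claim: str, context: str, llm: LLM) -> str:
    """Stage 2: rewrite the claim using the Stage 1 output and the context."""
    subject, criterion = identify_ambiguity(claim, llm)
    prompt = (
        "Stage 2. Rewrite the claim so that it stands alone, adding as little "
        "information as possible.\n"
        f"Subject: {subject}\n"
        f"Disambiguate by: {criterion or 'None'}\n"
        f"Context: {context}\n"
        f"Claim: {claim}"
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Canned replies standing in for a real model."""
    if prompt.startswith("Stage 1"):
        return "Charles Osgood | profession"
    return "  [claim rewritten to identify Charles Osgood by profession]  "


subject, criterion = identify_ambiguity("Charles Osgood wrote a book.", fake_llm)
molecular = generate_molecular_fact(
    "Charles Osgood wrote a book.", "(full LLM response goes here)", fake_llm
)
print(subject, criterion, molecular)
```

Keeping Stage 1 separate makes the disambiguation criterion inspectable before any rewriting happens, which is useful when checking whether an added detail was actually needed.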
3.1 Baselines
We analyze the robustness of fact verification across various systems on the defined criteria of minimality and decontextuality. Outputs for baselines are generated with gpt-4-turbo-2024-04-09.
- ATOMIC: Atomic claims are generated from the LLM’s response using Min et al. (2023).
- SIMPLE-DECONTEXT: Atomic claims are decontextualized with a prompt described in 8 using the LLM’s generated response as context for the atomic claim.
- SAFE-DECONTEXT: Decontextualization of atomic claims is performed using the revision prompt described in Wei et al. (2024).
- MOLECULAR-DECONTEXT: This approach follows the two-stage process described in Section 3 to identify disambiguation criteria and subsequently decontextualize the atomic claim.
4 Experiment: Minimality & Localization
We begin our analysis of decontextualization with a controlled experiment to illustrate problems with error localization due to loss of minimality discussed in Criterion 2 in Section 2.1. Minimality is more difficult to evaluate than decontextuality. Less minimal facts impact error localization and can potentially lead to errors where an ancillary part of the claim leads to the whole claim being judged as wrong Kamoi et al. (2023a). However, precisely measuring the harms of this is not easy without taking into account the downstream uses of error localization systems such as answer refinement Xu et al. (2023) or fine-tuning Wu et al. (2024); Roit et al. (2023).
To measure the effects in a controlled way, we design a method for synthetic evidence generation as summarized in Figure 2. Our goal is to illustrate when decontextualized atomic facts actually contain multiple facts in a way that could impact error localization. We then study how many of these cases truly show this problem. To study the impact of information addition, we consider two baselines SIMPLE-DECONTEXT and SAFE-DECONTEXT which respectively have less and more restrictive prompts for including new information from the context to revise an atomic claim.
4.1 Controlled Dataset Construction
We now detail the dataset construction process as illustrated in Figure 2. We take a dataset of 812 claims from the Factcheck-Bench dataset Wang et al. (2024) which consists of long form ChatGPT responses with human-annotated factuality labels.
Step 1: Extract Atomic Facts: For each response $y$, we extract atomic facts $c_1, \ldots, c_n$ using the method of Min et al. (2023).
Step 2: Decontextualization: We perform decontextualization of the extracted atomic facts using SIMPLE-DECONTEXT and SAFE-DECONTEXT. Let the decontextualization for claim $c_i$ be denoted as $\tilde{c}_i$. We refer to the $c_i$ that $\tilde{c}_i$ was created from as its core atomic fact; however, note that $\tilde{c}_i$ might support other facts as well.
Step 3: Identifying Claims with Multiple Atomic Facts: We identify decontextualized claims that entail information of more than one atomic fact. We use the entailment model from Liu et al. (2022) to determine, for each atomic fact $c_j$ of the response, whether $\tilde{c}_i$ entails $c_j$; is each $c_j$ supported by $\tilde{c}_i$? We retain cases where $\tilde{c}_i$ entails its core atomic fact $c_i$ and where it also entails some $c_j$ with $j \neq i$; that is, at least two atomic facts are supported by $\tilde{c}_i$. For example, in Figure 2, the claim ($\tilde{c}_i$), ‘The “Blackpink in Your Area” compilation album was released in 2018’, is a decontextualized claim derived from the core atomic claim ($c_i$), ‘The album was released in 2018.’ The decontextualized claim ($\tilde{c}_i$) entails the core atomic fact ($c_i$) and an additional atomic fact ($c_j$), ‘“Blackpink in Your Area” is a compilation album’. Let $S$ denote this filtered set.
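The Step 3 filter can be sketched as follows. The function names are hypothetical, and `toy_entails` is a crude content-word check used only so the example runs; the experiment itself uses the entailment model of Liu et al. (2022).

```python
import re
from typing import Callable, Dict, List, Sequence

STOPWORDS = {"the", "is", "a", "was"}


def content_words(text: str) -> set:
    """Lowercased word tokens with a few function words removed."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an entailment model: all content words of the
    hypothesis must occur in the premise."""
    return content_words(hypothesis) <= content_words(premise)


def filter_multi_fact(
    atomic_facts: Sequence[str],
    decontextualized: Sequence[str],
    entails: Callable[[str, str], bool],
) -> Dict[int, List[int]]:
    """Map each retained claim index i to the auxiliary fact indices it entails.

    Claim i is retained when its rewrite entails its own core atomic fact
    and at least one other atomic fact of the same response.
    """
    kept: Dict[int, List[int]] = {}
    for i, rewritten in enumerate(decontextualized):
        hits = [j for j, fact in enumerate(atomic_facts) if entails(rewritten, fact)]
        if i in hits and len(hits) >= 2:
            kept[i] = [j for j in hits if j != i]
    return kept


facts = [
    "The album was released in 2018.",
    '"Blackpink in Your Area" is a compilation album',
]
rewrites = [
    'The "Blackpink in Your Area" compilation album was released in 2018',
    '"Blackpink in Your Area" is a compilation album',
]
kept = filter_multi_fact(facts, rewrites, toy_entails)
print(kept)
```

Only the first rewrite is retained here, because it entails both its core fact and the auxiliary fact about the album being a compilation.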
Step 4: Generating Evidence for Partial Support: Whenever multiple atomic facts are merged, we could theoretically see a loss in localization capability from a model: if one fact is not supported, the entire claim will be determined to be not supported. To demonstrate this possibility, we now generate evidence that partially supports our multi-fact claims. As an example, in Figure 2, our goal in step 4 is to generate a paragraph that should not include details about “Blackpink in Your Area” being a compilation album. Then, if the statement ‘The album was released in 2018’ is decontextualized to include information about it being a compilation album, this paragraph will enable us to identify this: the evidence will no longer support the decontextualized fact, reflecting a failure of error localization.
By construction of $S$, each $\tilde{c}_i \in S$ entails at least two facts: its core atomic fact and one or more auxiliary atomic facts. From this set of auxiliary atomic facts, we sample a banned fact $b_i$. For each $\tilde{c}_i$, we sample a set of key facts $K_i$ such that $K_i$ contains all the atomic facts of the response except $b_i$. We then prompt the LLM to generate an evidence article $e_i$ supporting the facts $K_i$ and not supporting the fact $b_i$. Each of these evidence articles ideally should support all the key facts and not support the banned fact.
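A short sketch of the sampling in Step 4 is given below. The function name `build_evidence_request` and the prompt wording are assumptions for illustration; the actual generation prompt is not reproduced here.

```python
import random
from typing import List, Sequence, Tuple


def build_evidence_request(
    atomic_facts: Sequence[str],
    auxiliary_indices: Sequence[int],
    rng: random.Random,
) -> Tuple[str, List[str], str]:
    """Sample a banned fact from the auxiliary facts and build a prompt.

    Returns (banned_fact, key_facts, prompt). The key facts are all atomic
    facts of the response except the banned one, so they include the core fact.
    """
    banned_index = rng.choice(list(auxiliary_indices))
    banned_fact = atomic_facts[banned_index]
    key_facts = [f for j, f in enumerate(atomic_facts) if j != banned_index]
    prompt = (
        "Write a short evidence article that supports every key fact below "
        "and gives no support for the banned fact.\n"
        "Key facts:\n"
        + "\n".join(f"- {fact}" for fact in key_facts)
        + f"\nBanned fact:\n- {banned_fact}"
    )
    return banned_fact, key_facts, prompt


facts = [
    "The album was released in 2018.",
    '"Blackpink in Your Area" is a compilation album',
]
banned, key_facts, prompt = build_evidence_request(facts, [1], random.Random(0))
print(prompt)
```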
4.2 Evaluation Criteria
We evaluate the impacts of loss of minimality on the recall of fact-checking. We measure the percentage of cases that change their label from SUPPORTED to NOT_SUPPORTED after decontextualization on the set $S$. We employ the roberta-large model from AlignScore Zha et al. (2023) as our verification function $f$. (Footnote: We conducted preliminary analysis with GPT-4 as well, and found it gave very similar results.) Using $f$, we identify cases where the core key fact is SUPPORTED by the generated evidence while the decontextualization and banned fact are NOT_SUPPORTED. We call this set auto non-minimal.
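The auto non-minimal check reduces to three verifier calls per example. In the sketch below, `toy_supports` is a word-overlap placeholder for the AlignScore-based verifier, the evidence strings are invented, and the helper names are hypothetical.

```python
import re
from typing import Callable, NamedTuple, Sequence

STOPWORDS = {"the", "is", "a", "was"}


class Example(NamedTuple):
    evidence: str          # generated evidence article
    core_fact: str         # core atomic fact
    decontextualized: str  # rewritten claim
    banned_fact: str       # auxiliary fact the evidence should not support


def toy_supports(evidence: str, claim: str) -> bool:
    """Toy verifier: all content words of the claim occur in the evidence."""
    def words(text: str) -> set:
        return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}
    return words(claim) <= words(evidence)


def is_auto_non_minimal(ex: Example, supports: Callable[[str, str], bool]) -> bool:
    """Core fact is supported, but the rewrite and the banned fact are not."""
    return (
        supports(ex.evidence, ex.core_fact)
        and not supports(ex.evidence, ex.decontextualized)
        and not supports(ex.evidence, ex.banned_fact)
    )


def auto_non_minimal_rate(
    examples: Sequence[Example], supports: Callable[[str, str], bool]
) -> float:
    """Fraction of examples flagged as auto non-minimal."""
    if not examples:
        return 0.0
    return sum(is_auto_non_minimal(ex, supports) for ex in examples) / len(examples)


core = "The album was released in 2018."
rewrite = 'The "Blackpink in Your Area" compilation album was released in 2018'
banned = '"Blackpink in Your Area" is a compilation album'
examples = [
    # Evidence omits the compilation detail, so the rewrite loses support.
    Example("The album was released in 2018 by the label.", core, rewrite, banned),
    # Evidence states the compilation detail, so nothing is flagged.
    Example(rewrite, core, rewrite, banned),
]
rate = auto_non_minimal_rate(examples, toy_supports)
print(rate)
```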
Table 1: Fraction of claims flagged as potential non-minimal and auto non-minimal for each baseline.

Baseline | Potential Non-minimal | Auto Non-minimal |
---|---|---|
SAFE-DECONTEXT | 8.49% | 3.94% |
SIMPLE-DECONTEXT | 23.39% | 13.42% |

4.3 Results
Table 1 shows the fraction of claims which are included in the set $S$, which yields 8.49% for SAFE-DECONTEXT and 23.39% for SIMPLE-DECONTEXT. We refer to these claims as potential non-minimal claims: they have passed the checks in our pipeline and contain multiple atomic facts. Next we apply the verification function $f$ to identify auto non-minimal claims, and find that they occur at a rate of 3.94% to 13.42% (Table 1).

Table 2: Human judgments of minimality on the auto non-minimal claims.

Category | Minimal | Non-minimal |
---|---|---|
SAFE-DECONTEXT | 56.2% | 43.8% |
SIMPLE-DECONTEXT | 27.5% | 72.5% |

Table 3: Results on the ambiguous biographies dataset.

Baseline | Accuracy (Overall) | Accuracy (SUPPORTED) | Accuracy (NOT_SUPPORTED) | Modification Rate | Avg Length (# of words) |
---|---|---|---|---|---|
ATOMIC | 68.7% | 77.5% | 22.4% | - | 7.61 ± 3.03 |
SIMPLE-DECONTEXT | 76.2% | 84.3% | 33.6% | 99.5% | 15.55 ± 5.65 |
SAFE-DECONTEXT | 73.4% | 81.3% | 31.9% | 72.6% | 9.86 ± 4.38 |
MOLECULAR-DECONTEXT | 74.7% | 81.5% | 38.8% | 96.8% | 14.96 ± 5.6 |

Table 4: Error breakdown by human label, predicted label, and evidence matching type.

Baseline | Human: SUPPORTED, Pred: SUPPORTED (Multi-Evidence matched) | Human: SUPPORTED, Pred: SUPPORTED (Single-Evidence, Wrong Entity) | Human: SUPPORTED, Pred: NOT_SUPPORTED (No Evidence matched) | Human: NOT_SUPPORTED, Pred: SUPPORTED (Single/Multiple Evidence matched) | Overall |
---|---|---|---|---|---|
ATOMIC | 16.2% | 0.8% | 1.8% | 12.4% | 31.1% |
SIMPLE-DECONTEXT | 7.9% | 1.5% | 3.9% | 10.6% | 23.8% |
SAFE-DECONTEXT | 12.0% | 1.0% | 2.8% | 10.9% | 26.6% |
MOLECULAR-DECONTEXT | 9.2% | 1.5% | 4.8% | 9.8% | 25.3% |

4.4 Human Evaluation
Susceptibility to Error Localization
We perform human evaluation on the auto non-minimal claims in Table 1. First, we categorize these into human judgments of whether a claim in this subset is minimal or not in Table 2. We categorize a decontextualization as minimal based on the criteria outlined in Section 2.1. This annotation is performed by the authors of the paper. We find that for SAFE-DECONTEXT, 43.8% of these cases are truly non-minimal in our judgment, which represents 1.7% of the full dataset. For the SIMPLE-DECONTEXT baseline, we find that a staggering 72.5% of the auto non-minimal subset represents truly non-minimal claims. This represents 9.6% of the full dataset. We note that the remaining fraction of decontextualization cases not identified by the auto methods are those that entail more than one atomic fact, where the added information is necessary to make the atomic claim standalone.
Decontextualization and Loss of Minimality
We highlight that adding information to a claim does not always make it less entailed by the evidence. In fact, in many cases information addition makes the sentence more specific. This is evident from Table 2, which shows that automatically flagged cases for non-minimality have a large percentage of minimal claims after human evaluation. For instance, “All taxes must be paid by April 15” → “In the US, all taxes must be paid by April 15” is a necessary addition for claim specificity.
4.5 Conclusion: Problem of Non-minimality
We find through our controlled experiment and human evaluation that decontextualization can lead to non-minimal cases for between 1.7% and 9.6% of decontextualizations. These cases could cause error localization issues due to too much information added to the claims. In absolute terms, this is a low fraction for the baseline SAFE-DECONTEXT. However, we note that a biography from FActScore Min et al. (2023) contains dozens of atomic facts, meaning that in a single response from an LLM, there can easily be a handful of facts posing localization problems. Given the increasing adoption of the decomposition and decontextualization pipeline for automatic fact verification systems, we argue that multiple localization errors per response are cause to re-examine that pipeline. Next, we analyze tradeoffs between minimality and decontextuality for fact checking of ambiguous biographies.
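To make the per-response impact concrete, the following back-of-the-envelope calculation uses the setting from the introduction (5 sentences with 3 claims each, and an average non-minimal rate of 6%). It assumes that claims are affected independently, which is a simplification introduced here and not something measured in the experiments.

```python
def expected_flagged(rate: float, n_claims: int) -> float:
    """Expected number of non-minimal claims in a response."""
    return rate * n_claims


def prob_at_least_one(rate: float, n_claims: int) -> float:
    """Probability that a response has at least one non-minimal claim,
    assuming each claim is affected independently with the same rate."""
    return 1.0 - (1.0 - rate) ** n_claims


n_claims = 5 * 3  # 5 sentences with 3 claims each
for rate in (0.017, 0.06, 0.096):
    print(
        f"rate={rate:.3f}  expected={expected_flagged(rate, n_claims):.2f}  "
        f"P(at least one)={prob_at_least_one(rate, n_claims):.2f}"
    )
```

Under this independence assumption, a 6% rate over 15 claims gives a probability of about 0.60 that a response contains at least one affected claim, which is what makes a small per-claim rate matter at the response level.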
5 Experiment: Ambiguous Biographies
We now analyze to what extent our molecular facts add the correct information to decontextualize on an existing dataset with ambiguous entity references.
Dataset
We use the ambiguous biographies dataset introduced in Chiang and yi Lee (2024), which comprises biographies generated by LLMs for multiple entities that share similar names, such as Dick Hanley (swimmer) and Dick Hanley (footballer). In this dataset we represent the biographies generated by the LLMs as $y$, and $c_1, \ldots, c_n$ correspond to atomic claims generated using the methodology outlined in Min et al. (2023). For this setting, we define each claim $c_i$ to have a subject $s_i$, which is ambiguous due to the nature of the dataset. The dataset provides a set of evidence documents $E_i$ sourced from the Wikipedia disambiguation page of the subject, for subjects sharing similar names with $s_i$. This dataset is suitable for evaluating decontextuality as it has two properties: (i) atomic claims that require decontextualization (such as entity specification, noun completion), and (ii) multiple entities with the same name that require additional disambiguation such as specifying location, occupation, or time period.
Our goal is to verify the claims $c_i$ with the set of documents $E_i$ using $f$. We randomly sample 726 claims from the human-annotated set for this study which belong to either SUPPORTED or NOT_SUPPORTED categories. For each claim $c_i$ we construct a revision $m_i$ using the methods and baselines described in Section 3 and compare the prediction with human labels.
Evaluation Criteria
We evaluate our judgment of a claim on two axes: (1) whether it aligns with the human annotation of SUPPORTED or NOT_SUPPORTED, and (2) whether it is supported by the correct evidence. For each evidence document associated with the claim, we compute $f(m_i, e_j)$, where $m_i$ is the claim processed by the particular baseline and $e_j$ represents the $j$-th document related to the ambiguous subject of the claim. We consider the judgment to be correct only if the prediction of the claim matches the human label and the prediction is supported by the correct entity’s evidence document.
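One way to read this criterion is sketched below. It treats a SUPPORTED prediction as correct only when the correct entity's document is the sole supporting document, which matches the multi-evidence and wrong-entity error categories in the error breakdown; this strict reading is an assumption of the sketch. The `judge` function, the word-overlap `toy_verify`, and the two document strings are hypothetical.

```python
import re
from typing import Callable, Sequence


def toy_verify(document: str, claim: str) -> bool:
    """Toy verifier: every word of the claim occurs in the document."""
    doc_words = set(re.findall(r"\w+", document.lower()))
    return all(w in doc_words for w in re.findall(r"\w+", claim.lower()))


def judge(
    claim: str,
    documents: Sequence[str],
    correct_index: int,
    human_label: str,
    verify: Callable[[str, str], bool],
) -> bool:
    """Return True if the prediction counts as correct.

    The predicted label is SUPPORTED when at least one document supports the
    claim. A SUPPORTED prediction is correct only if the human label agrees
    and the correct entity's document is the only supporting one.
    """
    supporting = [j for j, doc in enumerate(documents) if verify(doc, claim)]
    predicted = "SUPPORTED" if supporting else "NOT_SUPPORTED"
    if predicted != human_label:
        return False
    if predicted == "SUPPORTED":
        return supporting == [correct_index]
    return True


# Invented one-line stand-ins for the two entities' evidence pages.
documents = ["Dick Hanley swimmer", "Dick Hanley footballer"]
print(judge("Dick Hanley swimmer", documents, 0, "SUPPORTED", toy_verify))
print(judge("Dick Hanley", documents, 0, "SUPPORTED", toy_verify))
```

The second call returns False because the underspecified claim matches both entities' documents, which is the failure mode that atomic claims exhibit on this dataset.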
Table 5: Human evaluation of minimality on 25 randomly sampled claims.

Baseline | Minimal | Non-Minimal | Ambig. |
---|---|---|---|
SIMPLE | 16.0% | 56.0% | 28.0% |
SAFE | 24.0% | 0.0% | 76.0% |
MOLECULAR | 52.0% | 24.0% | 24.0% |

6 Results: Ambiguous Biographies
Table 3 presents the results of this experiment. All decontextualization methods yield higher accuracy than atomic claims, across all subsets. We see that the MOLECULAR and SIMPLE decontextualization methods have a higher proclivity to modify the atomic claims than the SAFE decontextualization baseline. Consequently, the average sentence lengths of the former methods are also larger than for the SAFE baseline. Higher degrees of modification generally lead to higher accuracy. All three methods are on a Pareto frontier of length versus accuracy.
However, accuracy under the verification function $f$ does not incorporate minimality. We investigate the minimality of the baselines by performing a human evaluation of 25 randomly sampled claims in Table 5. We see that the baseline SIMPLE-DECONTEXT has a large fraction of non-minimal and ambiguous claims as compared to MOLECULAR-DECONTEXT. Analysis in Section 4.4 shows that SAFE-DECONTEXT is more minimal than SIMPLE-DECONTEXT; however, it struggles with ambiguity.
Overall, we observe that molecular claims strike a balance by maintaining minimality with ambiguity removal and improving accuracy. They are significantly more minimal than SIMPLE-DECONTEXT and more performant in ambiguous generations than SAFE-DECONTEXT.
Error breakdown
To analyze the nature of errors encountered, we detail a case-wise error distribution in Table 4. Specifically, we study how the various baselines mispredict the label as SUPPORTED or NOT_SUPPORTED in comparison to human annotation. Note that due to the ambiguous nature of this dataset, claims may be erroneously validated by several distracting pieces of evidence. Therefore, we further partition the error analysis table to reflect the model’s prediction on (i) Single/Multi/No Evidence: whether a claim is supported by single, multiple, or no pieces of evidence, and (ii) Correct/Wrong Entity: whether the set of supporting evidence contains the accurate evidence with which the claim ought to be aligned. Overall, all decontextualization methods show a lower error rate than atomic claims.
Table 6: Information overlap between pairs of baselines.

Baseline Pair | Overlap |
---|---|
ATOM & SIMPLE-DECONTEXT | 7% |
ATOM & SAFE-DECONTEXT | 44% |
ATOM & MOLECULAR-DECONTEXT | 15% |
SIMPLE-DECONTEXT & SAFE-DECONTEXT | 27% |
SIMPLE-DECONTEXT & MOLECULAR-DECONTEXT | 36% |
MOLECULAR-DECONTEXT & SAFE-DECONTEXT | 32% |

Information Overlap
We perform an information overlap analysis, shown in Table 6, using the model from Liu et al. (2022) to check bidirectional entailment: we measure the fraction of cases where the information is equivalent between two baselines Gunjal and Durrett (2023). We find that in a large fraction of cases each baseline adds different information to modify the atomic claim. SAFE-DECONTEXT makes the fewest modifications but suffers from ambiguity, while SIMPLE-DECONTEXT makes the most modifications at the cost of minimality.
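The overlap measure can be sketched as the fraction of claim pairs that entail each other in both directions. The function `overlap_rate` is a hypothetical name, `toy_entails` is a word-overlap placeholder for the entailment model, and the example strings are invented.

```python
import re
from typing import Callable, Sequence


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an entailment model: every word of the hypothesis
    occurs in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    return all(w in premise_words for w in re.findall(r"\w+", hypothesis.lower()))


def overlap_rate(
    claims_a: Sequence[str],
    claims_b: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of aligned claim pairs that entail each other both ways."""
    if len(claims_a) != len(claims_b):
        raise ValueError("both baselines must rewrite the same claims")
    if not claims_a:
        return 0.0
    equivalent = sum(
        1 for a, b in zip(claims_a, claims_b) if entails(a, b) and entails(b, a)
    )
    return equivalent / len(claims_a)


baseline_a = [
    "Ann Jansson the Swedish footballer won a medal",
    "George Town is a city",
]
baseline_b = [
    "The Swedish footballer Ann Jansson won a medal",
    "George Town is a city in the Cayman Islands",
]
rate = overlap_rate(baseline_a, baseline_b, toy_entails)
print(rate)
```

Requiring entailment in both directions means that a rewrite which merely adds detail to another rewrite does not count as overlapping, so the measure reflects equivalent information rather than containment.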
7 Related Work
Recent research in factuality verification of LLM generations advocates decomposing LLM generations into atomic facts or subclaims and verifying each against retrieved evidence Min et al. (2023); Kamoi et al. (2023b); Fabbri et al. (2022). End-to-end pipelines for factuality verification have been proposed, involving steps such as claim extraction, revision, determining checkworthiness, evidence retrieval, and verification Wang et al. (2024); Chern et al. (2023); Wei et al. (2024); Chen et al. (2024). These papers often evaluate on recently-released datasets of errors in generations Liu et al. (2023a); Malaviya et al. (2024); Chen et al. (2023a). Our work comments on the decontextualization step frequently used in these pipelines.
Our work fits into a broader ecosystem of techniques in this area. Gao et al. (2023b) enable LLMs to generate text with citations. For faithful LLM generations, Gao et al. (2023a) use evidence retrieval for revision, and He et al. (2022) utilize chain-of-thought coupled with retrieval for faithful explanations. Fine-tuned systems, such as that by Zha et al. (2023), predict alignment scores for verification, while Tang et al. (2024) propose LLM-AggreFact for sentence-level factuality labels. Wanner et al. (2024) find that evaluation metrics for fact verification are sensitive to the claim decomposition method used.
Prior work on decontextualization has investigated basic notions like anaphora resolution Choi et al. (2021), question answering frameworks Newman et al. (2023), and extract-then-decontextualize methods for summarization Potluri et al. (2023). In fact verification, atomic claims are made standalone before evidence retrieval via decontextualization Wang et al. (2024) or claim revision Wei et al. (2024). Decontextualization is also used to resolve ambiguity Zhang and Choi (2021); Lee et al. (2024); our work shares this focus.
8 Conclusion
We introduce molecular facts and the desiderata of decontextualization in LLM fact verification. We define the criteria of decontextuality and minimality in this context. Through a controlled experiment, we show that localization errors due to loss of minimality under decontextualization are sensitive to the method used. We propose a method for generating “molecular facts” and find that they improve fact verification precision for claims from generations about ambiguous entities. We show that molecular facts strike a balance between maintaining minimality and accuracy of fact verification.
Limitations
Scope
We illustrate the phenomenon of ambiguity in atomic claims; however, our main evaluation of molecular facts is in the domain of English-language biographies. This is due to the availability of the dataset, Wikipedia evidence, and the prevalence of biography benchmarks in recent work. Conceptually, the ambiguity in the subject or predicate of the claim can be extended to other realistic datasets, but we leave that exploration to future work. Relatedly, we focus on entity ambiguity for illustration of our method. There may be other types of ambiguities that molecular fact generation can address in other contexts and other datasets.
Furthermore, we focus our experiments on high-performing LLMs in this work. The extension of decontextualization and molecular fact generation to smaller, open-source models and the improvement in this regime is a good subject for further study.
Finally, we believe our approach should be evaluated fully end-to-end in an LLM pipeline that generates responses and then verifies their factuality. However, despite substantial research in these directions, we are not aware of an off-the-shelf experimental pipeline that is usable for this setting.
Decomposition Quality
We do not consider the errors introduced due to poor decomposition of atomic facts in this work. It is possible that some of these errors are resolved due to decontextualization or disambiguation implicitly, but we do not make any specific claims about this.
Coverage of Domains and Languages
The datasets utilized for ambiguous biographies are limited to English-language claims focused on English-centric concepts within Wikipedia. Similarly, the synthetic data generation experiment for minimality analysis is confined to English language outputs and relies on GPT-4’s parametric knowledge, which may limit the breadth of topics and domains covered.
Acknowledgments
We thank Jessy Li for comments and feedback on the initial draft of this work. This work was supported by NSF CAREER Award IIS-2145280 and the NSF AI Institute for Foundations of Machine Learning (IFML). This material is also based on research that is in part supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Chen et al. (2024) Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. 2024. Complex claim verification with evidence retrieved in the wild. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
- Chen et al. (2023a) Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023a. FELM: Benchmarking Factuality Evaluation of Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Chen et al. (2023b) Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, and Tal Schuster. 2023b. PropSegmEnt: A Large-Scale Corpus for Proposition-level Segmentation and Entailment Recognition. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8874–8893, Toronto, Canada. Association for Computational Linguistics.
- Chern et al. (2023) I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, et al. 2023. FacTool: Factuality Detection in Generative AI–A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. arXiv preprint arXiv:2307.13528.
- Chiang and yi Lee (2024) Cheng-Han Chiang and Hung-yi Lee. 2024. Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations. arXiv preprint arXiv:2402.05629.
- Choi et al. (2021) Eunsol Choi, Jennimaria Palomaki, Matthew Lamm, Tom Kwiatkowski, Dipanjan Das, and Michael Collins. 2021. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics, 9:447–461.
- Fabbri et al. (2022) Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, Seattle, United States. Association for Computational Linguistics.
- Falke et al. (2019) Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.
- Gao et al. (2023a) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023a. RARR: Researching and Revising What Language Models Say, Using Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.
- Gao et al. (2023b) Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023b. Enabling Large Language Models to Generate Text with Citations. In Empirical Methods in Natural Language Processing (EMNLP).
- Goyal and Durrett (2021) Tanya Goyal and Greg Durrett. 2021. Annotating and modeling fine-grained factuality in summarization. In Proceedings of NAACL.
- Gunjal and Durrett (2023) Anisha Gunjal and Greg Durrett. 2023. Drafting Event Schemas using Language Models. arXiv preprint arXiv:2305.14847.
- He et al. (2022) Hangfeng He, Hongming Zhang, and Dan Roth. 2022. Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303.
- Ji et al. (2022) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys.
- Kamoi et al. (2023a) Ryo Kamoi, Tanya Goyal, and Greg Durrett. 2023a. Shortcomings of question answering based factuality frameworks for error localization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 132–146.
- Kamoi et al. (2023b) Ryo Kamoi, Tanya Goyal, Juan Rodriguez, and Greg Durrett. 2023b. WiCE: Real-World Entailment for Claims in Wikipedia. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7561–7583, Singapore. Association for Computational Linguistics.
- Lee et al. (2024) Yoonsang Lee, Xi Ye, and Eunsol Choi. 2024. Ambigdocs: Reasoning across documents on different entities under the same name. arXiv 2404.12447.
- Li et al. (2016) Junyi Jessy Li, Bridget O’Daniel, Yi Wu, Wenli Zhao, and Ani Nenkova. 2016. Improving the annotation of sentence specificity. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3921–3927.
- Liu et al. (2022) Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. 2022. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826–6847, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Liu et al. (2023a) Nelson Liu, Tianyi Zhang, and Percy Liang. 2023a. Evaluating Verifiability in Generative Search Engines. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore. Association for Computational Linguistics.
- Liu et al. (2023b) Yixin Liu, Alex Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir Radev. 2023b. Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4140–4170, Toronto, Canada. Association for Computational Linguistics.
- Louis and Nenkova (2012) Annie Louis and Ani Nenkova. 2012. A corpus of general and specific sentences from news. In LREC, volume 1818, page 10. Citeseer.
- Malaviya et al. (2024) Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. 2024. ExpertQA: Expert-curated questions and attributed answers. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
- Nenkova and Passonneau (2004) Ani Nenkova and Rebecca J Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the human language technology conference of the north american chapter of the association for computational linguistics: Hlt-naacl 2004, pages 145–152.
- Newman et al. (2023) Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, and Kyle Lo. 2023. A question answering framework for decontextualizing user-facing snippets from scientific documents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3194–3212.
- Pezzelle (2023) Sandro Pezzelle. 2023. Dealing with semantic underspecification in multimodal nlp. arXiv preprint arXiv:2306.05240.
- Potluri et al. (2023) Abhilash Potluri, Fangyuan Xu, and Eunsol Choi. 2023. Concise answers to complex questions: Summarization of long-form answers. arXiv preprint arXiv:2305.19271.
- Rashkin et al. (2021) Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and D. Reitter. 2021. Measuring Attribution in Natural Language Generation Models. Computational Linguistics, 49:777–840.
- Roit et al. (2023) Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Leonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos Garea, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, and Idan Szpektor. 2023. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6252–6272, Toronto, Canada. Association for Computational Linguistics.
- Schilder (1998) Frank Schilder. 1998. An underspecified segmented discourse representation theory (usdrt). In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, pages 1188–1192.
- Tang et al. (2024) Liyan Tang, Philippe Laban, and Greg Durrett. 2024. MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. arXiv preprint arXiv:2404.10774.
- Wang et al. (2024) Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, and Preslav Nakov. 2024. Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers.
- Wanner et al. (2024) Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, and Benjamin Van Durme. 2024. A Closer Look at Claim Decomposition. arXiv preprint arXiv:2403.11903.
- Wei et al. (2024) Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. 2024. Long-form factuality in large language models. arXiv preprint arXiv:2403.18802.
- Wu et al. (2024) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2024. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36.
- Xu et al. (2023) Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, and Markus Freitag. 2023. Pinpoint, not criticize: Refining large language models via fine-grained actionable feedback. arXiv preprint arXiv:2311.09336.
- Zha et al. (2023) Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics.
- Zhang and Choi (2021) Michael J.Q. Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Zhang et al. (2024) Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. 2024. How Language Model Hallucinations Can Snowball. In Forty-first International Conference on Machine Learning.
- Zhang and Bansal (2021) Shiyue Zhang and Mohit Bansal. 2021. Finding a Balanced Degree of Automation for Summary Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6617–6632, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
Appendix A Prompts
We give details on all the prompts used throughout this work.
Decontextuality Experiment Prompts
Minimality Experiment Prompts
The prompt for generating controlled evidence for the minimality experiment is given in Figure 10.
Appendix B Additional Related Work
Decomposition in Text Summarization
Decomposition of responses is also prevalent in the text summarization literature. Nenkova and Passonneau (2004) introduced the Pyramid protocol for summarization evaluation, which extracts weighted Summarization Content Units (SCUs) that represent the importance of various facts present in multiple human-generated summaries of a text. Zhang and Bansal (2021) propose using Semantic Triplet Units (STUs), which are summary content units generated automatically using SRL parsers, to evaluate generated summaries with textual entailment models. Similarly, Liu et al. (2023b) propose Atomic Content Units (ACUs) as a new summarization salience protocol that allows for higher inter-annotator agreement. Chen et al. (2023b) propose using entailment judgments on a set of sentence propositions within a document.
Decontextualization and Specificity
Decontextualization is the process of making sentences stand-alone by resolving missing context while preserving their meaning Choi et al. (2021). A related phenomenon is the notion of specificity. Louis and Nenkova (2012) presented the first corpus of sentences distinguished by whether they are general or specific. Their classification was based on examples and intuition: general sentences are broad statements about a topic that need additional evidence or examples for a reader to understand, whereas specific sentences can stand by themselves. Li et al. (2016) sharpen this definition by grounding the specificity of a sentence in three requirements: (i) it is easy to understand the meaning and identity of the intended references without ambiguity; (ii) the truth of the statement can be assessed based on the sentence itself and general shared knowledge; and (iii) the sentence fully expresses key information about the participants and causes of an event. Another related notion is underspecification in discourse, which is an intentional feature to maintain communication efficiency Schilder (1998). This has been annotated by Li et al. (2016) and highlighted in a multimodal setting by Pezzelle (2023).
Appendix C Human Annotation Criteria for Categorizing the Non-minimal Subset
We describe the criteria for annotating the auto non-minimal subset into minimal vs. non-minimal, as shown in Table 2. For each instance, we compare the original claim, the decontextualization, and the banned fact. We label a case as minimal when either of the following applies: (1) The banned fact is closely related to the atomic fact and is a necessary addition to the atomic claim to make it standalone, i.e., to add context and/or resolve ambiguity. For example, “The album is their first full-length studio album.” is decontextualized to “The album released in 2020 is Blackpink’s first full-length studio album.” and the banned fact is “The album was released in 2020.” The information in the banned fact is a necessary addition to disambiguate “the album” in this case. (2) The banned fact is entailed by the decontextualization, but only because of an entailment error. For example, the decontextualization “Mey Eden, one of the largest bottled water companies in Israel, offers flavored water products.” is erroneously judged to entail the banned fact “Mey Eden offers still water products.”
Appendix D Human Analysis Criteria for Categorizing Minimality and Ambiguity
We describe the criteria for the human analysis of each baseline’s decontextualizations along the axes of minimality and ambiguity, shown in Table 5. We categorize a claim decontextualization as non-minimal when it contains additional information that goes beyond making the sentence stand-alone and can potentially cause loss of error localization. We categorize a claim decontextualization as ambiguous when it lacks clarifications for entities that could refer to different subjects, or does not add enough context to disambiguate the main entity. If neither condition applies, we categorize the decontextualization as minimal.
Appendix E Models, Datasets and Computation Cost
The gpt-4-turbo-2024-04-09 model was employed for running baselines and generating outputs, while the gpt-3.5-turbo model was used for evaluation through FActScore Achiam et al. (2023). For generation experiments, we set the temperature to 0.75. The total cost for generating decontextualizations and evaluating the ambiguous biography experiment was approximately $120.
In the minimality experiment, gpt-3.5-turbo was used to extract atomic facts, and gpt-4-turbo-2024-04-09 was used for decontextualization and generation tasks. This resulted in a total cost of around $100. We use an NVIDIA A40 GPU for evaluation using AlignScore Zha et al. (2023) and entailment computation using WANLI Liu et al. (2022).
We use ChatGPT for improving writing formatting and generating boilerplate code for figure generation in this paper.
Appendix F Controlled Experiment on Minimality Generation Details
Filtering Criteria applied in Step 3
Before filtering claims that are supported by more than two atomic facts, we discard cases where one atomic fact is a substring of another.
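The substring check above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `has_substring_pair` and the light normalization (lowercasing, stripping the final period) are assumptions made for the example.

```python
def has_substring_pair(atomic_facts):
    """Return True if any atomic fact is contained in a different atomic fact."""
    # Light normalization (an assumption, not from the paper):
    # ignore case, surrounding whitespace, and a trailing period.
    normalized = [fact.strip().rstrip(".").lower() for fact in atomic_facts]
    for i, fact in enumerate(normalized):
        for j, other in enumerate(normalized):
            # Compare every ordered pair of distinct positions.
            if i != j and fact in other:
                return True
    return False


# Hypothetical usage: a case flagged here would be excluded from the pool.
facts = [
    "The album was released in 2020.",
    "The album was released in 2020 by Blackpink.",
]
flagged = has_substring_pair(facts)
```

Under this sketch, exact duplicates also count as a substring pair, which may or may not match the original filtering.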
Filtering Criteria applied in Step 4
We detail the filtering criteria applied in evidence generation for partial support, described in Section 4.1. After we sample a set of key facts that contains all the atomic facts of the response except the banned fact, we also apply a filter to remove cases where the banned fact is similar to any of the key facts; i.e., we compare the banned fact against each key fact in the set and discard the case if the pair is judged too similar. At the end of Step 4, after we prompt the LLM to generate an evidence article, we also account for generation errors and remove the cases where the banned fact is supported by the generated evidence.
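A rough sketch of the similarity filter between the banned fact and the key facts is given below. The similarity measure and threshold used in the original experiment are not recoverable from this text, so the character-level ratio from Python's `difflib` and the `threshold=0.8` default are stand-in assumptions, as are the function names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def keep_case(banned_fact, key_facts, threshold=0.8):
    """Keep a case only if the banned fact is dissimilar to every key fact.

    The threshold value is hypothetical and chosen for illustration only.
    """
    return all(similarity(banned_fact, key) < threshold for key in key_facts)


# Hypothetical usage with facts from the Blackpink example in Appendix C.
banned = "The album was released in 2020."
keys = ["Blackpink is a South Korean girl group."]
kept = keep_case(banned, keys)
```

The second filter in Step 4 (removing cases where the generated evidence supports the banned fact) would additionally need an entailment model and is not covered by this sketch.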
Appendix G Remaining Challenges
To shed light on the remaining challenges, we focus on one of the most challenging scenarios for decontextualization. In the ambiguous biography dataset from Chiang and yi Lee (2024), we often observe what we call an entity switch point: a claim that draws on information about entity B when the preceding sentences all refer to entity A. This is where decontextualization is crucial: it must recognize that the claim in context does not refer to the correct entity.
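To make the notion of an entity switch point concrete, the sketch below locates it in a sequence of claims. It assumes that each claim already carries a gold entity label, which is a simplification for illustration; the function name `find_entity_switch_point` and the label format are hypothetical.

```python
def find_entity_switch_point(entity_labels):
    """Return the index of the first claim about a different entity than the
    first claim, or None if the passage never switches entities."""
    if not entity_labels:
        return None
    first_entity = entity_labels[0]
    for index, label in enumerate(entity_labels):
        if label != first_entity:
            return index
    return None


# Hypothetical usage: claims 0-2 describe entity A, claim 3 switches to B.
labels = ["A", "A", "A", "B", "B"]
switch_index = find_entity_switch_point(labels)
```

In practice such labels are not given to the verifier; recovering them is exactly what decontextualization has to do at this point.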
Molecular claims recover fastest at the entity-switching point
We investigate the performance of baselines through the lens of ambiguity resolution. Note that these results are reported on baselines tested with gpt-3.5-turbo. We find that the dataset of ambiguous biographies becomes the most confusing at the entity switch point. Figure 4 shows a significant performance drop at the switch across all methods. Basic decontextualization methods (DECONTEXT, SAFE-DECONTEXT) perform the worst, underperforming the ATOMIC baseline at the switch. Molecular claims, which incorporate richer disambiguation information, show relative robustness, improving by 3.5% over the most effective decontextualization approach (SAFE-DECONTEXT).
Gap from human performance
To estimate the upper bound of ideal performance at the entity switch point in Figure 4, we generate molecular claims at the entity switch point with weak human-in-the-loop supervision. We use the prompt shown in Figure 9, which has access to gold disambiguations from Wikipedia about the entities in the passage. Even with weak human supervision, this method performs significantly better than automated decontextualization methods, which draws attention to this limitation of current fact-checking pipelines.