Molecular Facts: Desiderata for Decontextualization
in LLM Fact Verification

Anisha Gunjal    Greg Durrett
The University of Texas at Austin
Abstract

Automatic factuality verification of large language model (LLM) generations is increasingly used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack context to interpret correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontextuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology for generating molecular facts automatically (code: https://github.com/anisha2102/molecular_facts), aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.

1 Introduction

Large language models (LLMs) have emerged as powerful tools for delivering knowledge to users, either via closed-book generation or retrieval-augmented systems. However, these systems may not always produce correct facts Liu et al. (2023a), an instance of the “hallucination” problem Zhang et al. (2024); Ji et al. (2022); Zhang et al. (2023). Recent research has shown the potential of LLMs to identify unfaithful content and enable automatic fact-checking and attribution against sources Falke et al. (2019); Goyal and Durrett (2021); Min et al. (2023); Wang et al. (2024); Chern et al. (2023); Wei et al. (2024); Chen et al. (2023a); Malaviya et al. (2024); Gao et al. (2023b); Tang et al. (2024).

Figure 1: Breaking a paragraph into atomic facts can cause errors in attribution: facts out of context appear to be true when they are not. The right granularity of decontextualization, “molecular facts,” balances contextual grounding with atomicity.

A key step in this process is to break down generated content into individual atomic claims Fabbri et al. (2022); Chen et al. (2023b); Kamoi et al. (2023b); Min et al. (2023). This decomposition allows for retrieval of evidence focused on a particular part of the generated content Gao et al. (2023a); Wang et al. (2024); Chen et al. (2024) and also for error localization, by determining which parts of the content are supported and which are not. However, this step is not straightforward. Wanner et al. (2024) highlight that the effectiveness of automatic factuality verification depends heavily on the strategy used to decompose content into claims. In particular, LLMs have a propensity to incorrectly merge information about similarly named entities Lee et al. (2024), and current evaluation methods struggle to handle these ambiguities in atomic claims Chiang and yi Lee (2024). Figure 1 shows a possible issue: a fact that is “too atomic” can be validated against evidence that doesn’t actually support it.

In this work, we address the problem of how to find minimal yet still unambiguous facts for LLM fact verification. We frame this problem as one of decontextualization: adding context to a sentence to make it stand alone while retaining its original meaning Choi et al. (2021). This process draws on the idea of specificity from discourse Louis and Nenkova (2012), specifically whether sentences can express key information about the participants without ambiguity Li et al. (2016). However, making a claim unambiguous is not enough: when escalating from simple pronoun replacement in atomic facts to elaborations like a Swedish footballer in Figure 1, we must balance the specificity of the fact with how easy it will be to verify. It is not trivial to select the “right” information to elaborate on a claim without compromising the ease of verification.

We define two criteria needed in this fact-checking setting: decontextuality, where the claim should uniquely specify entities, events, and context, and minimality, maintained by avoiding excessive additional information that could complicate verification. We propose a notion of molecular facts, which balances these two criteria: molecular facts should be fully specific while compatible with the maximum number of possible evidence documents. We explore these criteria and our molecular facts in two settings. First, we address the question of how much of a problem non-minimality could be for error localization with standard decontextualization techniques. We devise a synthetic fact-checking experiment where particular nuances of an output generation are unsupported and show that an average of 6% of claims may pose problems for error localization. In a setting with LLM responses of 5 sentences with 3 claims each, this would lead to localization errors in a large fraction of responses. We then evaluate the opposite problem, whether decontextualization is too minimal. We study a dataset of fact-checking with ambiguous entity names presented in Chiang and yi Lee (2024). We show that our method of molecular fact generation balances accuracy under ambiguous entity references with minimality of claims.
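The "large fraction of responses" figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes that each claim is problematic independently with probability 0.06; this independence is a simplifying assumption made here for illustration, not something the experiments establish, and the function name is hypothetical.

```python
# Illustrative arithmetic only. Treating claims as independent is a
# simplifying assumption; it is not a result from the experiments.
def prob_any_problem(per_claim_rate: float, n_claims: int) -> float:
    """Probability that at least one of n_claims poses a localization problem."""
    return 1.0 - (1.0 - per_claim_rate) ** n_claims


n_claims = 5 * 3  # 5 sentences with 3 claims each
p_response = prob_any_problem(0.06, n_claims)
print(f"{p_response:.1%} of responses contain at least one problematic claim")
```

Under this assumption, roughly 60% of such 15-claim responses would contain at least one claim that is hard to localize.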

Our main contributions are: (1) We re-examine the decontextualization process for fact-checking and define molecular claims following the desiderata of decontextuality and minimality. (2) We investigate the loss of minimality due to claim decontextualization and its impacts on error localization. (3) We find that molecular claims are more performant and minimal for long-form generations than existing decontextualization methods.

2 Desiderata for Decontextualization

We propose desiderata to determine the optimal level of decontextualization required for atomic facts. An atomic fact is defined as a discrete unit of information, derived from a broader claim, and variously described in the literature as propositions, subclaims, summary content units, or atomic content units Nenkova and Passonneau (2004); Liu et al. (2023b); Zhang and Bansal (2021); Chen et al. (2023b); Min et al. (2023); Kamoi et al. (2023b).

Although an atomic fact theoretically represents a single conceptual unit, recent NLP work that uses the term does not typically give it a rigorous semantic definition. Wanner et al. (2024) demonstrate a high variation in the number of subclaims generated by different decomposition methods, with the macro-average of subclaims per biography ranging from 20.2 using the method by Kamoi et al. (2023b) to 32.9 with the approach by Chen et al. (2023b). Note that in Figure 1, She was a medallist at the European Athletics Championships in 1986 could be kept as one unit or broken into three facts evaluating her status as a medallist, the venue, and the date.

2.1 Desiderata

Preliminaries

We define $\mathbf{r}$ as a response from a language model to an input prompt $\mathbf{x}$, consisting of a series of claims $(\mathbf{c}_1, \ldots, \mathbf{c}_n)$ to be verified. Claims are extracted through an upstream process of decomposition and potentially filtering for “check-worthiness” (i.e., does the claim present factual content or does it present an opinion?). We describe the prompting in Appendix A.

We assume that in the context of $\mathbf{r}$ and $\mathbf{x}$, a claim $\mathbf{c}_i$ can be fully interpreted with a truth-conditional meaning $I(\mathbf{c}_i \mid \mathbf{x}, \mathbf{r})$. In the terminology of Rashkin et al. (2021) and Choi et al. (2021), $I(\mathbf{c}_i \mid \mathbf{x}, \mathbf{r})$ represents $\mathbf{c}_i$ interpreted in the linguistic context of $\mathbf{x}$ and $\mathbf{r}$.

We can construct a standalone proposition with truth-conditional meaning equivalent to $I$ by being sufficiently specific. For example, the statement in Figure 1 could be completely specified as Ann Jansson, the Swedish footballer born on 6 May 1957 who played for Hammarby IF, won a medal at the European Athletics Championship, the biennial event organized by the European Athletics Association, in 1986.

Decontextualization

Our goal in this work is to produce rewritten molecular claims. Denote by $\mathbf{m}_i$ the rewritten form of $\mathbf{c}_i$, which should have semantics $I$ when interpreted as a standalone proposition. As in Figure 1, this requires adding disambiguating information that could provide information needed to identify an entity (specifying that Jansson is a Swedish footballer), identify an event (specifying that the event happened in 1986), specify a qualification (in the field of biochemistry, …), or more.

Criterion 1 (Decontextuality)

When interpreted as a standalone statement, $\mathbf{m}_i$ must have the truth-conditional meaning $I(\mathbf{c}_i \mid \mathbf{x}, \mathbf{r})$. That is, it should uniquely specify entities, events, and other context such that the claim $\mathbf{c}_i$ is now interpretable.

This criterion is equivalent to Definition 1 from Choi et al. (2021). For the settings we consider, the level of added information needed to specify the meaning of a statement like that in Figure 1 may be higher than in past applications like Choi et al. (2021). It is not sufficient to replace the pronoun she with Ann Jansson; we need to specify Ann Jansson, the Swedish footballer. Similarly, the city George Town could refer to a city in the Cayman Islands or in Malaysia; it must therefore be decontextualized with a descriptor like George Town, a city in the Cayman Islands.

Other work such as question answering frameworks based on clarifying questions can target this information Newman et al. (2023), but may fail to integrate the minimal new information needed, which we describe next.

Minimality

Adding too much information to a claim makes it less minimal. For instance, replacing “Ann Jansson” with “Ann Jansson, a Swedish footballer” requires verifying that a context referring to Ann Jansson is indeed talking about the Swedish footballer. Taken further, the reference “Ann Jansson, the Swedish footballer born on 6 May 1957 who played for Hammarby IF” is clearly suboptimal. It requires verifying Jansson’s birthdate as an additional detail, and crucially, this detail won’t be frequently reported in documents about Ann Jansson.

Define $\mathcal{E}^*(I(\mathbf{c} \mid \mathbf{x}, \mathbf{r}))$ as the set of evidence documents that support the statement $I$ with an oracle understanding of the entities involved. For instance, this would contain a document describing the correct Ann Jansson, even if it did not confirm all the details about her life. Define $\mathcal{E}(\mathbf{m}_i) \subset \mathcal{E}^*$ to be the set of evidence documents that fully support a statement $\mathbf{m}_i$. For instance, in the case of Ann Jansson above, the document would need to specify Jansson’s birthdate if this is contained in $\mathbf{m}_i$.

Criterion 2 (Minimality)

Given a set of statements $\mathcal{M}$ that all decontextualize a claim $\mathbf{c}_i$, we should select $\mathrm{argmax}_{\mathbf{m} \in \mathcal{M}} |\mathcal{E}(\mathbf{m})|$ to maximize the size of the set of supporting evidence documents.

This criterion means that, when selecting distinguishing details for an entity, we should choose those that can typically be inferred from evidence. For instance, “Jason Martin” may be characterized either as a “rugby player” or specifically as a “former player for North Queensland Cowboys.” Since “rugby player” is a more enduring and widely recognized description, yet still specific enough to indicate Jason Martin, it is more likely to be supported by a larger number of documents.
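The selection rule in Criterion 2 can be sketched in a few lines. This is a minimal illustration, not the method used in this work: the `supports` callable is a hypothetical stand-in for a verifier, the documents are invented toy strings, and the "required phrases" table is hand-written purely so the example runs.

```python
from typing import Callable, Sequence


def select_most_minimal(
    candidates: Sequence[str],
    evidence_docs: Sequence[str],
    supports: Callable[[str, str], bool],
) -> str:
    """Pick the candidate fully supported by the most evidence documents.

    `supports(doc, claim)` is a hypothetical verifier; ties are broken in
    favour of the earlier candidate.
    """
    return max(candidates, key=lambda m: sum(supports(d, m) for d in evidence_docs))


# Toy data: invented documents and hand-written required phrases per candidate.
REQUIRED = {
    "Jason Martin, a rugby player": ["jason martin", "rugby"],
    "Jason Martin, a former player for North Queensland Cowboys": [
        "jason martin",
        "north queensland cowboys",
    ],
}
DOCS = [
    "Profile of rugby player Jason Martin.",
    "Jason Martin (rugby): career overview.",
    "Jason Martin, rugby forward, spent time at the North Queensland Cowboys.",
]


def toy_supports(doc: str, claim: str) -> bool:
    """Crude substring check; a real system would use an entailment model."""
    return all(phrase in doc.lower() for phrase in REQUIRED[claim])


best = select_most_minimal(list(REQUIRED), DOCS, toy_supports)
print(best)
```

In this toy setup the broader descriptor is supported by all three documents and the narrower one by only one, so the broader descriptor is selected.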

Past work like Choi et al. (2021) instructs annotators to make minimal edits to statements. However, they do not provide guidance on what criteria should be used to choose from among multiple candidate edits.

Molecular facts

These two criteria suggest two things. First, atomic facts can be “too atomic:” they may need to be decontextualized. However, it is still valuable to have a reasonably minimal fact so it can be supported by many possible evidence documents.

Figure 2: Controlled evidence generation framework for illustrating error localization introduced by decontextualization for atomic fact verification.

Molecular Fact

A molecular fact is a statement $\mathbf{m}_i$ corresponding to claim $\mathbf{c}_i$ that obeys criteria 1 and 2: it should uniquely specify the interpretation of $\mathbf{c}_i$ even when considered on its own, while adding as little information as possible to do so.

2.2 Task Definition: Fact-checking LLMs

Recall our setting where an LLM has generated a response $\mathbf{r}$ to input prompt $\mathbf{x}$, and $\mathbf{r}$ has associated claims $(\mathbf{c}_1, \ldots, \mathbf{c}_n)$. For each $\mathbf{c}_i$, we have a corresponding set of $k$ evidence documents, $D_i = (D_{i,1}, \ldots, D_{i,k})$, that are referenced to assess the accuracy of $\mathbf{c}_i$. Furthermore, we have access to a gold standard of human-annotated labels for each atomic fact, represented as $L = (l_1, \ldots, l_n)$, where each $l_i$ can be either SUPPORTED or NOT_SUPPORTED. Our goal is to make judgments about the supportedness of the $\mathbf{c}_i$, which requires appropriately decontextualizing each fact.

We augment each atomic claim $\mathbf{c}_i$ to a corresponding molecular claim $\mathbf{m}_i$ as described in Section 3, resulting in a set of facts $(\mathbf{m}_1, \ldots, \mathbf{m}_n)$. We represent the model’s factuality judgment prediction as a set of supporting documents, $p_i = \mathrm{Check}(D_{i,j}, \mathbf{m}_i)$ for all $j \in \{1, 2, \ldots, k\}$ in $D_i$. In other words, the prediction of $\mathrm{Check}()$ is accurate when it supports the molecular claim with the same evidence documents as the human annotators.
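The task definition can be made concrete with a small sketch. It assumes a hypothetical `check(document, claim)` callable that returns whether a document supports a claim; the function names and the toy substring checker are illustrative and not part of the released code.

```python
from typing import Callable, List, Sequence, Set

Checker = Callable[[str, str], bool]  # hypothetical: (document, claim) -> supported?


def supporting_docs(docs: Sequence[str], claim: str, check: Checker) -> Set[int]:
    """Indices j of the documents D_{i,j} that the checker judges to support the claim."""
    return {j for j, doc in enumerate(docs) if check(doc, claim)}


def is_accurate(predicted: Set[int], human: Set[int]) -> bool:
    """A prediction is accurate when it matches the human-marked documents.

    An empty set on both sides corresponds to NOT_SUPPORTED.
    """
    return predicted == human


def accuracy(predictions: List[Set[int]], gold: List[Set[int]]) -> float:
    """Fraction of claims whose predicted document set matches the gold set."""
    hits = sum(is_accurate(p, g) for p, g in zip(predictions, gold))
    return hits / len(gold)


def toy_check(doc: str, claim: str) -> bool:
    """Substring matching as a stand-in for an entailment model."""
    return claim.lower() in doc.lower()


docs = ["alpha beta", "beta gamma", "gamma delta"]
pred = supporting_docs(docs, "beta", toy_check)
print(pred, is_accurate(pred, {0, 1}))
```

Comparing document sets rather than a single label is what lets the evaluation distinguish a claim matched against the right entity from one matched against a namesake.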

3 Method: Producing Molecular Facts

We use a two-step process to refine an atomic fact into a molecular fact using gpt-4-turbo-2024-04-09 Achiam et al. (2023). Our methodology makes the assumption that the ambiguity is typically restricted to a single entity in the claim. This is the case for the datasets we study in this work, described in Section 4.5.

Stage 1: Identifying Ambiguity

We prompt the LLM to identify the primary subject of the claim and to assess potential ambiguities based on its parametric knowledge: does the model know of multiple entities with this name? This step identifies the main subject $\mathbf{s}_i$ of the claim $\mathbf{c}_i$ and provides a disambiguation criterion $\mathbf{b}_i$ for the subject $\mathbf{s}_i$. The disambiguation criterion $\mathbf{b}_i$ can be ‘None’ when there is no ambiguity, or a type of criterion such as profession, birthyear, or location when disambiguation is required.

For example, if the claim is about ‘Charles Osgood’, with multiple possible referents, $\mathbf{s}_i$ is ‘Charles Osgood’, while $\mathbf{b}_i$ could be ‘profession’ or ‘birthyear’ to clarify which Charles Osgood is being referred to. Conversely, if the claim concerns the unambiguous ‘Julius Robert Oppenheimer’, $\mathbf{s}_i$ is ‘Julius Robert Oppenheimer’, and $\mathbf{b}_i$ is ‘None’.

Stage 2: Molecular Facts Generation

We then prompt the LLM to disambiguate the subject $\mathbf{s}_i$ within the claim $\mathbf{c}_i$, harnessing both the identified disambiguation criterion $\mathbf{b}_i$ and the claim’s context $\mathbf{r}$. The output of this stage is a molecular fact $\mathbf{m}_i$ for the atomic claim $\mathbf{c}_i$.

The specifics regarding the prompts used are elaborated upon in Appendix 6 and 7.
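The control flow of the two stages can be sketched as follows. This is only an outline under stated assumptions: `LLM` is a hypothetical text-in, text-out wrapper around a model, the prompt wording and the `subject | criterion` answer format are invented here and are not the prompts from the appendix, and `fake_llm` returns canned replies so the sketch runs without API access.

```python
from typing import Callable, Optional, Tuple

LLM = Callable[[str], str]  # hypothetical text-in, text-out wrapper around a model


def identify_ambiguity(claim: str, llm: LLM) -> Tuple[str, Optional[str]]:
    """Stage 1: return (subject s_i, disambiguation criterion b_i or None)."""
    prompt = (
        "Name the main subject of the claim. If you know of several entities "
        "with that name, also give the type of criterion that would tell them "
        "apart (e.g. profession, birthyear, location); otherwise write None.\n"
        f"Claim: {claim}\n"
        "Answer as: subject | criterion"
    )
    subject, _, criterion = llm(prompt).partition("|")
    criterion = criterion.strip()
    return subject.strip(), (criterion if criterion.lower() not in ("", "none") else None)


def generate_molecular_fact(claim: str, response: str, llm: LLM) -> str:
    """Stage 2: rewrite the claim using b_i and the response r as context."""
    subject, criterion = identify_ambiguity(claim, llm)
    prompt = (
        "Rewrite the claim so that it stands alone, adding as little "
        "information as possible.\n"
        f"Subject: {subject}\n"
        f"Disambiguation criterion: {criterion or 'None'}\n"
        f"Context: {response}\n"
        f"Claim: {claim}\n"
        "Rewritten claim:"
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs offline; not a real model."""
    if prompt.startswith("Name the main subject"):
        return "Charles Osgood | profession"
    return " [standalone rewrite of the claim] "


print(identify_ambiguity("He hosted a radio show.", fake_llm))
print(generate_molecular_fact("He hosted a radio show.", "(response text)", fake_llm))
```

Stage 2 is called even when the criterion is None, since a claim with an unambiguous subject may still need pronoun resolution; whether the original pipeline does the same is an assumption of this sketch.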

3.1 Baselines

We analyze the robustness of fact verification across various systems on the defined criteria of minimality and decontextuality. Outputs for baselines are generated with gpt-4-turbo-2024-04-09.

ATOMIC:

Atomic claims are generated from the LLM’s response using Min et al. (2023).

SIMPLE-DECONTEXT:

Atomic claims are decontextualized with a prompt described in 8 using the LLM’s generated response as context for the atomic claim.

SAFE-DECONTEXT:

Decontextualization of atomic claims is performed using the revision prompt described in Wei et al. (2024).

MOLECULAR-DECONTEXT:

This approach follows the two-stage process described in Section 3 to identify disambiguation criteria and subsequently decontextualize the atomic claim.

Examples of outputs from each method can be found in Figure 3. With this task definition and baseline methodologies, we structure our experiments to analyze the two criteria presented in Section 2.1 in the following sections.

4 Experiment: Minimality & Localization

We begin our analysis of decontextualization with a controlled experiment to illustrate problems with error localization due to loss of minimality discussed in Criterion 2 in Section 2.1. Minimality is more difficult to evaluate than decontextuality. Less minimal facts impact error localization and can potentially lead to errors where an ancillary part of the claim leads to the whole claim being judged as wrong Kamoi et al. (2023a). However, precisely measuring the harms of this is not easy without taking into account the downstream uses of error localization systems such as answer refinement Xu et al. (2023) or fine-tuning Wu et al. (2024); Roit et al. (2023).

To measure the effects in a controlled way, we design a method for synthetic evidence generation as summarized in Figure 2. Our goal is to illustrate when decontextualized atomic facts actually contain multiple facts in a way that could impact error localization. We then study how many of these cases truly show this problem. To study the impact of information addition, we consider two baselines SIMPLE-DECONTEXT and SAFE-DECONTEXT which respectively have less and more restrictive prompts for including new information from the context to revise an atomic claim.

4.1 Controlled Dataset Construction

We now detail the dataset construction process as illustrated in Figure 2. We take a dataset $D$ of 812 claims from the Factcheck-Bench dataset Wang et al. (2024), which consists of long-form ChatGPT responses with human-annotated factuality labels.

Figure 3: Example claims (right) generated by SIMPLE-DECONTEXT, SAFE-DECONTEXT, MOLECULAR-DECONTEXT for the atomic claim derived from the highlighted sentence in the LLM generation (left).

Step 1: Extract Atomic Facts: For each response $\mathbf{r} \in D$, we extract atomic facts $(\mathbf{c}_1, \ldots, \mathbf{c}_n)$ using the method of Min et al. (2023).

Step 2: Decontextualization: We perform decontextualization of the extracted atomic facts using SIMPLE-DECONTEXT and SAFE-DECONTEXT. Let the decontextualization for claim $\mathbf{c}_i$ be denoted as $\mathbf{d}_i$. We refer to the $\mathbf{c}_i$ that $\mathbf{d}_i$ was created from as its core atomic fact; however, note that $\mathbf{d}_i$ might support other facts as well.

Step 3: Identifying Claims with Multiple Atomic Facts: We identify decontextualized claims that entail information of more than one atomic fact. We use the entailment model from Liu et al. (2022) to determine $e(\mathbf{d}_i, \mathbf{c}_j) \in \{\mathrm{supported}, \mathrm{unsupported}\}$; is each $\mathbf{c}_j$ supported by $\mathbf{d}_i$? We retain cases where $e(\mathbf{d}_i, \mathbf{c}_i) = \mathrm{supported}$ and where $|\{j : e(\mathbf{d}_i, \mathbf{c}_j) = \mathrm{supported}\}| \geq 2$; that is, at least two atomic facts are supported by $\mathbf{d}_i$. For example, in Figure 2, the claim ($\mathbf{d}_i$), ‘The “Blackpink in Your Area” compilation album was released in 2018’, is a decontextualized claim derived from the core atomic claim ($\mathbf{c}_i$), ‘The album was released in 2018.’ The decontextualized claim ($\mathbf{d}_i$) entails the core atomic fact ($\mathbf{c}_i$) and an additional atomic fact ($\mathbf{c}_j$), ‘“Blackpink in Your Area” is a compilation album’. Let $D'$ denote this filtered set.
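The filter in Step 3 can be sketched as below. The `entails` callable is a hypothetical stand-in for the entailment model, implemented here as a hand-written lookup over the Figure 2 example so the code runs on its own; `decontextualized[i]` is assumed to be the rewrite of `atomic_facts[i]` from the same response.

```python
from typing import Callable, List, Sequence, Tuple

Entails = Callable[[str, str], bool]  # hypothetical: (premise d_i, hypothesis c_j)


def find_multi_fact_claims(
    decontextualized: Sequence[str],
    atomic_facts: Sequence[str],
    entails: Entails,
) -> List[Tuple[int, List[int]]]:
    """Return (i, supported_js) for every d_i that belongs in the filtered set D'.

    d_i is kept when it entails its own core fact c_i and entails at least
    two atomic facts of the response in total.
    """
    kept = []
    for i, d in enumerate(decontextualized):
        supported = [j for j, c in enumerate(atomic_facts) if entails(d, c)]
        if i in supported and len(supported) >= 2:
            kept.append((i, supported))
    return kept


c0 = '"Blackpink in Your Area" is a compilation album.'
c1 = "The album was released in 2018."
d0 = c0  # already standalone, so its rewrite is unchanged in this toy example
d1 = 'The "Blackpink in Your Area" compilation album was released in 2018.'

# Hand-written entailment judgments standing in for a model's output.
TOY_ENTAILED = {(d0, c0), (d1, c0), (d1, c1)}


def toy_entails(premise: str, hypothesis: str) -> bool:
    return (premise, hypothesis) in TOY_ENTAILED


multi = find_multi_fact_claims([d0, d1], [c0, c1], toy_entails)
print(multi)
```

Only the second rewrite is retained: it entails its core fact and, additionally, the fact that the release is a compilation album.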

Step 4: Generating Evidence for Partial Support: Whenever multiple atomic facts are merged, we could theoretically see a loss in localization capability from a model: if one fact is not supported, the entire claim will be determined to be not supported. To demonstrate this possibility, we now generate evidence that partially supports our multi-fact claims. As an example, in Figure 2, our goal in step 4 is to generate a paragraph that should not include details about “Blackpink in Your Area” being a compilation album. Then, if the statement ‘The album was released in 2018’ is decontextualized to include information about it being a compilation album, this paragraph will enable us to identify this: the evidence will no longer support the decontextualized fact, reflecting a failure of error localization.

By construction of $D'$, $\mathbf{d}_i$ entails at least two facts: its core atomic fact and one or more auxiliary atomic facts. From this set of auxiliary atomic facts, we sample a banned fact $\mathbf{c}_b$. For each $\mathbf{d}_i$, we sample a set of key facts $C_i = \{\mathbf{c}_{i,1}, \ldots, \mathbf{c}_{i,m}\}$ such that $C_i$ contains all atomic facts of the response $\mathbf{r}$ except $\mathbf{c}_b$. We then prompt the LLM to generate an evidence article supporting the facts $C_i$ and not supporting the banned fact $\mathbf{c}_b$. Each of these evidence articles ideally should support all the key facts and not support the banned fact.
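The sampling of banned and key facts can be sketched as follows. The function names are hypothetical, and the prompt text is an invented placeholder rather than the prompt actually used for evidence generation.

```python
import random
from typing import List, Sequence, Tuple


def sample_banned_and_key_facts(
    atomic_facts: Sequence[str],
    core_idx: int,
    entailed_idx: Sequence[int],
    rng: random.Random,
) -> Tuple[str, List[str]]:
    """Sample a banned fact c_b from the auxiliary facts; keep the rest as key facts C_i."""
    auxiliary = [j for j in entailed_idx if j != core_idx]
    banned_idx = rng.choice(auxiliary)
    key_facts = [c for j, c in enumerate(atomic_facts) if j != banned_idx]
    return atomic_facts[banned_idx], key_facts


def build_evidence_prompt(key_facts: Sequence[str], banned_fact: str) -> str:
    """Placeholder prompt; the wording is an assumption for illustration."""
    bullets = "\n".join(f"- {fact}" for fact in key_facts)
    return (
        "Write a short article that supports every fact listed below.\n"
        f"{bullets}\n"
        f"The article must not state or imply the following: {banned_fact}"
    )


facts = [
    '"Blackpink in Your Area" is a compilation album.',
    "The album was released in 2018.",
]
banned, key = sample_banned_and_key_facts(
    facts, core_idx=1, entailed_idx=[0, 1], rng=random.Random(0)
)
print(build_evidence_prompt(key, banned))
```

Passing an explicit `random.Random` instance keeps the sampled banned fact reproducible across runs.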

The prompt for this step is detailed in Figure 10 and other filtering criteria are described in Appendix F. Denote this set where evidence generation is feasible as $E'$.

4.2 Evaluation Criteria

We evaluate the impacts of loss of minimality on the recall of fact-checking. We measure the percentage of cases that change their label from SUPPORTED to NOT_SUPPORTED after decontextualization on the set $E'$. We employ the roberta-large model from AlignScore Zha et al. (2023) as our $\mathrm{Check}()$ function. (We conducted preliminary analysis with GPT-4 as well, and found it gave very similar results.) Using $\mathrm{Check}(D_i, \mathbf{c}_i)$, we identify cases where the core atomic fact is SUPPORTED by the generated evidence while the decontextualization and the banned fact are NOT_SUPPORTED. We call this set auto non-minimal.
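The auto non-minimal test is a three-way condition on the checker's outputs, sketched below. The `check` callable is a hypothetical stand-in for the verifier, and the toy evidence and lookup table are invented so the example is self-contained.

```python
from typing import Callable

Checker = Callable[[str, str], bool]  # hypothetical: (evidence, claim) -> supported?


def is_auto_non_minimal(
    evidence: str,
    core_fact: str,
    decontextualized: str,
    banned_fact: str,
    check: Checker,
) -> bool:
    """Core fact is supported, but the rewrite and the banned fact are not."""
    return (
        check(evidence, core_fact)
        and not check(evidence, decontextualized)
        and not check(evidence, banned_fact)
    )


evidence = "An album titled Blackpink in Your Area came out in 2018."
core = "The album was released in 2018."
rewrite = 'The "Blackpink in Your Area" compilation album was released in 2018.'
banned = '"Blackpink in Your Area" is a compilation album.'

# Hand-written judgments standing in for a verifier's output.
TOY_SUPPORTED = {(evidence, core)}


def toy_check(doc: str, claim: str) -> bool:
    return (doc, claim) in TOY_SUPPORTED


flagged = is_auto_non_minimal(evidence, core, rewrite, banned, toy_check)
print(flagged)
```

Here the rewrite flips from supported to unsupported solely because of the added "compilation album" detail, which is the label change the evaluation counts.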

Baseline            Potential Non-minimal    Auto Non-minimal
SAFE-DECONTEXT      8.49%                    3.94%
SIMPLE-DECONTEXT    23.39%                   13.42%
Table 1: Percentage of overall dataset impacted by minimality loss due to decontextualization leading to prediction changes from SUPPORTED to NOT_SUPPORTED.

4.3 Results

Table 1 shows the fraction of claims which are included in the set $E'$, which yields 8.49% for SAFE-DECONTEXT and 23.39% for SIMPLE-DECONTEXT. We refer to these claims as potential non-minimal claims: they have passed the checks in our pipeline and contain multiple atomic facts. Next we apply the $\mathrm{Check}()$ function to identify auto non-minimal claims, and find that they occur at a rate of 3.94% to 13.42% (Table 1).

Category            Minimal    Non-minimal
SAFE-DECONTEXT      56.2%      43.8%
SIMPLE-DECONTEXT    27.5%      72.5%
Table 2: Human annotation for categorizing the Auto Non-minimal subset into minimal vs. non-minimal.

Subset                 Accuracy     Accuracy       Accuracy           Modification    Avg. Length
                       (Overall)    (SUPPORTED)    (NOT_SUPPORTED)    Rate            (# of words)
ATOMIC                 68.7%        77.5%          22.4%              -               7.61 ± 3.03
SIMPLE-DECONTEXT       76.2%        84.3%          33.6%              99.5%           15.55 ± 5.65
SAFE-DECONTEXT         73.4%        81.3%          31.9%              72.6%           9.86 ± 4.38
MOLECULAR-DECONTEXT    74.7%        81.5%          38.8%              96.8%           14.96 ± 5.6
Table 3: Accuracy measured by $\mathrm{Check}(D_i, \mathbf{m})$, assessing the effectiveness of claim revisions by each baseline against the ambiguous document set associated with the claim’s main entity.

Human Label            SUPPORTED         SUPPORTED          SUPPORTED        NOT_SUPPORTED
Baseline Pred          SUPPORTED         SUPPORTED          NOT_SUPPORTED    SUPPORTED
Matching Type          Multi-Evidence    Single-Evidence    No Evidence      Single/Multiple     Overall
                       matched           Wrong Entity       matched          Evidence matched
                                         matched
ATOMIC                 16.2%             0.8%               1.8%             12.4%               31.1%
SIMPLE-DECONTEXT       7.9%              1.5%               3.9%             10.6%               23.8%
SAFE-DECONTEXT         12.0%             1.0%               2.8%             10.9%               26.6%
MOLECULAR-DECONTEXT    9.2%              1.5%               4.8%             9.8%                25.3%
Table 4: Fine-grained error analysis categorizing baseline mistakes based on the human label of SUPPORTED/NOT_SUPPORTED, along with a categorization into Single/Multi/No-Evidence based on the number of ambiguous evidence docs that support the claim.

4.4 Human Evaluation

Susceptibility to Error Localization

We perform human evaluation on the auto non-minimal claims in Table 1. First, we categorize these into human judgments of whether a claim in this subset is minimal or not in Table 2. We categorize a decontextualization as minimal based on the criteria outlined in Section 2.1. This annotation is performed by the authors of the paper. We find that for SAFE-DECONTEXT, 43.8% of these cases are truly non-minimal in our judgment, which represents 1.7% of the dataset $D$. For the SIMPLE-DECONTEXT baseline, we find that a staggering 72.5% of the auto non-minimal subset represents truly non-minimal claims. This represents 9.6% of the dataset $D$. We note that the remaining fraction of decontextualization cases not identified by the auto methods are those which entail more than one atomic fact, but where the added information is necessary to make the atomic claim standalone.

Decontextualization and Loss of Minimality

We highlight that adding information to a claim does not always make it less entailed by the evidence. In fact, in many cases the added information simply makes the sentence more specific. This is evident from Table 2, which shows that a large percentage of the cases automatically flagged for non-minimality are judged minimal after human evaluation. For instance, in “All taxes must be paid by April 15” → “In the US, all taxes must be paid by April 15”, the addition is necessary for claim specificity.

4.5 Conclusion: Problem of Non-minimality

We find through our controlled experiment and human evaluation that decontextualization can lead to non-minimal cases for between 1.7% and 9.6% of decontextualizations. These cases could cause error localization issues because too much information is added to the claims. In absolute terms, this is a low fraction for the baseline SAFE-DECONTEXT. However, we note that a biography from FActScore Min et al. (2023) contains dozens of atomic facts, meaning that in a single response from an LLM, there can easily be a handful of facts posing localization problems. Given the increasing adoption of the decomposition and decontextualization pipeline for automatic fact verification systems, we argue that multiple localization errors per response are cause to re-examine that pipeline. Next, we analyze tradeoffs between minimality and decontextuality for fact checking of ambiguous biographies.
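To make the scale of this concern concrete, the following back-of-the-envelope sketch multiplies the non-minimality rates reported above by a response size. The value of 36 atomic facts per biography is a hypothetical stand-in for "dozens", and treating facts as independent is a further simplifying assumption that the paper does not make.

```python
# Rough estimate of localization-prone facts per response.
# ASSUMPTIONS (not from the paper): 36 atomic facts per biography, and
# each fact is non-minimal independently of the others.

def expected_nonminimal(num_facts: int, rate: float) -> float:
    """Expected number of non-minimal decontextualizations in one response."""
    return num_facts * rate


def prob_at_least_one(num_facts: int, rate: float) -> float:
    """Probability that at least one fact in the response is non-minimal."""
    return 1.0 - (1.0 - rate) ** num_facts


NUM_FACTS = 36  # hypothetical response size
RATES = {"SAFE-DECONTEXT": 0.017, "SIMPLE-DECONTEXT": 0.096}  # rates from Section 4.4

for name, rate in RATES.items():
    print(
        name,
        round(expected_nonminimal(NUM_FACTS, rate), 2),
        round(prob_at_least_one(NUM_FACTS, rate), 2),
    )
```

Under these assumptions, the expected count ranges from under one fact per response for SAFE-DECONTEXT to roughly three or four for SIMPLE-DECONTEXT.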

5 Experiment: Ambiguous Biographies

We now analyze, on an existing dataset with ambiguous entity references, to what extent our molecular facts add the correct information for decontextualization.

Dataset

We use the ambiguous biographies dataset introduced in Chiang and yi Lee (2024), which comprises biographies generated by LLMs for multiple entities that share similar names, such as Dick Hanley (swimmer) and Dick Hanley (footballer). In this dataset, we denote the biographies generated by the LLMs as $\mathbf{r}$, and $\mathbf{c}_i$ denotes the atomic claims generated using the methodology outlined in Min et al. (2023). For this setting, we define each claim to have a subject $\mathbf{s}_i$, which is ambiguous due to the nature of the dataset. The dataset provides a set of evidence documents sourced from the Wikipedia disambiguation page of the subject, $D_i = \{D_{i,1}, D_{i,2}, \ldots\}$, for subjects sharing similar names with $\mathbf{s}_i$. This dataset is suitable for evaluating decontextuality as it has two properties: (i) atomic claims that require decontextualization (such as entity specification, noun completion), (ii) multiple entities with the same name that require additional disambiguation such as specifying location, occupation, or time-period.

Our goal is to verify the claims with the set of documents using $\mathrm{Check}()$. We randomly sample 726 claims from the human-annotated set for this study, each of which belongs to either the SUPPORTED or the NOT_SUPPORTED category. For each claim, we construct a revision using the methods and baselines described in Section 3 and compare the prediction with human labels.

Evaluation Criteria

We evaluate our judgment of a claim on two axes: (1) whether it aligns with the human annotation of SUPPORTED or NOT_SUPPORTED, and (2) whether it is supported by the correct evidence. For each evidence document associated with the claim, we compute $p_{i,k} = \mathrm{Check}(D_{i,k}, c_i)$, where $c_i$ is the claim processed by the particular baseline and $k$ indexes the $k$th ambiguous-subject-related document for the claim. We consider the judgment $p_{i,k}$ to be correct only if the prediction of the claim matches the human label and the prediction is supported by the correct entity’s evidence document.
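The criterion above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `check` is a hypothetical stand-in for $\mathrm{Check}()$, `gold_index` is a hypothetical index of the correct entity's document, and the rule that a SUPPORTED claim must be supported by the gold document alone is our reading of the error categories in Table 4.

```python
from typing import Callable, Sequence


def is_judgment_correct(
    claim: str,
    documents: Sequence[str],
    human_label: str,
    gold_index: int,
    check: Callable[[str, str], bool],
) -> bool:
    """Return True if the per-document checks yield a correct judgment.

    `check(document, claim)` is a hypothetical stand-in for Check().
    """
    # p[k] plays the role of p_{i,k} = Check(D_{i,k}, c_i).
    p = [check(doc, claim) for doc in documents]
    supporting = [k for k, supported in enumerate(p) if supported]

    if human_label == "NOT_SUPPORTED":
        # Correct only when no candidate document supports the claim.
        return not supporting
    # SUPPORTED: assumed correct only when the gold entity's document
    # is the single supporting document.
    return supporting == [gold_index]
```

Requiring a single supporting document is the stricter of two possible readings; relaxing it to `gold_index in supporting` would count multi-evidence matches as correct.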

Baseline Minimal ↑ Non-Minimal ↓ Ambig. ↓
SIMPLE 16.0% 56.0% 28.0%
SAFE 24.0% 0.0% 76.0%
MOLECULAR 52.0% 24.0% 24.0%
Table 5: Human analysis of decontextualized claims for all baselines on the axis of minimality and ambiguity.

6 Results: Ambiguous Biographies

Table 3 presents the results of this experiment. All decontextualization baselines yield higher accuracy than atomic claims across all subsets. We see that the Molecular and Simple decontextualization methods have a higher proclivity to modify the atomic claims than the SAFE decontextualization baseline. Consequently, the average sentence lengths of the former methods are also larger than that of the SAFE baseline. Higher degrees of modification generally lead to higher accuracy. All three methods are on a Pareto frontier of length versus accuracy.

However, accuracy using the $\mathrm{Check}()$ function does not incorporate minimality. We investigate the minimality of the baselines by performing a human evaluation of 25 randomly sampled claims, reported in Table 5. We see that the baseline SIMPLE-DECONTEXT has a larger fraction of non-minimal and ambiguous claims than MOLECULAR-DECONTEXT. The analysis in Section 4.4 shows that SAFE-DECONTEXT is more minimal than SIMPLE-DECONTEXT; however, it struggles with ambiguity.

Overall, we observe that molecular claims strike a balance by maintaining minimality with ambiguity removal and improving accuracy. They are significantly more minimal than SIMPLE-DECONTEXT and more performant in ambiguous generations than SAFE-DECONTEXT.

Error breakdown

To analyze the nature of the errors encountered, we detail a case-wise error distribution in Table 4. Specifically, we study how often each baseline mispredicts the label as SUPPORTED or NOT_SUPPORTED relative to the human annotation. Note that due to the ambiguous nature of this dataset, claims may be erroneously validated by several distracting pieces of evidence. Therefore, we further partition the error analysis table to reflect the model’s prediction on (i) Single/Multi/No Evidence: whether a claim is supported by single, multiple, or no pieces of evidence, and (ii) Correct/Wrong Entity: whether the set of supporting evidence contains the accurate evidence with which the claim ought to be aligned. Overall, all decontextualization methods show a lower error rate than atomic claims.
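The partition used in Table 4 can be sketched as a small bucketing function. This is an illustrative reading of the table's column headers rather than the authors' code: the category names, the `gold_index` argument, and the boolean per-document inputs are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def error_category(
    supported_by: Sequence[bool], human_label: str, gold_index: int
) -> str:
    """Map one claim to a Table 4 style bucket (names are hypothetical).

    `supported_by[k]` says whether the k-th ambiguous document supports the claim.
    """
    matched = [k for k, supported in enumerate(supported_by) if supported]
    if human_label == "NOT_SUPPORTED":
        # Any supporting document makes the claim wrongly predicted SUPPORTED.
        return "single_or_multi_evidence_matched" if matched else "correct"
    if not matched:
        return "no_evidence_matched"  # wrongly predicted NOT_SUPPORTED
    if len(matched) > 1:
        return "multi_evidence_matched"
    if matched[0] != gold_index:
        return "single_evidence_wrong_entity"
    return "correct"


def error_rates(categories: Sequence[str]) -> dict:
    """Fraction of claims in each error bucket, excluding correct ones."""
    counts = Counter(categories)
    total = len(categories)
    return {name: n / total for name, n in counts.items() if name != "correct"}
```

Summing the values returned by `error_rates` gives an overall error rate analogous to the last column of Table 4.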

Baseline Pair Overlap
ATOM & SIMPLE-DECONTEXT 7%
ATOM & SAFE-DECONTEXT 44%
ATOM & MOLECULAR-DECONTEXT 15%
SIMPLE-DECONTEXT & SAFE-DECONTEXT 27%
SIMPLE-DECONTEXT & MOLECULAR-DECONTEXT 36%
MOLECULAR-DECONTEXT & SAFE-DECONTEXT 32%
Table 6: Information overlap between baselines as measured by bi-directional entailment.

Information Overlap

We perform an information overlap analysis, shown in Table 6: using the model from Liu et al. (2022), we check bidirectional entailment and report the fraction of cases where the information is equivalent between two baselines Gunjal and Durrett (2023). We find that in a large fraction of cases, each baseline adds different information to modify the atomic claim. SAFE-DECONTEXT modifies the claims the least but suffers from ambiguity, while SIMPLE-DECONTEXT modifies them the most at the cost of minimality.
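The overlap measure can be sketched as below. This is a minimal illustration assuming that two revisions count as equivalent when each entails the other; `entails` is a hypothetical callable standing in for the entailment model, and `toy_entails` is a word-overlap placeholder used only to keep the example runnable.

```python
from typing import Callable, Sequence


def information_overlap(
    claims_a: Sequence[str],
    claims_b: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of aligned claim pairs that entail each other in both directions."""
    if len(claims_a) != len(claims_b):
        raise ValueError("Both baselines must revise the same set of claims.")
    if not claims_a:
        return 0.0
    equivalent = sum(
        1 for a, b in zip(claims_a, claims_b) if entails(a, b) and entails(b, a)
    )
    return equivalent / len(claims_a)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in: every word of the hypothesis appears in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())
```

In practice `toy_entails` would be replaced by a thresholded call to an NLI model; the threshold and model interface are left unspecified here because the source does not state them.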

7 Related Work

Recent research in factuality verification of LLM generations advocates decomposing LLM generations into atomic facts or subclaims and verifying each against retrieved evidence Min et al. (2023); Kamoi et al. (2023b); Fabbri et al. (2022). End-to-end pipelines for factuality verification have been proposed, involving steps such as claim extraction, revision, determining checkworthiness, evidence retrieval, and verification Wang et al. (2024); Chern et al. (2023); Wei et al. (2024); Chen et al. (2024). These papers often evaluate on recently-released datasets of errors in generations Liu et al. (2023a); Malaviya et al. (2024); Chen et al. (2023a). Our work comments on the decontextualization step frequently used in these pipelines.

Our work fits into a broader ecosystem of techniques in this area. Gao et al. (2023b) enable LLMs to generate text with citations. For faithful LLM generations, Gao et al. (2023a) use evidence retrieval for revision, and He et al. (2022) utilize chain-of-thought coupled with retrieval for faithful explanations. Fine-tuned systems, such as that by Zha et al. (2023), predict alignment scores for verification, while Tang et al. (2024) propose LLM-AggreFact for sentence-level factuality labels. Wanner et al. (2024) find that evaluation metrics for fact verification are sensitive to the claim decomposition method used.

Prior work on decontextualization has investigated basic notions like anaphora resolution Choi et al. (2021), question answering frameworks Newman et al. (2023), and extract-then-decontextualize methods for summarization Potluri et al. (2023). In fact verification, atomic claims are made standalone before evidence retrieval via decontextualization Wang et al. (2024) or claim revision Wei et al. (2024). Decontextualization is also used to resolve ambiguity Zhang and Choi (2021); Lee et al. (2024); our work shares this focus.

8 Conclusion

We introduce molecular facts and the desiderata of decontextualization in LLM fact verification. We define the criteria of decontextuality and minimality in this context. Through a controlled experiment, we show that localization errors due to loss of minimality under decontextualization are sensitive to the method used. We propose a method for generating “molecular facts” and find that they improve fact verification precision for claims from generations about ambiguous entities. We show that molecular facts strike a balance between maintaining minimality and accuracy of fact verification.

Limitations

Scope

We illustrate the phenomenon of ambiguity in atomic claims; however, our main evaluation of molecular facts is in the domain of English-language biographies. This is due to the availability of the dataset, Wikipedia evidence, and the prevalence of biography benchmarks in recent work. Conceptually, the ambiguity in the subject or predicate of the claim can be extended to other realistic datasets, but we leave that exploration to future work. Relatedly, we focus on entity ambiguity for illustration of our method. There may be other types of ambiguities that molecular fact generation can address in other contexts and other datasets.

Furthermore, we focus our experiments on high-performing LLMs in this work. The extension of decontextualization and molecular fact generation to smaller, open-source models and the improvement in this regime is a good subject for further study.

Finally, we believe our approach should be evaluated fully end-to-end in an LLM pipeline that generates responses and then verifies their factuality. However, despite substantial research in these directions, we are not aware of an off-the-shelf experimental pipeline that is usable for this setting.

Decomposition Quality

We do not consider the errors introduced due to poor decomposition of atomic facts in this work. It is possible that some of these errors are resolved due to decontextualization or disambiguation implicitly, but we do not make any specific claims about this.

Coverage of Domains and Languages

The datasets utilized for ambiguous biographies are limited to English-language claims focused on English-centric concepts within Wikipedia. Similarly, the synthetic data generation experiment for minimality analysis is confined to English language outputs and relies on GPT-4’s parametric knowledge, which may limit the breadth of topics and domains covered.

Acknowledgments

We thank Jessy Li for comments and feedback on the initial draft of this work. This work was supported by NSF CAREER Award IIS-2145280 and the NSF AI Institute for Foundations of Machine Learning (IFML). This material is also based on research that is in part supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Chen et al. (2024) Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. 2024. Complex claim verification with evidence retrieved in the wild. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
  • Chen et al. (2023a) Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023a. FELM: Benchmarking Factuality Evaluation of Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Chen et al. (2023b) Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, and Tal Schuster. 2023b. PropSegmEnt: A Large-Scale Corpus for Proposition-level Segmentation and Entailment Recognition. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8874–8893, Toronto, Canada. Association for Computational Linguistics.
  • Chern et al. (2023) I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, et al. 2023. FacTool: Factuality Detection in Generative AI–A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. arXiv preprint arXiv:2307.13528.
  • Chiang and yi Lee (2024) Cheng-Han Chiang and Hung yi Lee. 2024. Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations. arXiv 2402.05629.
  • Choi et al. (2021) Eunsol Choi, Jennimaria Palomaki, Matthew Lamm, Tom Kwiatkowski, Dipanjan Das, and Michael Collins. 2021. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics, 9:447–461.
  • Fabbri et al. (2022) Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, Seattle, United States. Association for Computational Linguistics.
  • Falke et al. (2019) Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.
  • Gao et al. (2023a) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023a. RARR: Researching and Revising What Language Models Say, Using Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.
  • Gao et al. (2023b) Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023b. Enabling Large Language Models to Generate Text with Citations. In Empirical Methods in Natural Language Processing (EMNLP).
  • Goyal and Durrett (2021) Tanya Goyal and Greg Durrett. 2021. Annotating and modeling fine-grained factuality in summarization. In Proceedings of NAACL.
  • Gunjal and Durrett (2023) Anisha Gunjal and Greg Durrett. 2023. Drafting Event Schemas using Language Models. arXiv preprint arXiv:2305.14847.
  • He et al. (2022) Hangfeng He, Hongming Zhang, and Dan Roth. 2022. Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303.
  • Ji et al. (2022) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys.
  • Kamoi et al. (2023a) Ryo Kamoi, Tanya Goyal, and Greg Durrett. 2023a. Shortcomings of question answering based factuality frameworks for error localization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 132–146.
  • Kamoi et al. (2023b) Ryo Kamoi, Tanya Goyal, Juan Rodriguez, and Greg Durrett. 2023b. WiCE: Real-World Entailment for Claims in Wikipedia. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7561–7583, Singapore. Association for Computational Linguistics.
  • Lee et al. (2024) Yoonsang Lee, Xi Ye, and Eunsol Choi. 2024. Ambigdocs: Reasoning across documents on different entities under the same name. arXiv 2404.12447.
  • Li et al. (2016) Junyi Jessy Li, Bridget O’Daniel, Yi Wu, Wenli Zhao, and Ani Nenkova. 2016. Improving the annotation of sentence specificity. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3921–3927.
  • Liu et al. (2022) Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. 2022. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826–6847, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Liu et al. (2023a) Nelson Liu, Tianyi Zhang, and Percy Liang. 2023a. Evaluating Verifiability in Generative Search Engines. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore. Association for Computational Linguistics.
  • Liu et al. (2023b) Yixin Liu, Alex Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir Radev. 2023b. Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4140–4170, Toronto, Canada. Association for Computational Linguistics.
  • Louis and Nenkova (2012) Annie Louis and Ani Nenkova. 2012. A corpus of general and specific sentences from news. In LREC, volume 1818, page 10. Citeseer.
  • Malaviya et al. (2024) Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. 2024. ExpertQA: Expert-curated questions and attributed answers. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
  • Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
  • Nenkova and Passonneau (2004) Ani Nenkova and Rebecca J Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the human language technology conference of the north american chapter of the association for computational linguistics: Hlt-naacl 2004, pages 145–152.
  • Newman et al. (2023) Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, and Kyle Lo. 2023. A question answering framework for decontextualizing user-facing snippets from scientific documents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3194–3212.
  • Pezzelle (2023) Sandro Pezzelle. 2023. Dealing with semantic underspecification in multimodal nlp. arXiv preprint arXiv:2306.05240.
  • Potluri et al. (2023) Abhilash Potluri, Fangyuan Xu, and Eunsol Choi. 2023. Concise answers to complex questions: Summarization of long-form answers. arXiv preprint arXiv:2305.19271.
  • Rashkin et al. (2021) Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and D. Reitter. 2021. Measuring Attribution in Natural Language Generation Models. Computational Linguistics, 49:777–840.
  • Roit et al. (2023) Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Leonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos Garea, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, and Idan Szpektor. 2023. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6252–6272, Toronto, Canada. Association for Computational Linguistics.
  • Schilder (1998) Frank Schilder. 1998. An underspecified segmented discourse representation theory (usdrt). In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, pages 1188–1192.
  • Tang et al. (2024) Liyan Tang, Philippe Laban, and Greg Durrett. 2024. MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. arXiv preprint arXiv:2404.10774.
  • Wang et al. (2024) Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, and Preslav Nakov. 2024. Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers.
  • Wanner et al. (2024) Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, and Benjamin Van Durme. 2024. A Closer Look at Claim Decomposition. arXiv preprint arXiv:2403.11903.
  • Wei et al. (2024) Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. 2024. Long-form factuality in large language models. arXiv preprint arXiv:2403.18802.
  • Wu et al. (2024) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2024. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36.
  • Xu et al. (2023) Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, and Markus Freitag. 2023. Pinpoint, not criticize: Refining large language models via fine-grained actionable feedback. arXiv preprint arXiv:2311.09336.
  • Zha et al. (2023) Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics.
  • Zhang and Choi (2021) Michael J.Q. Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Zhang et al. (2024) Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. 2024. How Language Model Hallucinations Can Snowball. In Forty-first International Conference on Machine Learning.
  • Zhang and Bansal (2021) Shiyue Zhang and Mohit Bansal. 2021. Finding a Balanced Degree of Automation for Summary Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6617–6632, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Appendix A Prompts

We give details on all the prompts used throughout this work.

Decontextuality Experiment Prompts

The step-wise molecular fact generation prompts for MOLECULAR-DECONTEXT are in Figures 6 and 7. For the simple decontextualization baseline SIMPLE-DECONTEXT, the prompt is provided in Figure 8.

Minimality Experiment Prompts

The prompt for generating controlled evidence for the minimality experiment is given in Figure 10.

Appendix B Additional Related Work

Decomposition in Text Summarization

Decomposition of responses is also prevalent in the text summarization literature. Nenkova and Passonneau (2004) introduced the Pyramid protocol for summarization evaluation, which extracts weighted Summarization Content Units (SCUs) that represent the importance of various facts present in multiple human-generated summaries of a text. Zhang and Bansal (2021) propose using Semantic Triplet Units (STUs), which are summary content units generated automatically using SRL parsers, to evaluate generated summaries with textual entailment models. Similarly, Liu et al. (2023b) propose Atomic Content Units (ACUs) as a new summarization salience protocol that allows for higher inter-annotator agreement. Chen et al. (2023b) propose using entailment judgments on a set of sentence propositions within a document.

Decontextualization and Specificity

Decontextualization is the process of making sentences stand-alone by resolving missing context while preserving their meaning Choi et al. (2021). A related phenomenon is the notion of specificity. Louis and Nenkova (2012) presented the first corpus of sentences distinguished by whether they are general or specific. Their classification was based on examples and intuition: general sentences are broad statements about a topic that need additional evidence or examples for a reader to understand, whereas specific sentences can stand by themselves. Li et al. (2016) sharpen this definition by grounding the specificity of a sentence in three requirements: (i) it is easy to understand the meaning and identity of the intended references without ambiguity; (ii) the truth of the statement can be assessed based on the sentence itself and general shared knowledge; and (iii) the sentence fully expresses key information about the participants and causes of an event. Another related notion is underspecification in discourse, which is an intentional feature to maintain communication efficiency Schilder (1998). This has been annotated by Li et al. (2016) and highlighted in a multimodal setting by Pezzelle (2023).

Appendix C Human Annotation Criteria for Categorizing the Non-minimal Subset

We describe the criteria for annotating the auto non-minimal subset into minimal vs. non-minimal, as shown in Table 2. For each instance, we compare the original claim, the decontextualization, and the banned fact. We label cases as minimal when either of the following applies: (1) The banned fact is closely related to the atomic fact and is a necessary addition to make the atomic claim standalone. In other words, the banned fact is needed to add context and/or resolve ambiguity. For example, “The album is their first full-length studio album.” is decontextualized to “The album released in 2020 is Blackpink’s first full-length studio album.” and the banned fact is “The album was released in 2020.” The information in the banned fact is a necessary addition to disambiguate “the album” in this case. (2) The banned fact is entailed by the decontextualization, but only due to an entailment error. For example, the decontextualization “Mey Eden, one of the largest bottled water companies in Israel, offers flavored water products.” is erroneously judged to entail the banned fact “Mey Eden offers still water products.”

Appendix D Human Analysis Criteria for Categorizing Minimality and Ambiguity

We describe the criteria for the human analysis of each baseline’s decontextualizations on the axes of minimality and ambiguity, shown in Table 5. We categorize a claim decontextualization as non-minimal when it contains additional information that goes beyond making the sentence stand-alone and can potentially cause loss of error localization. We categorize a claim decontextualization as ambiguous when it lacks clarifications for entities that could refer to different ambiguous subjects or does not add enough context to disambiguate the main entity. If neither of the above conditions applies, we categorize the decontextualization as minimal.

Appendix E Models, Datasets and Computation Cost

The gpt-4-turbo-2024-04-09 model was employed for running baselines and generating outputs, while the gpt-3.5-turbo model was used for evaluation through FActScore Achiam et al. (2023). For generation experiments, we set the temperature to 0.75. The total cost for generating decontextualizations and evaluating the ambiguous biography experiment was approximately $120.

In the minimality experiment, gpt-3.5-turbo was used to extract atomic facts, and gpt-4-turbo-2024-04-09 was used for decontextualization and generation tasks. This resulted in a total cost of around $100. We use an NVIDIA A40 GPU for evaluation using AlignScore Zha et al. (2023) and entailment computation using WANLI Liu et al. (2022).

We use ChatGPT for improving writing and formatting and for generating boilerplate code for figure generation in this paper.

We use the open-source dataset published by Wang et al. (2024) under the Apache 2.0 license. We also use the open-source codebase of FActScore Min et al. (2023) for evaluations, which is published under the MIT License, and AlignScore Zha et al. (2023), also published under the MIT License.

Figure 4: Variation in accuracy for different fact-checking methods as the offset from the entity switch point changes. Each line represents a method, with the solid lines indicating the method’s accuracy at different offsets, and the dashed lines representing the overall accuracy of the method. The silver star represents the performance of human-in-the-loop molecular claim generation.

Appendix F Controlled Experiment on Minimality Generation Details

Filtering Criteria applied in Step 3

Before filtering claims that are supported by more than two atomic facts, we discard cases where one atomic fact is a substring of another.

Filtering Criteria applied in Step 4

We detail the filtering criteria applied in evidence generation for partial support, described in Section 4.1. After we sample a set of key facts $C_i = \{\mathbf{c}_{i,1}, \ldots, \mathbf{c}_{i,m}\}$ such that $C_i$ contains all the atomic facts of the response $r$ except $\mathbf{c}_b$, we also apply a filtering criterion to remove cases where the banned fact and any of the key facts are similar; i.e., for $\mathbf{c}_{i,k}$ in $C_i$, we filter cases where $e(\mathbf{c}_{i,k}, \mathbf{c}_b) = \mathrm{supported}$. At the end of step 4, after we prompt the LLM to generate an evidence article, we also account for generation errors and remove the cases where the banned fact is supported by the generated evidence.
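The two filters described above can be sketched as a single predicate. This is an illustrative sketch only: `e` is a hypothetical callable standing in for the entailment function, its argument order simply mirrors $e(\mathbf{c}_{i,k}, \mathbf{c}_b)$ in the text, and applying the same function to the generated evidence is an assumption.

```python
from typing import Callable, Sequence


def keep_example(
    key_facts: Sequence[str],
    banned_fact: str,
    evidence: str,
    e: Callable[[str, str], str],
) -> bool:
    """Return True if the example survives both Step 4 filters.

    `e(text, fact)` is a hypothetical entailment function returning
    "supported" or "not_supported".
    """
    # Filter 1: drop the example if any key fact already supports the banned fact.
    if any(e(fact, banned_fact) == "supported" for fact in key_facts):
        return False
    # Filter 2: drop the example if the generated evidence leaks the banned fact.
    if e(evidence, banned_fact) == "supported":
        return False
    return True
```

Both checks are conservative: an example is discarded whenever the banned fact could be verified from material that was meant to exclude it.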

Appendix G Remaining Challenges

To shed light on the remaining challenges, we focus on one of the most challenging scenarios for decontextualization. In the ambiguous biography dataset from Chiang and yi Lee (2024), we often observe what we call an entity switch point: a claim $\mathbf{c}_i$ that draws on information about entity B, when sentences $\mathbf{c}_{<i}$ all refer to entity A. This is where decontextualization is crucial to recognize that $\mathbf{c}_i$ in context does not refer to the correct entity.

Molecular claims recover fastest at the entity-switching point

We investigate the performance of baselines under the lens of ambiguity resolution. Note that these results are reported on baselines tested with gpt3.5-turbo. We find that the dataset of ambiguous biographies becomes the most confusing at the entity switch point. Figure 4 shows a significant performance drop at the switch across all methods. Basic decontextualization methods (DECONTEXT, SAFE-DECONTEXT) perform the worst, underperforming the ATOMIC baseline at the switch, but molecular claims, which incorporate richer disambiguation information, show relative robustness, improving by 3.5% over the most effective decontextualization approach (SAFE-DECONTEXT).
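The per-offset analysis can be sketched as below, assuming each verified claim is stored as a `(claim_index, switch_index, is_correct)` record. The record layout and the function names are hypothetical and chosen only for illustration; the toy records are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (index of the claim in its passage, index of the entity switch point
#  in that passage, whether the fact-checker's label was correct)
Record = Tuple[int, int, bool]


def accuracy_by_offset(records: Iterable[Record]) -> Dict[int, float]:
    """Group claims by their offset from the switch point and average correctness."""
    counts = defaultdict(lambda: [0, 0])  # offset -> [num_correct, num_total]
    for claim_index, switch_index, is_correct in records:
        offset = claim_index - switch_index  # 0 is the switch point itself
        counts[offset][0] += int(is_correct)
        counts[offset][1] += 1
    return {
        offset: correct / total
        for offset, (correct, total) in sorted(counts.items())
    }


def overall_accuracy(records: List[Record]) -> float:
    """Accuracy over all claims, regardless of offset."""
    if not records:
        raise ValueError("no records to score")
    return sum(int(is_correct) for _, _, is_correct in records) / len(records)


# Toy records for two passages whose switch points are at indices 2 and 1.
toy_records: List[Record] = [
    (0, 2, True), (1, 2, True), (2, 2, False), (3, 2, True),
    (1, 1, False), (2, 1, True),
]
per_offset = accuracy_by_offset(toy_records)
overall = overall_accuracy(toy_records)
```

Computing one such curve per method, together with its overall accuracy, yields the kind of comparison plotted in Figure 4.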

Figure 5: Changing preferences in the selection of disambiguating facts by molecular decontextualization for long-form generation with hallucinations.

Gap from human performance

To estimate an upper bound on performance at the entity switch point in Figure 4, we generate molecular claims at the entity switch point with weak human-in-the-loop supervision. We use the prompt shown in Figure 9, which has access to gold disambiguations from Wikipedia about the entities in the passage. Even with weak human supervision, this method performs significantly better than automated decontextualization methods, highlighting this limitation of current fact-checking pipelines.

Figure 6: Ambiguity detection prompt for detecting ambiguous entities and generating disambiguation guidelines for molecular claim generation, used by the baselines MOLECULAR and MOLECULAR-GPT4.
Figure 7: Molecular decontextualization prompt for the baselines MOLECULAR and MOLECULAR-GPT4.
Figure 8: Decontextualization prompt for the baseline SIMPLE-DECONTEXT.
Figure 9: Silver-label ambiguity detection prompt for detecting ambiguous entities and generating disambiguation guidelines for molecular claim generation, used by the baselines MOLECULAR and MOLECULAR-GPT4.
Figure 10: Prompt for controlled evidence generation to generate articles that incorporate key facts and avoid banned facts.