LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation

Longchao Da¹, Tie** Chen¹, Lu Cheng², Hua Wei¹†
¹Arizona State University, ²University of Illinois Chicago
{longchao, tchen169, hua.wei}@asu.edu, [email protected]
† Corresponding author.
Abstract

Large language models (LLMs) have showcased superior capabilities in sophisticated tasks across various domains. Beyond basic question answering (QA), they are now used as decision assistants and as explainers of unfamiliar content. However, they are not always correct, owing to data sparsity in domain-specific corpora or to model hallucination. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate uncertainty that captures directional instability: we construct a directed graph from entailment probabilities and, given the asymmetry of this graph, apply a Random Walk Laplacian; the uncertainty is then aggregated from the eigenvalues of that Laplacian. We also provide a way to combine the semantic uncertainty of existing work with our proposed layer. In addition, this paper identifies vagueness issues in the raw response set and proposes an augmentation approach to mitigate them. We conduct extensive empirical experiments that demonstrate the superiority of our proposed solutions.



1 Introduction

Large Language Models (LLMs) Chang et al. (2024) have attracted widespread attention: they demonstrate superior performance on various tasks and have even proved able to conduct human-like conversations, breaking the Turing Test Biever (2023). Opinions on these emerging techniques differ, ranging from skeptical to accepting Kambhampati et al. (2024); Valmeekam et al. (2022), but one major concern is commonly acknowledged: the trustworthiness of LLM responses is not guaranteed Sun et al. (2024); Liu et al. (2023); Huang et al. (2024). Trustworthiness has become a key obstacle to deploying LLMs in critical scenarios such as healthcare Yang et al. (2023); Wang et al. (2024b), autonomous control Wang et al. (2024a), and intelligent planning Kambhampati et al. (2024); Da et al. (2024b).

This has led many researchers to investigate uncertainty quantification (UQ) approaches to better understand LLM inferences, e.g., how well they estimate system dynamics given some context information Da et al. (2024a). However, UQ in Natural Language Generation (NLG) brings distinct challenges owing to intrinsic semantic features, linguistic ambiguity, and complex output structures Lin et al. (2023). Another challenge in UQ for LLMs lies in the limited access to commercial large models: unavailable model parameters and true prediction probabilities greatly hinder intrinsic profiling of the model’s behavior and make white-box uncertainty evaluation unachievable Balloccu et al. (2024).

Figure 1: The left part is an example of directional entailment logic: (R1, Q) $\vdash$ (R2, Q) denotes the probability that R1 entails R2 given the context of question Q. The right part shows the difference between existing symmetric similarity and our proposed directed relations.

Figure 2: The overall directional uncertainty quantification (UQ) framework of D-UE (right) compared with traditional symmetric similarity-based uncertainty evaluation (left). The traditional method uses symmetric similarity and feeds it into an estimator (e.g., Numset, Symmetric Laplacian, etc.) that perceives only the monotonous semantic uncertainty $U^{s}$, while D-UE perceives both directions of entailment between response pairs, enhanced by text similarity; the Random Walk Laplacian is applied specifically to handle the complex, asymmetric property. After the Random Walk Laplacian, we derive the eigenvalues $\lambda_{k}$ and aggregate them following Eq. 9 as the final uncertainty measurement $U^{d}_{Eigv}$. We also provide a way to fairly consider both semantic uncertainty and directional uncertainty in Section 4.4.

Alternatively, researchers resort to black-box quantification Lin et al. (2023) within limited question and response sets. A common practice for black-box evaluation is to first build a similarity matrix from a set of responses and then detect inconsistency among these responses, either by applying a graph Laplacian or by analyzing the entropy of the response set Kuhn et al. (2023). However, the existing literature considers only ‘how similar’ two responses are when constructing the matrix, which assumes that the similarity from response A to B is the same as that from B to A. In fact, linguistic studies show that a pair of sentences carries directional logical information. For example, as shown in Fig. 1, a single pair of responses yields two dramatically different entailment probabilities (measured by the NLI model; footnote 1: https://huggingface.co/microsoft/deberta-v3-large). This implies that the response set contains direction information that existing work neglects, either by taking the mean of the two directions or by relying on undirected semantic measures.

In this paper, we propose Directed Uncertainty Evaluation (D-UE), a novel method that builds a directed graph weighted by entailment probabilities to construct a more nuanced relationship among responses, one that captures the direction of entailment and carries semantic similarity at the same time. We also observe that the generated responses themselves may be vague, which brings further challenges to the UQ process. We therefore propose a claim-based augmentation method that reduces this vagueness and mines the model’s real awareness of a question, further enhancing UQ for LLMs.

2 Related Work

The first branch of research addresses UQ in LLMs by inducing the models to output their uncertainty along with the response Kadavath et al. (2022); Lin et al. (2022); Mielke et al. (2020); Tian et al. (2023). Most of this literature requires token-level probabilities from the LLM to train (or fine-tune) a predictor of uncertainty. This is a straightforward solution when one has full access to the model structure and weights, but it can be unwieldy because it is time-consuming and resource-intensive. Another method Kuhn et al. (2023) estimates LLM uncertainty directly from response-level semantic entropy, yet it still requires token-level probability values as input, which are hard to access for black-box or commercial language models.

For fast and lightweight evaluation, some researchers propose to treat the LLM as a black box and analyze the consistency of the semantic structure of its responses. Lin et al. (2023) first analyze UQ from text responses alone, treating the sum of eigenvalues from the graph Laplacian as the uncertainty indicator. Chen and Mueller (2023) identify unreliable or speculative answers by computing a confidence score for the model’s generated outputs. However, these works analyze UQ solely from semantics, and Lin et al. (2023) average the entailment probabilities of the two directions to construct a similarity matrix. In this paper, we find that claims, together with semantic information, contribute to a more comprehensive uncertainty quantification, and that directional logic is not negligible in a nuanced analysis of the intrinsic structure of responses.

3 Preliminaries

In this section, we formalize uncertainty evaluation in LLMs. Let $\mathcal{M}$ be a general LLM, trained with a certain network structure and containing a parameter set $\theta$. During prompt-based inference, an input $x$ is provided to $\mathcal{M}$, and the model produces a sequence of tokens denoted $\hat{y}$ from a probability distribution $\mathbf{p}(\hat{y}|x,\theta)$. This probability distribution plays a key role in understanding the characteristics of $\mathcal{M}$.

For those who train the model from scratch or use fully open-source models, the probability logits (or even the model parameters) are available for analysis and evaluation; this branch of practice is called White Box evaluation. When, for commercial, competitive, or other reasons, there is no direct access to the probability logits, uncertainty evaluation is called Black Box evaluation.

3.1 White Box Evaluation

Traditionally, researchers could conduct uncertainty evaluation with gradient norms, taken with respect to either the input or the parameters: $U_{\text{grad-input}}(x)=\left\|\nabla_{x}\mathcal{L}(\hat{y},y)\right\|_{2}$ and $U_{\text{grad-param}}(x)=\left\|\nabla_{\theta}\mathcal{L}(\hat{y},y)\right\|_{2}$, where $\mathcal{L}(\cdot,\cdot)$ is the loss function between the predicted output $\hat{y}$ and the ground truth $y$. The gradient is calculated from these two aspects to understand the model’s sensitivity. This requires the true label $y$, which is not available for open-ended questions or prompts. An alternative solution is to evaluate inconsistency based on a list of responses $\textbf{Y}=\{y_{1},y_{2},y_{3},\ldots,y_{n}\}$ and their probabilities $\textbf{P}$, for example through entropy: $U(x)=H(\textbf{Y}|x)=-\sum_{\textbf{y}}p(\textbf{y}|x)\log p(\textbf{y}|x)$, where $x$ is the input and each $\textbf{y}\in\textbf{Y}$ is a sequence of generated tokens (a response) Kuhn et al. (2023); Sun et al. (2019); Abdar et al. (2021).
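The entropy-based measure above can be illustrated with a minimal sketch. The function name `predictive_entropy` and the probability values are hypothetical and chosen only for illustration; the sketch assumes the response probabilities are already available and normalized, which is exactly the white-box requirement discussed in this section.

```python
import math

def predictive_entropy(probs):
    """Entropy H = -sum_y p(y|x) log p(y|x) over a normalized distribution."""
    total = sum(probs)
    if not math.isclose(total, 1.0, rel_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 contribute 0 by convention (p log p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical response probabilities for one prompt x (illustrative values).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(predictive_entropy(confident))
print(predictive_entropy(uncertain))  # equals log(4) for a uniform distribution
```

A distribution concentrated on one response yields a lower entropy than a uniform one, which is the sense in which entropy serves as an inconsistency indicator here.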

3.2 Black Box Evaluation

For black box evaluation, the evaluator has only the text-level responses Wang et al. (2023), so the evaluation typically requires a more nuanced understanding of the model’s output stability and of the potential response structure from limited Q-A samples. Assume there are $n$ response samples to the same question $q$. The common practice is to construct a matrix $\mathcal{S}$ that encapsulates the similarity information among the responses:

$$\mathcal{S}=\begin{bmatrix}1&s_{12}&s_{13}&\cdots&s_{1n}\\ s_{21}&1&s_{23}&\cdots&s_{2n}\\ s_{31}&s_{32}&1&\cdots&s_{3n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ s_{n1}&s_{n2}&s_{n3}&\cdots&1\end{bmatrix} \tag{1}$$

where each entry $s_{ij}$, $i,j\in\{1,\ldots,n\}$, is the calculated pairwise similarity score.

Given a list of responses $R=\{r_{1},r_{2},\ldots,r_{n}\}$, the pairwise similarity score $s_{ij}$ between responses $r_{i}$ and $r_{j}$ can be calculated using a general similarity function: $s_{ij}=\text{sim}(r_{i},r_{j})$. Current explorations of the matrix $\mathcal{S}$ are mainly based on symmetric similarity measures such as Jaccard similarity or word-vector similarity, implying $s_{ij}=s_{ji}$ in $\mathcal{S}$. This condition permits the use of the Normalized Laplacian to understand the hidden structure in the response space Lin et al. (2023):

$$L:=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}} \tag{2}$$

where the weighted adjacency matrix $W$ is given by the symmetric similarity matrix $\mathcal{S}$, and the degree matrix is:

$$D_{r_{i},r_{j}}=\begin{cases}\sum_{j^{\prime}\in[n]}w_{i,j^{\prime}}&(r_{i}=r_{j})\\ 0&(r_{i}\neq r_{j})\end{cases} \tag{3}$$

where the diagonal element $D_{r_{i},r_{j}}$ ($r_{i}=r_{j}$) is the degree of the node $r_{i}$, i.e., the sum of the weights (similarities $s_{i,j^{\prime}}$) of all edges connected to it, with $j^{\prime}$ ranging over the $n$ connected responses.

One can then compute the eigenvalues of the constructed Symmetric Laplacian, which represent the connectivity of the graph, and use them as an indicator of uncertainty: $U_{\text{EigV}}=\sum_{k=1}^{n}\max(0,1-\lambda_{k})$, where $\lambda_{k}$ is the $k$-th eigenvalue of the Laplacian $L$.
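As a minimal sketch of this symmetric baseline, the NumPy code below builds the Normalized Laplacian of Eq. 2 and aggregates its eigenvalues into $U_{\text{EigV}}$. The function name `u_eigv_symmetric` and the two toy similarity matrices are hypothetical, and the sketch assumes every node has a strictly positive degree.

```python
import numpy as np

def u_eigv_symmetric(W):
    """U_EigV = sum_k max(0, 1 - lambda_k) for L = I - D^{-1/2} W D^{-1/2}."""
    W = np.asarray(W, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("W must be symmetric for the normalized Laplacian")
    degrees = W.sum(axis=1)                     # D_ii = sum_j w_ij (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)             # real eigenvalues, since L is symmetric
    return float(np.sum(np.maximum(0.0, 1.0 - eigvals)))

# Hypothetical similarity matrices for n = 3 responses (illustrative values).
consistent = np.ones((3, 3))   # all responses judged identical
disjoint = np.eye(3)           # no response resembles any other

print(u_eigv_symmetric(consistent))  # close to 1
print(u_eigv_symmetric(disjoint))    # close to 3
```

In these two extreme toy cases the measure equals the number of disconnected groups of responses, which gives some intuition for why larger values indicate more dispersed, and hence more uncertain, response sets.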

3.3 Discussion

Since the white box evaluation places a strict requirement on the original model, we analyze from the perspective of the black box evaluation in this paper.

First, current black box evaluation assumes that $s_{ij}=s_{ji}$. In actual knowledge representation logic, however, this neglects the directional information between two responses: if proposition $A$ entails proposition $B$ (denoted $A\vdash B$), then whenever $A$ is true, $B$ must also be true. This is a one-way relationship. Importantly, it is not necessarily symmetric; that is, $A\vdash B$ does not imply $B\vdash A$. Constructing a symmetric matrix therefore breaks this rule and loses directional information from the original response set. In this paper, we propose to reconstruct the response relationship as a directed graph and provide a Random Walk Laplacian uncertainty evaluation method that better fits the asymmetric property of the constructed graph.

Second, for a response set with long answers in which more than one response makes the same claim, the similarity is easily miscalculated from either the semantic or the knowledge-claim aspect. For example, to the question ‘How many students became heroes?’, the language model $\mathcal{M}$ gives two answers, A: ‘Andrew Willis, Chris Willis, Reece Galea’ and B: ‘Three students became heroes’. According to the context, answer A is partially correct because it names the correct persons; however, the similarity between A and B is near 0 whether computed from entailment or from Jaccard similarity. This motivates our proposal to apply claim-based augmentation before uncertainty evaluation, so as to recover the intended meaning of each response.

4 Uncertainty Evaluation within Directed Entailment Graph: D-UE

In this section, we discuss how to formally model the logical direction information Kripke (1959); Dagan and Glickman (2004) in responses with different entailment probabilities, and how claim-based response augmentation helps mine potential semantic information. We then provide a way to integrate our method with the uncertainty derived from a plain semantic similarity matrix, which allows our method to be layered on any existing method that overlooks directional entailment information. The overall framework of D-UE, compared with traditional UQ based on a symmetric measure, is shown in Fig. 2.

4.1 Directional Entailment Graph

To preserve the directional entailment information in a response set $R$, we adopt an NLI (Natural Language Inference) model to provide pairwise entailment measurements over $R$ Williams et al. (2017); Bowman et al. (2015). Following Kuhn et al. (2023), the employed NLI model (footnote 2: off-the-shelf DeBERTa-large model) takes two text elements $r_{i}$ and $r_{j}$ and returns a three-element tuple:

$$[\textit{logit}_{cont},\textit{logit}_{neut},\textit{logit}_{ent}]=\overrightarrow{\textit{NLI}}(r_{i},r_{j}) \tag{4}$$

The output is transformed into probabilities through:

$$\textbf{p}=\textit{Softmax}(\textit{logit}_{cont},\textit{logit}_{neut},\textit{logit}_{ent}) \tag{5}$$

where $\overrightarrow{p_{ent}}(r_{i},r_{j})=p(r_{i}\vdash r_{j})=\textbf{p}_{3}$ is the entailment probability of $r_{i}\vdash r_{j}$. At this point, an asymmetric entailment matrix $\mathcal{S}$ has been derived for constructing the directional graph $\mathcal{G}_{d}=(V,\overrightarrow{E})$. Here $V=R$ is the set of responses with $|V|=n$, and $\overrightarrow{E}$ is the set of directed edges weighted primarily by the entailment probabilities.

$$\overrightarrow{E}=\{(v_{i},v_{j})\mid\forall i,j,\ \text{weight: }p(r_{i}\vdash r_{j})\} \tag{6}$$

Thus, the adjacency matrix $A$ of the directed graph $\mathcal{G}_{d}$ can be defined as $A_{ij}=p(r_{i}\vdash r_{j})$, where $A_{ij}$ represents the weight of the directed edge from vertex $v_{i}$ to vertex $v_{j}$ (each vertex corresponds to a response, so the two terms are used interchangeably in later sections). $A_{ij}$ does not necessarily equal $A_{ji}$ unless the two responses are identical. The constructed $\mathcal{G}_{d}$ encodes directional semantic logic, unlike the semantic similarity (sem) that averages the two entailment directions, $A_{ij,\textit{sem}}=A_{ji,\textit{sem}}=\frac{p(r_{i}\vdash r_{j})+p(r_{j}\vdash r_{i})}{2}$, and thereby loses part of the information.
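The construction in Eqs. 4 to 6 can be sketched as follows. To keep the example self-contained, it starts from a hypothetical array of NLI logits instead of calling an actual NLI model; the function name `entailment_adjacency`, the logit values, and the assumption that the last axis is ordered [contradiction, neutral, entailment] are illustrative choices, not the paper's implementation.

```python
import numpy as np

def entailment_adjacency(logits):
    """Map NLI logits of shape (n, n, 3) to A with A[i, j] = p(r_i entails r_j).

    The last axis is assumed to be ordered [contradiction, neutral, entailment],
    matching the tuple in Eq. 4.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return probs[..., 2]                                     # third component: entailment

# Hypothetical logits for n = 2 responses; in practice they would come from an NLI model.
logits = np.array([
    [[-4.0, -2.0, 5.0], [-3.0, 0.0, 3.0]],    # r_1 -> r_1, r_1 -> r_2
    [[-1.0, 2.0, -2.0], [-4.0, -2.0, 5.0]],   # r_2 -> r_1, r_2 -> r_2
])

A = entailment_adjacency(logits)   # directed: A[0, 1] differs from A[1, 0]
A_sem = (A + A.T) / 2.0            # symmetric average of the two directions
print(A)
print(A_sem)
```

In this toy case the entailment from $r_{1}$ to $r_{2}$ is strong while the reverse is weak; the averaged matrix `A_sem` assigns both directions the same intermediate weight, which is the information loss described above.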

4.2 Enhance the Graph with Text Similarity

Based on the constructed directed entailment graph, it is feasible to incorporate text similarity to enrich the information in the graph. We consider another matrix, the text similarity matrix $\mathcal{T}$, of the same size $n\times n$ as $\mathcal{S}$, and enrich the information carried by the edges between nodes with jointly weighted values from both the entailment matrix and the text similarity matrix.

Let $\mathcal{S}=[s_{ij}]$ and $\mathcal{T}=[t_{ij}]$ represent the entailment and text similarity matrices, respectively. We define the weight of the edge from node $i$ to node $j$ in the graph $\mathcal{G}_{d}$ as $w_{ij}=s_{ij}+t_{ij}$. The adjacency matrix entries $A_{ij}$ of the graph $\mathcal{G}_{d}$ can then be updated with the weights $w_{ij}$.

Note that the text similarity in $\mathcal{T}$ can be measured in multiple ways, such as TF-IDF Aizawa (2003), cosine similarity, or word embeddings. In this paper, since we are comparing responses to the same question, we provide an implementation with Jaccard similarity based on set operations:

$$J(r_{i},r_{j})=\frac{|r_{i}\cap r_{j}|}{|r_{i}\cup r_{j}|} \tag{7}$$

where $r_{i}$ and $r_{j}$ are response sentences, each treated as a set of the phrases and words it contains.
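A minimal sketch of Eq. 7 and of the edge weights $w_{ij}=s_{ij}+t_{ij}$ is given below. The helper names `jaccard` and `enrich_with_text_similarity`, the example responses, and the entailment values are hypothetical; the sketch also assumes lowercase whitespace tokenization to turn a response into a set, which the text does not prescribe.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two responses."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def enrich_with_text_similarity(S, responses):
    """Return W with w_ij = s_ij + t_ij, where T is the Jaccard matrix."""
    n = len(responses)
    T = np.array([[jaccard(responses[i], responses[j]) for j in range(n)]
                  for i in range(n)])
    return np.asarray(S, dtype=float) + T

# Hypothetical responses and entailment matrix S (illustrative values).
responses = ["paris is the capital", "the capital is paris", "it is lyon"]
S = np.array([[1.0, 0.9, 0.1],
              [0.8, 1.0, 0.1],
              [0.0, 0.1, 1.0]])

W = enrich_with_text_similarity(S, responses)
print(W)
```

Because $\mathcal{T}$ is symmetric, adding it leaves the asymmetry of $\mathcal{S}$ intact: `W[0, 1]` and `W[1, 0]` still differ. Under this tokenization, the two answers from the example in Section 3.3 share no words, so their Jaccard similarity is 0.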

4.3 Random Walk Laplacian

For a directed graph $\mathcal{G}_{d}$, the connectivity of the nodes (responses) reflects the potential semantic clusters in the response set $R$. We can analyze the graph’s characteristics by constructing a Laplacian and deriving its eigenvalues, which reflect the dispersion of the nodes. In the given scenario, this dispersion reveals the uncertainty of the black box model that generated the response set for a given question.

However, $\mathcal{G}_{d}$ is special in its asymmetry, so the Normalized Laplacian, the Symmetric Graph Laplacian, and similar operators are no longer suitable, since they require a symmetric matrix. We propose to employ the Random Walk Laplacian, which focuses on the out-degree of nodes, to handle this directional, asymmetric setting. The out-degree matrix is calculated as $\mathbf{D}_{\text{out}}=\text{diag}(d_{\text{out},1},d_{\text{out},2},\ldots,d_{\text{out},n})$, where $d_{\text{out},i}=\sum_{j=1}^{n}a_{ij}$ is the out-degree of node $r_{i}$, and $a_{ij}$ is the entry of the adjacency matrix $A$ carrying the weight from $r_{i}$ to $r_{j}$.

Then, the inverse of the out-degree matrix is computed as $\mathbf{D}_{\text{out}}^{-1} = (\mathbf{D}_{\text{out}} + \epsilon\mathbf{I})^{-1}$, where $\mathbf{I}$ is the identity matrix and $\epsilon$ is a small positive constant that avoids division by zero. The random walk Laplacian matrix $\mathbf{L}_{\text{rw}}$ is then defined as:

$$\mathbf{L}_{\text{rw}} = \mathbf{I} - \mathbf{D}_{\text{out}}^{-1}\mathbf{A} \tag{8}$$

We compute the eigenvalues $\lambda_k$ of the random walk Laplacian matrix $\mathbf{L}_{\text{rw}}$, where $k = 1, 2, \ldots, n$. For details, please refer to Appendix A. The uncertainty measure $\mathbf{U}_{\text{EigV}}^{d}$ is then computed by

$$\mathbf{U}_{\text{EigV}}^{d} = \sum_{k=1}^{n} \max(0, 1 - \lambda_k) \tag{9}$$

This measure captures the extent to which the eigenvalues $\lambda_k$ fall below 1, providing a representation of the uncertainty in the language model’s responses. Note that for each question $q$ with response set $R_q$, where $r_i \in R_q$ and $|R_q| = n$, our method derives one aggregated uncertainty value by Eq. 9.
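As a minimal sketch of Eq. 8 and Eq. 9, the snippet below computes the directional eigenvalue uncertainty from a weighted adjacency matrix with NumPy. The function name `directional_eigv_uncertainty`, the default value of `eps`, and the choice to keep only the real part of possibly complex eigenvalues are assumptions made for illustration; the paper defers the details to Appendix A.

```python
import numpy as np

def directional_eigv_uncertainty(A, eps=1e-8):
    """Sketch of Eq. 8 and Eq. 9; A[i, j] is the weight from r_i to r_j."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d_out = A.sum(axis=1)                      # out-degree of each node (row sums)
    D_inv = np.diag(1.0 / (d_out + eps))       # (D_out + eps * I)^{-1}
    L_rw = np.eye(n) - D_inv @ A               # Eq. 8
    eigvals = np.linalg.eigvals(L_rw)          # can be complex when A is asymmetric
    lam = eigvals.real                         # assumption: use the real part only
    return float(np.sum(np.maximum(0.0, 1.0 - lam)))  # Eq. 9
```

Under this sketch, a response set in which every response fully entails every other one yields a value close to 1, while a set in which each response only entails itself yields a value close to $n$.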

It is important to apply the Random Walk Laplacian to the directed graph $\mathcal{G}_d$ because, in a directed graph, the probability of transitioning from node $i$ to node $j$ is defined by $P_{i \rightarrow j} = \frac{A_{i \rightarrow j}}{\sum_{k} A_{i \rightarrow k}}$, where $A_{i \rightarrow j}$ is the weight in the adjacency matrix and $k$ ranges over the nodes accessible from node $i$. If two responses satisfy $entail(r_i \vdash r_j) \neq entail(r_j \vdash r_i)$, then the two directions generally have different transition probabilities, which makes a difference in profiling the characteristics of the response set:

$$A_{i \to j} \neq A_{j \to i} \implies P_{i \to j} \neq P_{j \to i} \tag{10}$$

This structure captures the asymmetry in the entailment probabilities between the two directions of a response pair.
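The transition probabilities above can be illustrated with a short NumPy sketch. The helper name `transition_matrix` and the example weights are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def transition_matrix(A, eps=1e-8):
    """Row-normalize A so that P[i, j] = A[i, j] / sum_k A[i, k]."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    # eps mirrors the regularized inverse of the out-degree matrix.
    return A / (row_sums + eps)

# Hypothetical entailment weights: r_0 strongly entails r_1, but not the reverse.
A_example = np.array([[1.0, 0.9],
                      [0.1, 1.0]])
P_example = transition_matrix(A_example)
```

In this example the walk moves from `r_0` to `r_1` far more readily than in the opposite direction, which is the directional information that a symmetrized graph would discard.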

4.4 Integrate Directional Entailment Uncertainty with Semantics Uncertainty

The uncertainty derived in Eq. 9 represents the uncertainty from directional entailment probability and text inconsistency, as introduced in Section 4.2. Since there exist multiple solutions for semantic uncertainty measurement Lin et al. (2023); Kuhn et al. (2023), we propose a way to integrate $\mathbf{U}_{\text{EigV}}^{d}$ from D-UE with another semantic uncertainty $\mathbf{U}^{s}$, and thus obtain a multi-angle evaluation on limited response sets.

The simplest way is to directly aggregate $\mathbf{U}_{\text{EigV}}^{d,i}$ and $\mathbf{U}^{s,i}$ on the same response set $R_q^i$. However, when there are multiple response sets $R_q^i \in \mathcal{R}$, $i \in \{1, 2, \ldots, h\}$, direct aggregation cannot guarantee that a change in the ordering is caused by the contribution of the uncertainties: because $\mathbf{U}_{\text{EigV}}^{d,i}$ and $\mathbf{U}^{s,i}$ are measured on different scales (some lie in [0, 1] and some are unbounded), an order change after computing $\mathbf{U}_{\text{EigV}}^{d,i} + \mathbf{U}^{s,i}$ is probably caused by the difference in absolute value ranges.
Thus, instead of working on a single $R_q^i$, we focus on the whole response space $\mathcal{R}$ that contains the response sets of multiple questions, yielding $\mathcal{U}_{\text{EigV}}^{d}$ and $\mathcal{U}^{s}$ with $|\mathcal{U}_{\text{EigV}}^{d}| = |\mathcal{U}^{s}| = h$, meaning that there are $h$ uncertainty values for $h$ question-related response sets. We normalize the distributions of the two measurements by the $Z$-score, $\text{Normalized}(X) = \frac{X - \mu_X}{\sigma_X}$. Replacing $X$ with $\mathcal{U}_{\text{EigV}}^{d}$ and $\mathcal{U}^{s}$, we get $\hat{\mathcal{U}}_{\text{EigV}}^{d}$ and $\hat{\mathcal{U}}^{s}$. Then we can derive $\hat{\mathcal{U}} = \hat{\mathcal{U}}_{\text{EigV}}^{d}/2 + \hat{\mathcal{U}}^{s}/2$, which contains both semantic and directional uncertainty, so that an order change in $\hat{\mathcal{U}}$ is contributed by the semantic uncertainty from $\mathcal{U}^{s}$ rather than by a difference in scale.
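The normalization and aggregation step can be sketched as follows. The function names `zscore` and `combine_uncertainties` are hypothetical, and mapping a constant vector to zeros is an assumption added only to avoid division by zero.

```python
import numpy as np

def zscore(x):
    """Z-score normalization: (X - mean) / std over all h response sets."""
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    if sigma == 0.0:
        return np.zeros_like(x)  # assumption: a constant vector carries no ordering
    return (x - x.mean()) / sigma

def combine_uncertainties(u_directional, u_semantic):
    """Equal-weight average of the two normalized uncertainty vectors."""
    u_d = np.asarray(u_directional, dtype=float)
    u_s = np.asarray(u_semantic, dtype=float)
    if u_d.shape != u_s.shape:
        raise ValueError("both vectors must hold one value per response set")
    return 0.5 * zscore(u_d) + 0.5 * zscore(u_s)
```

Because each vector is normalized before averaging, rescaling one of the measures (for example, multiplying an unbounded measure by a constant) does not change the combined ranking.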

5 Claim Based Response Augmentation

Sometimes, the raw responses from the language model $\mathcal{M}$ cannot fully reveal its awareness of a problem because they pack multiple claim points into short descriptions. In the example responses in Section 3.3, ‘three students became heros’ and ‘Andrew Willis, Chris Willis, Reece Gelea’ are a pair of responses that share the same underlying meaning: ‘Andrew Willis, Chris Willis, and Reece Gelea are three students who became heros’. The direct use of raw responses like these lowers ($\downarrow$) the True Positive rate and raises ($\uparrow$) the False Negative rate, leading to a biased evaluation.

In this section, inspired by Choi and Ferrara (2024), we propose to augment raw responses at the claim level, aiming to identify the potentially correct claims hidden in incomplete or vague responses. It is worth noting that we do not conduct fact-checking here; instead, we rely on the context information to provide claim augmentation, so the task is easier and can feasibly be accomplished by other pre-trained LLMs. Specifically, the task can be formalized as:

Given a question $q$ and a response set $R = \{r_1, r_2, \ldots, r_n\}$, for each $r_i \in R$ that contains $k$ claims $c_1, c_2, \ldots, c_k$, augment each claim $c_k \rightarrow c_k^{aug}$ and derive $r_i^{aug}$ with a more explicit and comprehensive description.

To realize this, the key is to first identify the claim atoms in a response $r_i$. This step only requires a basic ability to understand context; we verified that Llama-3 is adequate for this task and used it for claim extraction. Next, the extracted claims are extended by recalling the question, which helps to align the claim descriptions with the question. Finally, the augmented claims are combined into a more comprehensive answer $r_i^{aug}$ as follows:

$$r_i^{aug} = \textit{Augmentor}(c_1, c_2, \ldots, c_k)_{c_k \leftarrow r_i} \tag{11}$$

where $\leftarrow$ indicates that claim $c_k$ originates from $r_i$. The response-level Augmentor conducts two steps. First, it extends each claim with an Extender that takes the current claim $c_k$ and the question $q$ as input:

$$c_k^{aug} = \textit{Extender}(c_k, q) \tag{12}$$

The task in Eq. 12 is simple because it generates a sequence based on existing input content, so it can be fulfilled by general language models such as Llama-3 (https://github.com/meta-llama/llama3; used in this paper) with the necessary prompt. Second, it concatenates ($\oplus$) all of the augmented claims to form $r_i^{aug}$:

$$r_i^{aug} = \oplus\{c_1^{aug}, c_2^{aug}, \ldots, c_k^{aug}\} \tag{13}$$

The final evaluation set $R^{aug}$ is obtained by traversing all $r_i \in R$.

It is worth noting that an original low-quality response $r^{\times}$ should be left unchanged by the augmentation process, so that the original error generated by $\mathcal{M}$ is preserved as part of the evaluation evidence. We collect these responses by regular expression, as implemented in the code (the code will be released after publication and is available upon request for now).

In this paper, the augmentation follows the above procedure, and D-UE takes $R^{aug}$ as the final input for uncertainty evaluation.
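The procedure in Eq. 11 to Eq. 13 can be sketched as below. All names here are hypothetical: `extract_claims` and `extend_claim` stand in for the LLM-backed claim extraction and the Extender, `is_low_quality` stands in for the regular-expression filter (whose pattern is not given in the text), and joining claims with a space is one assumed reading of the concatenation $\oplus$.

```python
from typing import Callable, List

def augment_response(
    response: str,
    question: str,
    extract_claims: Callable[[str], List[str]],
    extend_claim: Callable[[str, str], str],
    is_low_quality: Callable[[str], bool] = lambda r: False,
) -> str:
    """Sketch of the Augmentor: extract claims, extend each, concatenate."""
    if is_low_quality(response):
        return response                        # keep the original error as evidence
    claims = extract_claims(response)          # claim atoms c_1, ..., c_k
    if not claims:
        return response                        # assumption: nothing to augment
    augmented = [extend_claim(c, question) for c in claims]  # Eq. 12
    return " ".join(augmented)                 # Eq. 13, with a space as the joiner

def augment_set(responses, question, extract_claims, extend_claim,
                is_low_quality=lambda r: False):
    """Traverse all r_i in R to build R^aug."""
    return [
        augment_response(r, question, extract_claims, extend_claim, is_low_quality)
        for r in responses
    ]
```

Passing the extraction and extension steps as callables keeps the sketch independent of any particular model or prompt.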

6 Experiment

In this section, we design experiments to empirically demonstrate the effectiveness of our proposal in uncertainty evaluation for LLMs. Unless otherwise stated, D-UE denotes the directional entailment uncertainty on augmented response sets. We intend to investigate the following research questions:

RQ1: Can D-UE improve the uncertainty evaluation when layered on existing methods that do not consider directional entailment logic?

RQ2: Does claim-level augmentation help produce a more robust evaluation?

RQ3: How does each module of our proposed method contribute to the final uncertainty quantification? An ablation study for D-UE.

Figure 3: Comparison between D-UE and the baseline methods on AUARC. We aggregate the directional entailment uncertainty from D-UE with each of the semantic measures; the evaluation improves on the Coqa dataset.

6.1 Experiment Setups

In this paper, we used Llama3-8b for simple tasks such as the claim extraction introduced in Section 5 and the question-based atom-claim augmentation in Eq. 12. Each experiment that uses the NLI model applies a calibrated temperature of 3. All experiments were run on Ubuntu with a 13th Gen Intel(R) Core(TM) i9-13900KF CPU and an NVIDIA GeForce RTX 4090 GPU.
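As an illustration of what a temperature of 3 does to NLI outputs, the sketch below applies a temperature-scaled softmax to a vector of class logits. That the temperature divides the logits before the softmax is an assumption here (it is the usual form of temperature scaling), and the function name is hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=3.0):
    """Softmax over the last axis after dividing the logits by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

A temperature above 1 flattens the class probabilities, so entailment weights in the graph become less extreme than with an unscaled softmax.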

Datasets

We adopt two classic QA datasets, Coqa Reddy et al. (2019) (7,983 questions) and TriviaQA Joshi et al. (2017) (9,960 questions), and an additional long-form question answering dataset, NLQuAD Soleimani et al. (2021) (3,024 questions), which is more challenging and includes more claims in one response.

Evaluation Metrics and Process

As discussed in Lin et al. (2023), the commonly adopted AUROC has the limitation that it is very sensitive to imbalanced scenarios (it is likely to provide an over-optimistic evaluation). The Area Under the Accuracy-Rejection Curve (AUARC) is an alternative metric that can better reflect the evaluation performance; its calculation is shown in Appendix B. We use the two metrics as complementary evaluation indicators.
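For intuition, AUARC can be approximated with a simple discrete estimate: questions are sorted from least to most uncertain, the accuracy of the retained subset is computed at every rejection level, and these accuracies are averaged. This is a common approximation written under that assumption, not necessarily the exact calculation in Appendix B; the function name `auarc` is hypothetical.

```python
import numpy as np

def auarc(uncertainty, correctness):
    """Discrete estimate of the area under the accuracy-rejection curve."""
    u = np.asarray(uncertainty, dtype=float)
    c = np.asarray(correctness, dtype=float)
    if u.shape != c.shape or u.ndim != 1 or u.size == 0:
        raise ValueError("expected two non-empty 1-D arrays of equal length")
    order = np.argsort(u, kind="stable")       # most confident questions first
    kept = np.arange(1, u.size + 1)            # number of retained questions
    kept_accuracy = np.cumsum(c[order]) / kept # accuracy when keeping the top-k
    return float(kept_accuracy.mean())
```

An uncertainty measure that ranks incorrect answers as more uncertain obtains a higher value, because the rejected questions are the ones most likely to be wrong.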

| Measure | Details |
| --- | --- |
| $U_{\textit{LexiSim}}$ | Lexical similarity, measured by the average RougeL. |
| $U_{\textit{NumSet}}$ | The multiplicity of the zero eigenvalue, which coincides with the number of semantic sets. |
| $U_{\textit{SE}}$ | Semantic entropy, i.e., the entropy over semantic sets. |
| $U_{\textit{Eigv}}(Dis)$ | Spectral eigenvalue on the disagreement. |
| $U_{\textit{Ecc}}(Dis)$ | Average distance from the center in the responses’ disagreement. |
| $U_{\textit{Degree}}(Dis)$ | Degree of the disagreement matrix. |
| $U_{\textit{Eigv}}(Agre)$ | Spectral eigenvalue on the agreement. |
| $U_{\textit{Ecc}}(Agre)$ | Average distance from the center in the responses’ agreement. |
| $U_{\textit{Degree}}(Agre)$ | Degree matrix of the agreement matrix. |
| $U_{\textit{Eigv}}(Jacc)$ | Spectral eigenvalue on the Jaccard similarity. |
| $U_{\textit{Ecc}}(Jacc)$ | Average distance from the center in the responses’ Jaccard measure. |
| $U_{\textit{Degree}}(Jacc)$ | Degree matrix of the Jaccard similarity. |

Table 1: The baseline methods and explanations

To evaluate the performance of an ‘evaluator’, we first need to know the correctness of the responses to each question, and then evaluate how well the evaluator’s output uncertainty reflects that correctness (for a given question, the more uncertain the model is, the more likely it is to make mistakes and achieve low accuracy). In this paper, we adopt GPT3.5-turbo to produce a correctness score from 0 to 1; for details, please refer to Appendix C.

Baseline methods

We compare 12 baseline methods, including $U_{\textit{LexiSim}}$, $U_{\textit{NumSet}}$, and $U_{\textit{SE}}$ Kuhn et al. (2023), as well as the eigenvalue-based, eccentricity-based, and degree-based methods over three characteristics used in their similarity matrix construction: disagreement, agreement, and Jaccard Lin et al. (2023). A detailed explanation is included in Table 1.

6.2 Experiment Result and Analysis

Figure 4: Comparison between D-UE and the baseline methods on AUROC. We aggregate the directional entailment uncertainty from D-UE with each of the semantic measures, and the evaluation consistently improves on the Coqa dataset.

In this section, we discuss each of the research questions and analyze the performance of the proposed methods.

RQ1: We conducted experiments on three datasets across 12 baseline methods, and verified that the implementation of D-UE + semantics uncertainty performs consistently better than most of the baseline methods. As shown in Fig. 4, each bar represents the AUROC (the area under the ROC curve). For each $x$ label, e.g., NumSet, the blue bar is the baseline method’s performance and the pink bar shows the performance of the same baseline semantic uncertainty measure enhanced with directional entailment.

The performance evaluated by AUARC is shown side by side in Fig. 3. The left panel shows the baseline methods’ performance. In the right panel, our method D-UE improves the performance of all of the methods, which suggests that directional logic information is neglected by previous methods and can be further mined by D-UE. Some methods, such as numset, lexical_sim, and semanticEntropy, improve substantially because they consider only semantic similarity in their computation. On the other hand, methods such as eigv(Agre) and the degree-based ones show smaller improvements because they also rely on the graph structure to detect connectivity, and may already account for the degree in the UQ process. The major difference is that D-UE formally defines a directed graph and applies the Random Walk Laplacian, with reasonable theoretical support, which relaxes the symmetry requirement and can thus be used with more flexibility. We also provide another set of experiments conducted on GPT3.5’s responses on the Coqa dataset; as shown in Table 2, our method outperforms all of the baseline methods.

Coqa (GPT3.5)

| Baselines | AUARC (Previous) | AUARC (Ours) | AUROC (Previous) | AUROC (Ours) |
| --- | --- | --- | --- | --- |
| NumSet | 0.4250 | 0.5605 | 0.5095 | 0.6660 |
| LexiSim | 0.5001 | 0.5467 | 0.6042 | 0.6471 |
| Eigv(Dis) | 0.5271 | 0.5574 | 0.6652 | 0.6733 |
| Ecc(Dis) | 0.4837 | 0.5603 | 0.5736 | 0.6675 |
| Degree(Dis) | 0.5320 | 0.5579 | 0.6654 | 0.6736 |
| Eigv(Agre) | 0.5355 | 0.5626 | 0.6769 | 0.6807 |
| Ecc(Agre) | 0.5295 | 0.5620 | 0.6669 | 0.6766 |
| Degree(Agre) | 0.5367 | 0.5615 | 0.6764 | 0.6800 |
| Eigv(Jacc) | 0.5179 | 0.5579 | 0.6463 | 0.6692 |
| Ecc(Jacc) | 0.5173 | 0.5550 | 0.6544 | 0.6694 |
| Degree(Jacc) | 0.5252 | 0.5560 | 0.6535 | 0.6693 |

Table 2: Performance for Coqa under GPT3.5

RQ2: To understand how claim-level augmentation helps to better reveal the potential relationships between responses, we conducted a case study on an example response set generated by Llama2-13b.

[Uncaptioned image]

As shown in this example, which is a set of responses produced by Llama2-13b on the Coqa dataset, language models may generate abbreviated or vague responses, such as 3. ‘These three’ and 5. ‘Three high’, depending on their quality and stability. Even though these responses cover the key idea of the true answer, the incompleteness of the sentences makes uncertainty evaluation challenging, especially for black-box evaluation, which can only build upon the consistency among the responses. It is hard to identify the relationship between sentences if the claim or meaning is not stated thoroughly. The blue text shows the augmented results produced by our designed Extender (Eq. 12); the detailed prompt is presented in the Appendix. For sentences 3 and 5, the claims are completed with the intention of the question, so their meaning reflects the consistency more easily. Fig. 5 shows heatmaps of the directional entailment probability $P(X \vdash Y)$. On the left side, the upper part is $P_{R^{raw}}(X \vdash Y)$ from the original response set $R^{raw}$, and the lower part is $P_{R^{aug}}(X \vdash Y)$ after augmentation.

Figure 5: The entailment probability map

Comparing the original and augmented probability maps, we observe that the probability that the incomplete sentences entail other responses is very low, even when these sentences include the correct answer. After augmentation, the probability increases (e.g., for 3 and 5), indicating that the potential relationship has been discovered. The right side is the residual map, calculated by subtracting the probabilities without augmentation from those with augmentation; red indicates a stronger entailment relationship found after augmentation, and blue indicates that the original entailment probability is reduced.
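The residual map is an element-wise difference of the two entailment probability matrices, which can be sketched as follows. The function name and the example values are hypothetical and are not read from Fig. 5.

```python
import numpy as np

def residual_map(P_aug, P_raw):
    """Entailment probability with augmentation minus without augmentation."""
    P_aug = np.asarray(P_aug, dtype=float)
    P_raw = np.asarray(P_raw, dtype=float)
    if P_aug.shape != P_raw.shape:
        raise ValueError("both maps must have the same shape")
    return P_aug - P_raw

# Hypothetical 2x2 maps: entry [i, j] is the probability that response i entails j.
P_raw_example = np.array([[1.0, 0.1],
                          [0.7, 1.0]])
P_aug_example = np.array([[1.0, 0.8],
                          [0.6, 1.0]])
residual_example = residual_map(P_aug_example, P_raw_example)
```

Positive entries correspond to the red cells (stronger entailment after augmentation) and negative entries to the blue cells.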

RQ3: We conducted ablation experiments to understand the contributions of the directional entailment uncertainty measure and the claim-based augmentation. Due to the page limit, three representative baselines on two metrics are shown in Fig. 6. For the NumSet baseline, we observe that the basic NumSet uncertainty measurement is sensitive to the augmentation: the augmented version (purple bar) improves over the basic version (blue bar) on both AUROC and AUARC. Compared to the improvement brought by claim augmentation, however, the ‘Ours’ method (D-UE) makes a larger contribution to the overall evaluation performance. We attribute this to the directional entailment logic that is mined with the Random Walk Laplacian on the directed graph. We leave to future work the exploration of how to better combine the semantic uncertainty and the directional entailment-based uncertainty, and we believe there is still room for improvement along the proposed direction.

Figure 6: The comparison between baseline semantic uncertainty, D-UE +semantics (no-augmentation) and full D-UE +semantics

7 Conclusion

In this paper, we identified two challenges of existing uncertainty quantification methods for LLMs: the omission of directional logic in semantic meanings, and low-quality or vague response sets that make it difficult to uncover the actual correct answers. We proposed two solutions to tackle these challenges. (A) We formally define a directional entailment graph that encapsulates the directional logic, enhance it with text similarity, and then propose to apply the Random Walk Laplacian to find the eigenvalues that indicate the uncertainty in the response graph structure. (B) We propose a claim-based augmentation method that helps with understanding the ‘real’ faithfulness of a model’s responses. These two methods improve existing UQ methods and provide better insight into how trustworthy a model is. We hope this work raises other researchers’ interest in understanding the uncertainty in Large Language Models from another angle and in comprehending NLG trustworthiness.

8 Limitations

Even though this work proposes a directed graph and an augmentation method for LLM uncertainty quantification, the authors believe it is still important to explore further how to combine the semantic and directional logic uncertainty in a theoretically orthogonal way. This work was only able to test the evaluation tasks on Llama2-13b and ChatGPT 3.5; with the fast growth of the Large Language Model family, more models could be tested to understand the uncertainty of their responses to specific questions.

References

  • Abdar et al. (2021) Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, et al. 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information fusion, 76:243–297.
  • Aizawa (2003) Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1):45–65.
  • Balloccu et al. (2024) Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms. arXiv preprint arXiv:2402.03927.
  • Biever (2023) Celeste Biever. 2023. Chatgpt broke the turing test-the race is on for new ways to assess ai. Nature, 619(7971):686–689.
  • Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  • Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
  • Chen and Mueller (2023) Jiuhai Chen and Jonas Mueller. 2023. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness.
  • Choi and Ferrara (2024) Eun Cheol Choi and Emilio Ferrara. 2024. Fact-gpt: Fact-checking augmentation via claim matching with llms. In Companion Proceedings of the ACM on Web Conference 2024, pages 883–886.
  • Da et al. (2024a) Longchao Da, Minquan Gao, Hao Mei, and Hua Wei. 2024a. Prompt to transfer: Sim-to-real transfer for traffic signal control with prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 82–90.
  • Da et al. (2024b) Longchao Da, Kuanru Liou, Tiejin Chen, Xuesong Zhou, Xiangyong Luo, Yezhou Yang, and Hua Wei. 2024b. Open-ti: Open traffic intelligence with augmented language model. International Journal of Machine Learning and Cybernetics, pages 1–26.
  • Dagan and Glickman (2004) Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. Learning Methods for Text Understanding and Mining, 2004(26-29):2–5.
  • Huang et al. (2024) Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, et al. 2024. A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artificial Intelligence Review, 57(7):175.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. Preprint, arXiv:1705.03551.
  • Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
  • Kambhampati et al. (2024) Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. Llms can’t plan, but can help planning in llm-modulo frameworks. arXiv preprint arXiv:2402.01817.
  • Kripke (1959) Saul A Kripke. 1959. Distinguished constituents, semantical analysis of modal logic, and the problem of entailment. The Journal of Symbolic Logic, 24(4):312–326.
  • Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.
  • Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
  • Lin et al. (2023) Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
  • Liu et al. (2023) Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374.
  • Mielke et al. (2020) Sabrina J Mielke, Arthur Szlam, Y-Lan Boureau, and Emily Dinan. 2020. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. arXiv preprint arXiv:2012.14983.
  • Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
  • Soleimani et al. (2021) Amir Soleimani, Christof Monz, and Marcel Worring. 2021. Nlquad: A non-factoid long question answering data set. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1245–1255.
  • Sun et al. (2024) Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561.
  • Sun et al. (2019) Lin Sun, Xiaoyu Zhang, Yuhua Qian, Jiucheng Xu, and Shiguang Zhang. 2019. Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Information Sciences, 502:18–41.
  • Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975.
  • Valmeekam et al. (2022) Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498.
  • Wang et al. (2024a) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024a. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345.
  • Wang et al. (2024b) Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, et al. 2024b. Enhancing visual-language modality alignment in large vision language models via self-improvement. arXiv preprint arXiv:2405.15973.
  • Wang et al. (2023) Yubo Wang, Xueguang Ma, and Wenhu Chen. 2023. Augmenting black-box llms with medical textbooks for clinical question answering. arXiv preprint arXiv:2309.02233.
  • Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  • Yang et al. (2023) Rui Yang, Ting Fang Tan, Wei Lu, Arun James Thirunavukarasu, Daniel Shu Wei Ting, and Nan Liu. 2023. Large language models in health care: Development, applications, and challenges. Health Care Science, 2(4):255–263.

Appendix A Solving the Eigenvalues of the Random Walk Laplacian

From the definition

$$L_{\text{rw}} = I - D_{\text{out}}^{-1} A,$$

finding the eigenvalues $\lambda$ and eigenvectors $\mathbf{v}$ of $L_{\text{rw}}$ amounts to solving:

$$
\begin{aligned}
L_{\text{rw}}\mathbf{v} &= \lambda\mathbf{v} \\
(I - D_{\text{out}}^{-1}A)\mathbf{v} &= \lambda\mathbf{v} \\
\mathbf{v} - D_{\text{out}}^{-1}A\mathbf{v} &= \lambda\mathbf{v} \\
\mathbf{v} &= \lambda\mathbf{v} + D_{\text{out}}^{-1}A\mathbf{v} \\
(I - \lambda I)\mathbf{v} &= D_{\text{out}}^{-1}A\mathbf{v} \\
D_{\text{out}}(I - \lambda I)\mathbf{v} &= A\mathbf{v} \\
(D_{\text{out}} - \lambda D_{\text{out}})\mathbf{v} &= A\mathbf{v}
\end{aligned}
$$

Given this, the eigenvalue problem of $L_{\text{rw}}$ can be expressed in a form involving only the matrices $A$ and $D_{\text{out}}$. Equivalently, the eigenvalues are the roots of the standard characteristic equation:

$$\det(L_{\text{rw}} - \lambda I) = 0$$

By applying the definition of $L_{\text{rw}}$:

$$\det(I - D_{\text{out}}^{-1}A - \lambda I) = 0$$

The eigenvalues $\lambda$ are then obtained as the roots of this equation.
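As a small worked illustration of the derivation above, consider a hypothetical two-response graph. The entries of $A$ are assumed values chosen for easy arithmetic (not taken from the experiments), and $D_{\text{out}}$ is assumed to hold the row sums of $A$ on its diagonal:

$$
A = \begin{pmatrix} 0.5 & 0.5 \\ 0.9 & 0.3 \end{pmatrix}, \qquad
D_{\text{out}} = \begin{pmatrix} 1.0 & 0 \\ 0 & 1.2 \end{pmatrix}, \qquad
D_{\text{out}}^{-1}A = \begin{pmatrix} 0.5 & 0.5 \\ 0.75 & 0.25 \end{pmatrix}
$$

$$
L_{\text{rw}} = I - D_{\text{out}}^{-1}A = \begin{pmatrix} 0.5 & -0.5 \\ -0.75 & 0.75 \end{pmatrix}
$$

$$
\det(L_{\text{rw}} - \lambda I) = (0.5 - \lambda)(0.75 - \lambda) - (-0.5)(-0.75) = \lambda^2 - 1.25\lambda = 0
\;\;\Rightarrow\;\; \lambda_1 = 0, \quad \lambda_2 = 1.25
$$

The eigenvalue $\lambda_1 = 0$ always appears, with eigenvector $(1, 1)^\top$, because each row of $D_{\text{out}}^{-1}A$ sums to 1. Since $A$ is asymmetric, $L_{\text{rw}}$ is generally not symmetric either, so for larger graphs the eigenvalues may be complex.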

Appendix B The Evaluation Metric

The AUARC (Area Under the Accuracy-Rejection Curve) is calculated by plotting the accuracy of accepted predictions against the rejection rate and then computing the area under this curve.

Let $s_i$ be the score of the $i$-th prediction, $a_i$ the accuracy of the $i$-th prediction (1 if correct, 0 if incorrect), and $n$ the total number of predictions. We first sort the scores and corresponding accuracies: $\{(s_i, a_i)\}_{i=1}^{n} \rightarrow \{(s_{(i)}, a_{(i)})\}_{i=1}^{n}$.

For each $i$ from 0 to $n$, the rejection rate $R_i$ is

$$R_i = \frac{i+1}{n},$$

and the accuracy of the accepted predictions $A_i$ is

$$A_i = \frac{\sum_{j=i+1}^{n} \mathbb{1}(a_{(j)} \geq \alpha)}{n-(i+1)},$$

where $\mathbb{1}(\cdot)$ is the indicator function and $\alpha$ is the threshold. The area under the curve (AUARC) is calculated by the trapezoidal rule:

$$
\begin{aligned}
\text{AUARC} &= \int_{0}^{1} A(R)\,dR \\
&\approx \sum_{i=0}^{n-1} \frac{A_i + A_{i+1}}{2}\,(R_{i+1} - R_i)
\end{aligned}
$$

where $A_i$ is the accuracy at the $i$-th step and $R_i$ is the rejection rate at the $i$-th step.
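The computation above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `auarc` is hypothetical, the inputs are uncertainty scores (higher means less trustworthy) with binary correctness labels, the most uncertain predictions are rejected first, and the curve is evaluated for 0 to n-1 rejections so that the accepted set is never empty (a slightly different indexing from the text).

```python
def auarc(uncertainty, correct):
    """Area under the accuracy-rejection curve via the trapezoidal rule.

    uncertainty: list of floats, higher = more uncertain (assumed convention).
    correct:     list of 0/1 labels, 1 if the prediction is correct.
    """
    n = len(uncertainty)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Sort so that the most uncertain predictions come first (rejected first).
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    sorted_correct = [correct[i] for i in order]

    rates, accs = [], []
    remaining_correct = sum(sorted_correct)
    for k in range(n):  # k = number of rejected predictions
        rates.append(k / n)                        # rejection rate
        accs.append(remaining_correct / (n - k))   # accuracy of accepted set
        remaining_correct -= sorted_correct[k]

    # Trapezoidal rule over consecutive points of the curve.
    area = 0.0
    for i in range(len(rates) - 1):
        area += 0.5 * (accs[i] + accs[i + 1]) * (rates[i + 1] - rates[i])
    return area


if __name__ == "__main__":
    # Toy data (hypothetical): the two wrong answers carry the highest uncertainty.
    print(auarc([0.9, 0.1, 0.5, 0.3], [0, 1, 0, 1]))
```

An uncertainty measure that ranks wrong answers as more uncertain yields a larger area than one that ranks them as less uncertain, which is the property the metric rewards.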

Appendix C Evaluation and Groundtruth Correctness

Following Lin et al. (2023), responses with a score > 0.7 are taken as correct answers. We apply human verification to the correctness judgments auto-generated by GPT3.5-turbo, and their accuracy is about 0.95. With the ground-truth correctness obtained by auto-evaluation, we evaluate each uncertainty quantification (UQ) method by either AUROC or AUARC, which measure how well the estimated uncertainty aligns with correctness. The area under the corresponding curve (the larger, the better) can be seen as the quality of a UQ method.
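The thresholding and AUROC steps described above could look like the following. This is a hedged sketch rather than the authors' pipeline: the helper names `correctness_labels` and `auroc` are hypothetical, ratings are assumed to lie in 0 to 100 and to be rescaled to [0, 1] before the 0.7 threshold is applied, and AUROC is computed by direct pairwise comparison (the probability that an incorrect answer receives higher uncertainty than a correct one).

```python
def correctness_labels(ratings, threshold=0.7, scale=100.0):
    """Map auto-evaluation ratings to binary correctness (1 = correct).

    Assumption: ratings are on a 0..scale range and are normalized first.
    """
    return [1 if r / scale > threshold else 0 for r in ratings]


def auroc(uncertainty, correct):
    """AUROC of uncertainty as a detector of incorrect answers.

    Counts, over all (incorrect, correct) pairs, how often the incorrect
    answer has the higher uncertainty; ties count as one half.
    """
    wrong = [u for u, c in zip(uncertainty, correct) if c == 0]
    right = [u for u, c in zip(uncertainty, correct) if c == 1]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")

    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))


if __name__ == "__main__":
    # Toy ratings and uncertainties (hypothetical values).
    labels = correctness_labels([95, 20, 80, 60])
    print(labels, auroc([0.1, 0.9, 0.3, 0.2], labels))
```

The pairwise form is quadratic in the number of responses, which is acceptable for a sketch; a rank-based implementation would be preferable on large evaluation sets.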

Appendix D More Experiment Results

This section reports additional experimental results on other datasets, together with pair-wise comparisons.

D.1 The results for other datasets from RQ1

As shown in Table 3 and Table 4, D-UE performs consistently better than most of the baseline methods. This indicates that our proposed solution is applicable to both white-box and black-box uncertainty evaluations that previously relied only on semantic information. The method is transferable: it can be layered on any existing method that lacks directional entailment.

| Baselines | Trivia AUARC (Previous) | Trivia AUARC (Ours) | Trivia AUROC (Previous) | Trivia AUROC (Ours) | NLQUAD AUARC (Previous) | NLQUAD AUARC (Ours) | NLQUAD AUROC (Previous) | NLQUAD AUROC (Ours) |
|---|---|---|---|---|---|---|---|---|
| NumSet | 0.7459 | 0.8362 | 0.8481 | 0.9422 | 0.3525 | 0.4675 | 0.5561 | 0.9123 |
| LexiSim | 0.7819 | 0.8324 | 0.8174 | 0.9369 | 0.5359 | 0.5277 | 0.8872 | 0.9346 |
| Eigv(Dis) | 0.8363 | 0.8457 | 0.9423 | 0.9593 | 0.4336 | 0.4862 | 0.8322 | 0.9669 |
| Ecc(Dis) | 0.8185 | 0.8368 | 0.9160 | 0.9441 | 0.3918 | 0.4521 | 0.6640 | 0.9410 |
| Degree(Dis) | 0.8456 | 0.8491 | 0.9614 | 0.9663 | 0.4527 | 0.4970 | 0.8518 | 0.9679 |
| Eigv(Agre) | 0.8452 | 0.8454 | 0.9606 | 0.9589 | 0.4495 | 0.4995 | 0.9674 | 0.9842 |
| Ecc(Agre) | 0.8401 | 0.8435 | 0.9523 | 0.9556 | 0.4807 | 0.5117 | 0.9728 | 0.9844 |
| Degree(Agre) | 0.8516 | 0.8488 | 0.9727 | 0.9654 | 0.4555 | 0.5001 | 0.9656 | 0.9816 |
| Eigv(Jacc) | 0.8326 | 0.8422 | 0.9390 | 0.9537 | 0.5268 | 0.5180 | 0.9490 | 0.9643 |
| Ecc(Jacc) | 0.8303 | 0.8399 | 0.9325 | 0.9487 | 0.4569 | 0.5082 | 0.9513 | 0.9784 |
| Degree(Jacc) | 0.8430 | 0.8455 | 0.9531 | 0.9583 | 0.5371 | 0.5247 | 0.9973 | 0.9835 |

Table 3: Performance comparison of different baselines on the Trivia and NLQUAD datasets using Llama2.

D.2 The results for pair-wise comparison on Coqa dataset

This subsection provides more detailed information on the pair-wise comparison between D-UE and traditional semantic uncertainty across the AUROC and AUARC metrics; see Fig. 7 and Fig. 8.

| Baselines | Coqa AUARC (Previous) | Coqa AUARC (Ours) | Coqa AUROC (Previous) | Coqa AUROC (Ours) |
|---|---|---|---|---|
| NumSet | 0.425 | 0.5605 | 0.5095 | 0.6660 |
| LexiSim | 0.5001 | 0.5467 | 0.6042 | 0.6471 |
| Eigv(Dis) | 0.5271 | 0.5574 | 0.6652 | 0.6733 |
| Ecc(Dis) | 0.4837 | 0.5603 | 0.5736 | 0.6675 |
| Degree(Dis) | 0.5320 | 0.5579 | 0.6654 | 0.6736 |
| Eigv(Agre) | 0.5355 | 0.5626 | 0.6769 | 0.6807 |
| Ecc(Agre) | 0.5295 | 0.5620 | 0.6669 | 0.6766 |
| Degree(Agre) | 0.5367 | 0.5615 | 0.6764 | 0.6800 |
| Eigv(Jacc) | 0.5179 | 0.5579 | 0.6463 | 0.6692 |
| Ecc(Jacc) | 0.5173 | 0.5550 | 0.6544 | 0.6694 |
| Degree(Jacc) | 0.5252 | 0.5560 | 0.6535 | 0.6693 |

Table 4: Performance on Coqa under GPT3.5.

Appendix E Prompt Design

In this section, we describe the details of the prompt design for the tasks that involve LLMs, to ensure the reproducibility of the work.

E.1 The Prompt used for Claim Extraction

First, we define the instructions as follows:

<<INST>><<SYS>> You are given a piece of text that includes knowledge claims. A claim is a statement that asserts something as true or false, which can be verified by humans. [Task] Your task is to accurately identify and extract every claim stated in the provided text. Then, resolve any coreference (pronouns or other referring expressions) in the claim for clarity. Each claim should be concise (less than 15 words) and self-contained. Your response MUST be a list of dictionaries. Each dictionary should contain the key "claim", which corresponds to the extracted claim (with all references resolved). You MUST only respond in the format as described below.

Then, the necessary constraints and format restrictions are applied (these may vary across LLM backbones; please adjust them based on empirical exploration):

[Response Format] ["claim": "Ensure that the claim is fewer than 15 words and conveys a complete idea. Resolve any coreference (pronouns or other referring expressions) in the claim for clarity." ,… ] [DO NOT] RESPOND WITH ANYTHING ELSE. ADDING ANY OTHER EXTRA NOTES THAT VIOLATE THE RESPONSE FORMAT IS BANNED. START YOUR RESPONSE WITH ’[’.

Additionally, examples are provided for few-shot learning at inference time:

[examples]: [text]: Tomas Berdych defeated Gael Monfis 6-1, 6-4 on Saturday. The sixth seed reaches the Monte Carlo Masters final for the first time. Berdych will face either Rafael Nadal or Novak Djokovic in the final. [response]: ["claim": "Tomas Berdych defeated Gael Monfis 6-1, 6-4", "claim": "Tomas Berdych defeated Gael Monfis 6-1, 6-4 on Saturday", "claim": "Tomas Berdych reaches Monte Carlo Masters final", "claim": "Tomas Berdych is the sixth-seed", "claim": "Tomas Berdych reaches Monte Carlo Masters final for the first time", "claim": "Berdych will face either Rafael Nadal or Novak Djokovic", "claim": "Berdych will face either Rafael Nadal or Novak Djokovic in the final"] [text]: Tinder only displays the last 34 photos - but users can easily see more. The firm also said it had improved its mutual friends feature. [response]: ["claim": "Tinder only displays the last photos", "claim": "Tinder only displays the last 34 photos", "claim": "Tinder users can easily see more photos", "claim": "Tinder said it had improved its feature", "claim": "Tinder said it had improved its mutual friends feature"]

Given the above prompt, we then request task completion to obtain the claims from a response (as described in the preparation step before Eq. 11):

Now complete the following: [text]: your input text [response]: [/INST]’
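Because the response format shown above (`["claim": "...", ...]`) is not strictly valid JSON, a tolerant parser may be useful when post-processing the model output. The following is a minimal sketch under that assumption; the function name `parse_claims` and the regular expression are hypothetical and not part of the original pipeline.

```python
import re

# Matches "claim": "<text>" pairs, allowing escaped quotes inside the text.
CLAIM_PATTERN = re.compile(r'"claim"\s*:\s*"((?:[^"\\]|\\.)*)"')


def parse_claims(raw_response):
    """Extract claim strings from a raw LLM response, dropping duplicates.

    Works for both the bracketed pair format used in the prompt and a
    list-of-dictionaries format, since only the key/value pairs are matched.
    """
    claims = []
    for match in CLAIM_PATTERN.finditer(raw_response):
        claim = match.group(1).strip()
        if claim and claim not in claims:
            claims.append(claim)
    return claims


if __name__ == "__main__":
    raw = ('["claim": "Tinder only displays the last 34 photos", '
           '"claim": "Tinder users can easily see more photos"]')
    for claim in parse_claims(raw):
        print(claim)
```

Matching individual key/value pairs instead of parsing the whole response keeps the extraction usable when the backbone adds stray text around the list.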

E.2 The Prompt Used for Response Evaluation

In this section, we provide the prompt used for response evaluation, which produces the correctness scores of the model’s responses. These scores are used to judge whether the uncertainty evaluation result agrees with the actual correctness of the responses.

<<INST>><<SYS>> You are given a question, a reference (ground truth) answer, and an actual answer in each round of the task. Task Rate the level of consistency between the actual answer to the reference answer in each question.

We then specify the value range and the response format:

[Evaluation Range] The evaluation value should range from 0 to 100. [Response Format] PLEASE JUST GIVE ME A NUMBER WITHOUT ANY OTHER WORDS OR EXPLANATION.

Then we can apply few-shot examples to enhance the tool-LLM’s understanding of its task. We provide some guidance here; readers can specify their own demonstrations by defining a variable few_shots that contains examples, each a triplet of elements (Question, Reference, Answer):

[examples]: Question: few_shots[0][’question’] Reference: few_shots[0][’reference’] Answer: few_shots[0][’answer’] Rating: 100. Question: few_shots[1][’question’] Reference: few_shots[1][’reference’] Answer: few_shots[1][’answer’] Rating: 0.
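The prompt fragments above can be assembled programmatically. The sketch below is illustrative only: the function `build_eval_prompt`, the `ratings` argument, and the final query segment are assumptions (the source does not show how the query to be rated is appended), while the instruction strings and the `few_shots` keys are copied from the text.

```python
SYSTEM_PROMPT = (
    "<<INST>><<SYS>> You are given a question, a reference (ground truth) "
    "answer, and an actual answer in each round of the task. Task Rate the "
    "level of consistency between the actual answer to the reference answer "
    "in each question."
)
FORMAT_PROMPT = (
    "[Evaluation Range] The evaluation value should range from 0 to 100. "
    "[Response Format] PLEASE JUST GIVE ME A NUMBER WITHOUT ANY OTHER WORDS "
    "OR EXPLANATION."
)


def build_eval_prompt(few_shots, ratings, question, reference, answer):
    """Assemble the evaluation prompt from instructions and demonstrations.

    few_shots: list of dicts with keys 'question', 'reference', 'answer'.
    ratings:   one rating per demonstration (e.g. 100 for a match, 0 otherwise).
    """
    parts = [SYSTEM_PROMPT, FORMAT_PROMPT, "[examples]:"]
    for shot, rating in zip(few_shots, ratings):
        parts.append(
            f"Question: {shot['question']} Reference: {shot['reference']} "
            f"Answer: {shot['answer']} Rating: {rating}."
        )
    # Assumed query segment: left open so the model completes the rating.
    parts.append(
        f"Question: {question} Reference: {reference} Answer: {answer} Rating:"
    )
    return " ".join(parts)


if __name__ == "__main__":
    # Placeholder demonstrations (hypothetical content).
    demos = [
        {"question": "Q1", "reference": "R1", "answer": "A1"},
        {"question": "Q2", "reference": "R2", "answer": "A2"},
    ]
    print(build_eval_prompt(demos, [100, 0], "Q3", "R3", "A3"))
```

Pairing one clearly matching demonstration (rating 100) with one clearly mismatching demonstration (rating 0) mirrors the two examples given in the text.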

E.3 The Prompt Used for Response Claim Augmentation

This section introduces the prompt template that augments the claims extracted from the response. By reflecting on the question being asked, the LLM should complete a claim if any part is missing and resolve the vagueness if any sentence is unclear.

<<INST>><<SYS>> You are given two pieces of text identified as a question and response claim. A claim is a statement that asserts something as true or false, which can be verified by humans. Task Your task is to first understand every claim stated in the provided text. Then, augment the claims by considering the question being asked, complete the sentence if any part is missing, and resolve any coreference (pronouns or other referring expressions) in the claim for clarity.

Similarly, add the response constraint if your LLM backbone does not perform stably. After that, the agent can be asked to finish the following task, given the question and the claim to be augmented:

Now complete the following: [text]: Question:q, Claim:c [augmented claim]: [/INST]’

Please note that prompt performance may vary across LLMs. This prompt was tested on LLM3-8b; practitioners may need to tune the prompt segments when applying other backbones.

Figure 7: The AUARC improvement of our method D-UE over purely semantics-level Uncertainty Quantification (UQ). Each sub-figure shows one comparison, and our methods consistently perform better. Note that "our methods" refers to D-UE layered on the existing semantics methods and integrated with the response augmentation and directional entailment.
Figure 8: The AUROC improvement of our method D-UE over purely semantics-level Uncertainty Quantification (UQ). Similarly, each sub-figure shows one comparison, and our methods consistently perform better than using semantics UQ alone.