LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation

Longchao Da$^{1}$, Tie** Chen$^{1}$, Lu Cheng$^{2}$, Hua Wei$^{1}$ (corresponding author)

$^{1}$ Arizona State University, $^{2}$ University of Illinois Chicago

{longchao, tchen169, hua.wei}@asu.edu, [email protected]

Abstract

Large language models (LLMs) have shown strong capabilities on sophisticated tasks across many domains. Beyond basic question answering (QA), they are now used as decision assistants and as explainers of unfamiliar content. However, they are not always correct, owing to data sparsity in domain-specific corpora or to hallucination. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate uncertainty that captures directional instability: we construct a directed graph from entailment probabilities and, because the constructed graph is asymmetric, apply a Random Walk Laplacian to it; the uncertainty is then aggregated from the eigenvalues of that Laplacian. We also provide a way to combine the semantic uncertainty of existing work with our proposed layer. In addition, this paper identifies vagueness issues in the raw response set and proposes an augmentation approach to mitigate them. Extensive experiments demonstrate the superiority of the proposed solutions.

1 Introduction

Large Language Models (LLMs) Chang et al. (2024) have attracted broad attention: they demonstrate strong performance on various tasks and have even been reported to conduct human-like conversations that break the Turing Test Biever (2023). Opinions on the emerging LLM techniques differ, ranging from skeptical to accepting Kambhampati et al. (2024); Valmeekam et al. (2022), but one major concern is commonly acknowledged: the trustworthiness of LLM responses is not guaranteed Sun et al. (2024); Liu et al. (2023); Huang et al. (2024). Trustworthiness has become a key obstacle to deploying LLMs in critical scenarios such as healthcare Yang et al. (2023); Wang et al. (2024b), autonomous control Wang et al. (2024a), and intelligent planning Kambhampati et al. (2024); Da et al. (2024b).

This has led many researchers to investigate uncertainty quantification (UQ) approaches to better understand LLM inferences, e.g., how well a model estimates system dynamics given some context information Da et al. (2024a). However, UQ in Natural Language Generation (NLG) brings distinct challenges because of its intrinsic semantic features, linguistic ambiguity, and complex output structures Lin et al. (2023). Another challenge in UQ for LLMs is the limited access to commercial large models: unavailable model parameters and true prediction probabilities greatly hinder intrinsic profiling of the model’s behavior and make white-box uncertainty evaluation unachievable Balloccu et al. (2024).

Figure 1: The left part shows an example of directional entailment logic, where (R1, Q) $\vdash$ (R2, Q) denotes the probability that R1 entails R2 given the context of question Q. The right part shows the difference between existing symmetric similarity and our proposed directed relations.

Alternatively, researchers resort to black-box quantification Lin et al. (2023) within limited question and response sets. A common practice for black-box evaluation is to first build a similarity matrix from a set of responses and then detect inconsistency among these responses, either by applying a graph Laplacian or by analyzing the entropy of the response set Kuhn et al. (2023). However, the current literature only considers how similar two responses are when constructing the matrix, which assumes that the similarity from response A to B is the same as that from B to A. In linguistics, however, a pair of sentences carries directional logic information. For example, as in Fig. 1, the pair of responses yields two dramatically different entailment probabilities (measured by the NLI model; footnote 1: https://huggingface.co/microsoft/deberta-v3-large). This implies directional information in the response set that existing work neglects, either by taking the mean of the two directions or by relying on undirected semantic measures.

In this paper, we propose Directed Uncertainty Evaluation (D-UE), which uses a directed graph weighted by entailment probabilities to construct a more nuanced relationship that captures the direction between responses while also carrying their semantic similarity. We also find that the generated responses themselves may be vague, which brings more challenges to the UQ process. We therefore propose a claim-based augmentation method that reduces vagueness and mines the model’s real awareness of a question, further enhancing UQ for LLMs.

2 Related Work

The first branch of research addresses UQ in LLMs by inducing the model to output its uncertainty along with the response Kadavath et al. (2022); Lin et al. (2022); Mielke et al. (2020); Tian et al. (2023). Most of this literature requires token-level probabilities from the LLM to train (or fine-tune) and predict the uncertainty. This is a straightforward solution when full access to the model structure and weights is available, but it can be unwieldy because it is time-consuming and resource-intensive. Another method Kuhn et al. (2023) estimates LLM uncertainty directly from response-level semantic entropy, yet it still requires token-related probability values as input, which are hard to access for black-box or commercial language models.

For fast and lightweight evaluation, some researchers treat the LLM as a black box and analyze the consistency of the semantic structure of its responses. Lin et al. (2023) first analyze UQ from text responses, treating the sum of eigenvalues from the graph Laplacian as the uncertainty indicator. Chen and Mueller (2023) identify unreliable or speculative answers by computing a confidence score for the generated outputs. However, these works analyze UQ solely from semantics, and Lin et al. (2023) average the entailment probabilities of the two directions to construct a similarity matrix. In this paper, we find that claims, together with semantic information, contribute to a more comprehensive uncertainty quantification, and that directional logic is not negligible in a nuanced analysis of the intrinsic structure of responses.

3 Preliminaries

In this section, we formalize uncertainty evaluation in LLMs. Let $\mathcal{M}$ be a general LLM with a given network structure and parameter set $\theta$ . During prompt-based inference, an input $x$ is provided to $\mathcal{M}$ , and the model produces a sequence of tokens $\hat{y}$ drawn from a probability distribution $\mathbf{p}(\hat{y}|x,\theta)$ . This probability distribution plays a key role in understanding the characteristics of $\mathcal{M}$ .

For those who train the model from scratch or use fully open-source models, the logits (or even the model parameters) are available for analysis and evaluation; this branch of practice is referred to as White Box evaluation. When, for commercial, competitive, or other reasons, there is no direct access to the logits, uncertainty evaluation is referred to as Black Box evaluation.

3.1 White Box Evaluation

Traditionally, researchers could conduct uncertainty evaluation through gradient norms with respect to either the input or the parameters: $U_{\text{grad-input}}(x)=\left\|\nabla_{x}\mathcal{L}(\hat{y},y)\right\|_{2}$ and $U_{\text{grad-param}}(x)=\left\|\nabla_{\theta}\mathcal{L}(\hat{y},y)\right\|_{2}$ , where $\mathcal{L}(\cdot,\cdot)$ is the loss function between the predicted output $\hat{y}$ and the ground truth $y$ . The gradient is calculated from these two aspects to understand the model’s sensitivity. This requires the true label $y$ , which is not suitable for open-ended questions or prompts. An alternative is to evaluate inconsistency based on the list of responses $\textbf{Y}=\{y_{1},y_{2},y_{3},\ldots,y_{n}\}$ and their probabilities P, for example by using entropy Kuhn et al. (2023); Sun et al. (2019); Abdar et al. (2021): $U(x)=H(\textbf{Y}|x)=-\sum_{\textbf{y}}p(\textbf{y}|x)\log p(\textbf{y}|x)$ , where $x$ is the input and each y is a sequence of generated tokens (a response).
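As a minimal sketch of the entropy-based indicator, assume the sequence probabilities of the sampled responses are available and are renormalized over the sampled set. The renormalization, the function name `response_entropy`, and the example values are illustrative assumptions, not part of the original method description.

```python
import math

def response_entropy(probs):
    """Entropy U(x) = -sum_y p(y|x) log p(y|x) over the sampled responses.

    `probs` holds one sequence probability per sampled response.
    Assumption: probabilities are renormalized over the sample so they sum to 1.
    """
    total = sum(probs)
    normalized = [p / total for p in probs if p > 0]
    return -sum(p * math.log(p) for p in normalized)

# Hypothetical probabilities: a peaked response set versus a flat one.
confident = response_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = response_entropy([0.25, 0.25, 0.25, 0.25])
```

A flat distribution over responses gives the largest entropy, which is read as higher uncertainty.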

3.2 Black Box Evaluation

For black box evaluation, the evaluator only has the text-level responses Wang et al. (2023). This typically requires a more nuanced and deeper understanding of the model’s output stability and of the potential response structure from limited Q-A samples. Assume there are $n$ response samples to the same question $q$ . The common practice is to construct a matrix $\mathcal{S}$ that encapsulates the similarity information among the responses:

$$\mathcal{S}=\begin{bmatrix}1&s_{12}&s_{13}&\cdots&s_{1n}\\ s_{21}&1&s_{23}&\cdots&s_{2n}\\ s_{31}&s_{32}&1&\cdots&s_{3n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ s_{n1}&s_{n2}&s_{n3}&\cdots&1\end{bmatrix}\tag{1}$$

where each entry $s_{ij}$ , $i,j\in\{1,\ldots,n\}$ , is the calculated pairwise similarity score.

Given a list of responses $R=\{r_{1},r_{2},\ldots,r_{n}\}$ , the pairwise similarity score $s_{ij}$ between responses $r_{i}$ and $r_{j}$ can be calculated using a general similarity function: $s_{ij}=\text{sim}(r_{i},r_{j})$ . Current explorations of the matrix $\mathcal{S}$ are mainly based on symmetric similarity measures such as Jaccard similarity or word-vector similarity, implying $s_{ij}=s_{ji}$ in $\mathcal{S}$ . This condition permits the use of the Normalized Laplacian to understand the hidden structure in the response space Lin et al. (2023):

$$L:=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}\tag{2}$$

where the weighted adjacency matrix $W$ is from the symmetric similarity matrix $\mathcal{S}$ , and the degree matrix is:

$$D_{r_{i},r_{j}}=\begin{cases}\sum_{j^{\prime}\in[n]}w_{i,j^{\prime}}&(r_{i}=r_{j})\\ 0&(r_{i}\neq r_{j})\end{cases}\tag{3}$$

where the diagonal element $D_{r_{i},r_{j}}$ ( $r_{i}=r_{j}$ ) is the degree of node $r_{i}$ , i.e., the sum of the weights (similarities $s_{i,j^{\prime}}$ ) of all edges connected to it, with $j^{\prime}$ ranging over the $n$ responses.

One can then use the eigenvalues of the constructed symmetric Laplacian to represent the connectivity of the graph and take them as an indicator of uncertainty: $U_{\text{EigV}}=\sum_{k=1}^{n}\max(0,1-\lambda_{k})$ , where $\lambda_{k}$ is the $k$-th eigenvalue of the Laplacian $L$ .
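The symmetric baseline of Eqs. 2 and 3 can be sketched with NumPy as follows. This is an illustrative sketch rather than the implementation of Lin et al. (2023); the function name and the two toy matrices are assumptions, and every node is assumed to have positive degree (guaranteed here by the unit diagonal).

```python
import numpy as np

def eigv_uncertainty_symmetric(W):
    """U_EigV computed from L = I - D^{-1/2} W D^{-1/2} for a symmetric W."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)                      # diagonal of D (Eq. 3)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    eigenvalues = np.linalg.eigvalsh(L)          # L is symmetric, so eigenvalues are real
    return float(np.sum(np.maximum(0.0, 1.0 - eigenvalues)))

# Toy similarity matrices (hypothetical): fully consistent vs. fully scattered responses.
u_consistent = eigv_uncertainty_symmetric(np.ones((3, 3)))
u_scattered = eigv_uncertainty_symmetric(np.eye(3))
```

In these two extreme cases the score behaves like a soft count of semantic clusters: one cluster when all responses agree, and three when none of them are related.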

3.3 Discussion

Since the white box evaluation places a strict requirement on the original model, we analyze from the perspective of the black box evaluation in this paper.

First, current black box evaluation assumes that $s_{ij}=s_{ji}$ . In actual knowledge representation logic, however, this neglects the directional information between two responses. If proposition $A$ entails proposition $B$ (denoted $A\vdash B$ ), then whenever $A$ is true, $B$ must also be true. This is a one-way relationship and is not necessarily symmetric; that is, $A\vdash B$ does not imply $B\vdash A$ . Constructing a symmetric matrix therefore discards directional information in the original response set. In this paper, we propose to reconstruct the response relationship as a directed graph and provide a Random Walk Laplacian uncertainty evaluation method that fits the asymmetric property of the constructed graph.

Second, for a response set with long answers, a response that contains more than one claim is easily miscalculated in similarity, whether from the semantic or the knowledge-claim aspect. For example, for the question ‘How many students became heroes?’, the language model $\mathcal{M}$ gives two answers, A: ‘Andrew Willis, Chris Willis, Reece Galea’ and B: ‘Three students became heroes’. According to the context, answer A is partially correct because it names the correct persons, yet the similarity between A and B is near 0 whether it is calculated from entailment, Jaccard similarity, or similar measures. This motivates our proposal to apply claim-based augmentation before the uncertainty evaluation to recover the intended meaning of each response.

4 Uncertainty Evaluation within Directed Entailment Graph: D-UE

In this section, we discuss how to formally model the logical direction information Kripke (1959); Dagan and Glickman (2004) in the responses through their different entailment probabilities, and how claim-based response augmentation helps mine potential semantic information. We then provide a way to integrate our method with uncertainty derived from a plain semantic similarity matrix, which allows our method to be layered on any existing method that overlooks directional entailment information. The overall framework of D-UE, compared with traditional UQ based on a symmetric measure, is shown in Fig. 2.

4.1 Directional Entailment Graph

To preserve the directional entailment information in a response set $R$ , we adopt an NLI (Natural Language Inference) model to provide pairwise entailment measurements over $R$ Williams et al. (2017); Bowman et al. (2015). Following Kuhn et al. (2023), the employed NLI model (footnote 2: an off-the-shelf DeBERTa-large model) takes two text elements $r_{i}$ and $r_{j}$ and returns a three-element tuple:

$$[\textit{logit}_{cont},\textit{logit}_{neut},\textit{logit}_{ent}]=\overrightarrow{\textit{NLI}}(r_{i},r_{j})\tag{4}$$

The output is transformed into probabilities through:

$$\textbf{p}=\textit{Softmax}(\textit{logit}_{cont},\textit{logit}_{neut},\textit{logit}_{ent})\tag{5}$$

where $\overrightarrow{p_{ent}}(r_{i},r_{j})=p(r_{i}\vdash r_{j})=\textbf{p}_{3}$ is the entailment probability of $r_{i}\vdash r_{j}$ . At this point, an asymmetric entailment matrix $\mathcal{S}$ is obtained for constructing the directed graph $\mathcal{G}_{d}=(V,\overrightarrow{E})$ , where $V=R$ is the set of responses with $|V|=n$ and $\overrightarrow{E}$ is the set of directed edges weighted primarily by the entailment probabilities:

$$\overrightarrow{E}=\{(v_{i},v_{j})\mid\forall i,j,\ \text{with weight }p(r_{i}\vdash r_{j})\}\tag{6}$$

Thus, the adjacency matrix $A$ of the directed graph $\mathcal{G}_{d}$ can be defined as $A_{ij}=p(r_{i}\vdash r_{j})$ , where $A_{ij}$ is the weight of the directed edge from vertex $v_{i}$ to vertex $v_{j}$ (each vertex corresponds to a response, so the two terms are used interchangeably in later sections). $A_{ij}$ does not necessarily equal $A_{ji}$ unless the two responses are identical. The constructed $\mathcal{G}_{d}$ represents directional semantic logic, in contrast to semantic similarity (sem), which may average the two entailment directions, $A_{ij,\textit{sem}}=A_{ji,\textit{sem}}=\frac{p(r_{i}\vdash r_{j})+p(r_{j}\vdash r_{i})}{2}$ , and thereby loses part of the information.
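The construction in Eqs. 4 to 6 can be sketched as below. The logits are hypothetical placeholders rather than outputs of the DeBERTa model, the helper names are illustrative, and setting the diagonal to 1 (a response entails itself) is an assumption made for this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over [contradiction, neutral, entailment] logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entailment_adjacency(pair_logits, n):
    """Directed adjacency with A[i][j] = p(r_i entails r_j).

    Assumption: the diagonal is fixed to 1.0 (self-entailment).
    """
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), logits in pair_logits.items():
        A[i][j] = softmax(logits)[2]  # third entry is the entailment class (Eq. 5)
    return A

# Hypothetical NLI logits for one ordered pair in each direction.
pair_logits = {(0, 1): [-2.0, 0.0, 3.0], (1, 0): [0.5, 2.0, -1.0]}
A = entailment_adjacency(pair_logits, n=2)

# Symmetric "semantic" variant that averages both directions.
A_sem = [[(A[i][j] + A[j][i]) / 2 for j in range(2)] for i in range(2)]
```

The directed matrix keeps the strong entailment from response 0 to response 1 and the weak one in the reverse direction, while the averaged matrix reports the same value for both.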

4.2 Enhance the Graph with Text Similarity

Based on the constructed directed entailment graph, it is feasible to incorporate text similarity to enrich the information in the graph. We consider another matrix, the text similarity matrix $\mathcal{T}$ , with the same size $n\times n$ as $\mathcal{S}$ , and enrich the information carried by the edges between nodes with jointly weighted values from both the entailment and the text similarity matrices.

Let $\mathcal{S}=[s_{ij}]$ and $\mathcal{T}=[t_{ij}]$ denote the entailment and text similarity matrices, respectively. We define the weight of the edge from node $i$ to node $j$ in the graph $\mathcal{G}_{d}$ as $w_{ij}=s_{ij}+t_{ij}$ . The adjacency matrix $A$ of the graph $\mathcal{G}_{d}$ can then be updated with the weights $w_{ij}$ .

Note that, to obtain $\mathcal{T}$ , text similarity can be measured in multiple ways, such as TF-IDF Aizawa (2003), cosine similarity, or word embeddings. In this paper, since we are measuring responses to the same question, we provide an implementation with Jaccard similarity based on set operations:

$$J(r_{i},r_{j})=\frac{|r_{i}\cap r_{j}|}{|r_{i}\cup r_{j}|}\tag{7}$$

where the response sentences $r_{i}$ and $r_{j}$ , each containing multiple phrases and words, are treated as two sets.
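A minimal sketch of Eq. 7 and of the edge weights $w_{ij}=s_{ij}+t_{ij}$ follows. Treating a response as a set of lowercase whitespace-separated words is an assumption of this sketch, and the entailment values in `S` are hypothetical.

```python
def jaccard(r_i, r_j):
    """Jaccard similarity of two responses treated as sets of lowercase words."""
    a, b = set(r_i.lower().split()), set(r_j.lower().split())
    if not a and not b:
        return 1.0  # two empty responses are treated as identical
    return len(a & b) / len(a | b)

def combine(S, T):
    """Edge weights w_ij = s_ij + t_ij from entailment S and text similarity T."""
    n = len(S)
    return [[S[i][j] + T[i][j] for j in range(n)] for i in range(n)]

responses = ["three students became heroes", "three students"]
T = [[jaccard(a, b) for b in responses] for a in responses]
S = [[1.0, 0.9], [0.2, 1.0]]  # hypothetical entailment probabilities
W = combine(S, T)
```

Because $\mathcal{T}$ is symmetric, the combined weights remain asymmetric only through the entailment term.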

4.3 Random Walk Laplacian

For a directed graph $\mathcal{G}_{d}$ , the connectivity of the nodes (responses) reflects the potential semantic clusters in the response set $R$ . We can analyze the graph by constructing a Laplacian and deriving its eigenvalues, which reflect the dispersion of the nodes. In the given scenario, this reveals the uncertainty of the black box model that generated the response set for a given question.

However, $\mathcal{G}_{d}$ is special because of its asymmetry, so the Normalized Laplacian, the Symmetric Graph Laplacian, and similar operators are no longer suitable, since they require a symmetric matrix. We propose to employ the Random Walk Laplacian, which focuses on the out-degree of nodes, to handle this directional, asymmetric structure. The out-degree matrix is calculated as $\mathbf{D}_{\text{out}}=\text{diag}(d_{\text{out},1},d_{\text{out},2},\ldots,d_{\text{out},n})$ , where $d_{\text{out},i}=\sum_{j=1}^{n}a_{ij}$ is the out-degree of node $r_{i}$ and $a_{ij}$ is the entry of the adjacency matrix $A$ carrying the weight from $r_{i}$ to $r_{j}$ .

Then, the inverse of the out-degree matrix is computed in regularized form as $\mathbf{D}_{\text{out}}^{-1}=(\mathbf{D}_{\text{out}}+\epsilon\mathbf{I})^{-1}$ , where $\mathbf{I}$ is the identity matrix and $\epsilon$ is a small positive constant to avoid division by zero. The random walk Laplacian matrix $\mathbf{L}_{\text{rw}}$ is then defined as:

$$\mathbf{L}_{\text{rw}}=\mathbf{I}-\mathbf{D}_{\text{out}}^{-1}\mathbf{A}\tag{8}$$

We compute the eigenvalues $\lambda_{k}$ of the random walk Laplacian matrix $\mathbf{L}_{\text{rw}}$ , where $k=1,2,\ldots,n$ . For details, please refer to Appendix A. The uncertainty measure $\mathbf{U}_{\text{EigV}}^{d}$ is then computed by

$$\mathbf{U}_{\text{EigV}}^{d}=\sum_{k=1}^{n}\max(0,1-\lambda_{k})\tag{9}$$

This measure captures the extent to which the eigenvalues $\lambda_{k}$ fall below 1, providing a representation of the uncertainty in the language model’s responses. Note that for each question $q$ with response set $R_{q}$ , where $r_{i}\in R_{q}$ and $|R_{q}|=n$ , our method derives one aggregated uncertainty value by Eq. 9.
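Eqs. 8 and 9 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the function name and the default `eps` are hypothetical, and because a non-symmetric matrix can have complex eigenvalues, the sketch keeps only their real part, a choice the text above does not specify.

```python
import numpy as np

def directed_eigv_uncertainty(A, eps=1e-8):
    """U^d_EigV from the random walk Laplacian L_rw = I - D_out^{-1} A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d_out = A.sum(axis=1)                    # out-degree of each response
    d_inv = np.diag(1.0 / (d_out + eps))     # regularized inverse (D_out + eps*I)^{-1}
    L_rw = np.eye(n) - d_inv @ A
    eigenvalues = np.linalg.eigvals(L_rw)    # general (non-symmetric) eigen-solver
    # Assumption: use the real part of possibly complex eigenvalues.
    return float(np.sum(np.maximum(0.0, 1.0 - eigenvalues.real)))

# Toy adjacency matrices (hypothetical weights).
u_consistent = directed_eigv_uncertainty(np.ones((3, 3)))
u_scattered = directed_eigv_uncertainty(np.eye(3))
u_asymmetric = directed_eigv_uncertainty([[1.0, 0.9], [0.1, 1.0]])
```

As with the symmetric case, a fully connected response set gives a value close to 1 and a fully disconnected one gives a value close to $n$.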

It is important to apply the Random Walk Laplacian to the directed graph $\mathcal{G}_{d}$ because, in the directed graph, the probability of transition from node $i$ to node $j$ is defined by $P_{i\rightarrow j}=\frac{A_{i\rightarrow j}}{\sum_{k}A_{i\rightarrow k}}$ , where $A_{i\rightarrow j}$ is the weight in the adjacency matrix and $k$ ranges over all nodes accessible from $i$ . If two responses satisfy $entail(r_{i}\vdash r_{j})\neq entail(r_{j}\vdash r_{i})$ , the two transition probabilities generally differ, which makes a difference in profiling the characteristics of the response set:

$$A_{i\to j}\neq A_{j\to i}\implies P_{i\to j}\neq P_{j\to i}\tag{10}$$

This structure captures the asymmetric information in the entailment probabilities between the two directions of a pair of responses.
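As a small worked illustration with made-up weights (not taken from the experiments), assume two responses with self-loop weight 1, $A_{1\to 2}=0.9$ and $A_{2\to 1}=0.1$ . The last line uses the averaged weight $(0.9+0.1)/2=0.5$ of the symmetric construction for comparison:

```latex
\begin{aligned}
P_{1\to 2} &= \frac{A_{1\to 2}}{A_{1\to 1}+A_{1\to 2}} = \frac{0.9}{1+0.9}\approx 0.474,\\
P_{2\to 1} &= \frac{A_{2\to 1}}{A_{2\to 1}+A_{2\to 2}} = \frac{0.1}{0.1+1}\approx 0.091,\\
P^{\textit{sem}}_{1\to 2} &= P^{\textit{sem}}_{2\to 1} = \frac{0.5}{1+0.5}\approx 0.333 .
\end{aligned}
```

Averaging the two directions collapses both transition probabilities to the same value and hides the fact that response 1 entails response 2 far more strongly than the reverse.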

4.4 Integrate Directional Entailment Uncertainty with Semantics Uncertainty

The uncertainty derived in Eq. 9 represents uncertainty from directional entailment probability and text inconsistency, as introduced in Section 4.2. Since multiple solutions exist for semantic uncertainty measurement Lin et al. (2023); Kuhn et al. (2023), we propose a way to integrate $\mathbf{U}_{\text{EigV}}^{d}$ from D-UE with another semantic uncertainty $\mathbf{U}^{s}$ , and thus obtain a multi-angle evaluation on limited response sets.

The simplest way is to directly add the two measures $\mathbf{U}_{\text{EigV}}^{d,i}$ and $\mathbf{U}^{s,i}$ on the same response set $R_{q}^{i}$ . However, when there are multiple response sets $R_{q}^{i}\in\mathcal{R}$ , $i\in\{1,2,\ldots,h\}$ , direct aggregation cannot guarantee that a change in the ordering of the response sets is caused by the contribution of the uncertainties themselves: because $\mathbf{U}_{\text{EigV}}^{d,i}$ and $\mathbf{U}^{s,i}$ are on different scales (some measures lie in [0, 1] and some are unbounded), an order change after computing $\mathbf{U}_{\text{EigV}}^{d,i}+\mathbf{U}^{s,i}$ is probably caused by the difference in absolute value range. Thus, instead of working on a single $R_{q}^{i}$ , we consider the whole response space $\mathcal{R}$ , which contains the response sets of multiple questions, yielding $\mathbf{\mathcal{U}}_{\text{EigV}}^{d}$ and $\mathbf{\mathcal{U}}^{s}$ with $|\mathbf{\mathcal{U}}_{\text{EigV}}^{d}|=|\mathbf{\mathcal{U}}^{s}|=h$ , i.e., $h$ uncertainties for $h$ question-related response sets. We normalize the distribution of each measure by its $Z$ -score, $\text{Normalized}(X)=\frac{X-\mu_{X}}{\sigma_{X}}$ ; replacing $X$ with $\mathbf{\mathcal{U}}_{\text{EigV}}^{d}$ and $\mathbf{\mathcal{U}}^{s}$ gives $\mathbf{\mathcal{\hat{U}}}_{\text{EigV}}^{d}$ and $\mathbf{\mathcal{\hat{U}}}^{s}$ . We then derive $\mathbf{\mathcal{\hat{U}}}=\mathbf{\mathcal{\hat{U}}}_{\text{EigV}}^{d}/2+\mathbf{\mathcal{\hat{U}}}^{s}/2$ , which contains both semantic and directional uncertainty, and an order change in $\mathbf{\mathcal{\hat{U}}}$ is then contributed by the semantic uncertainty from $\mathbf{\mathcal{U}}^{s}$ rather than by a difference in scale.
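The normalization and averaging step can be sketched as below. The use of the population standard deviation is an assumption (the text does not say which estimator is used), and the function names and uncertainty values are hypothetical.

```python
import statistics

def z_score(values):
    """Z-score normalization over the h per-question uncertainties."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def combine_uncertainties(u_directed, u_semantic):
    """Equal-weight average of the two normalized uncertainty vectors."""
    z_d, z_s = z_score(u_directed), z_score(u_semantic)
    return [d / 2 + s / 2 for d, s in zip(z_d, z_s)]

# Hypothetical uncertainties for h = 4 questions, on different scales.
u_directed = [1.2, 2.5, 1.8, 3.1]      # unbounded eigenvalue-based scores
u_semantic = [0.10, 0.40, 0.35, 0.90]  # scores in [0, 1]
combined = combine_uncertainties(u_directed, u_semantic)
```

After normalization both measures have zero mean and unit variance across questions, so neither dominates the combined ranking merely because of its value range.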

5 Claim Based Response Augmentation

Sometimes, the raw responses from the language model $\mathcal{M}$ cannot fully reveal its awareness of a problem because they contain multiple claim points but only short descriptions. In the example responses in Section 3.3, ‘Three students became heroes’ and ‘Andrew Willis, Chris Willis, Reece Galea’ are a pair of responses that share the same underlying meaning: ‘Andrew Willis, Chris Willis, and Reece Galea are three students who became heroes’. Directly using raw responses like these lowers ( $\downarrow$ ) the True Positive rate and raises ( $\uparrow$ ) the False Negative rate, leading to a biased evaluation.

In this section, inspired by Choi and Ferrara (2024), we propose to augment raw responses at the claim level, aiming to identify potentially correct claims hidden in incomplete or vague responses. Note that we do not conduct fact-checking here; instead, we rely on the context information to provide claim augmentation, so the task is easier and can feasibly be accomplished by other pre-trained LLMs. Specifically, the task can be formalized as:

Given a question $q$ and a response set $R=\{r_{1},r_{2},\ldots,r_{n}\}$ , for each $r_{i}\in R$ that contains $k$ claims $c_{1},\ldots,c_{k}$ , augment each claim $c_{k}\rightarrow c_{k}^{aug}$ and derive $r_{i}^{aug}$ with a more explicit and comprehensive description.

To realize this, the first step is to identify the claim atoms in a response $r_{i}$ . This step only needs a basic ability to understand context; we verified that Llama-3 is adequate for the task and used it for claim extraction. The second step is to extend the extracted claims by recalling the question, which helps align the claim descriptions with the question. Finally, the augmented claims are combined into a more comprehensive answer $r_{i}^{aug}$ as follows:

$$r_{i}^{aug}=\textit{Augmentor}(c_{1},c_{2},\ldots,c_{k})_{c_{k}\leftarrow r_{i}}\tag{11}$$

where $\leftarrow$ indicates that claim $c_{k}$ originates from $r_{i}$ . The response-level Augmentor performs two steps. First, it extends each claim with an Extender that takes the current claim $c_{k}$ and the question $q$ as follows:

$$c_{k}^{aug}=\textit{Extender}(c_{k},q)\tag{12}$$

The task in Eq. 12 is simple because it generates a sequence based on existing input content, so it can be fulfilled by a general language model such as Llama-3 (footnote 3: https://github.com/meta-llama/llama3), which is used in this paper, with the necessary prompt. Second, it concatenates ( $\oplus$ ) all of the augmented claims to form $r_{i}^{aug}$ :

$$r_{i}^{aug}=\oplus\{c_{1}^{aug},c_{2}^{aug},\ldots,c_{k}^{aug}\}\tag{13}$$

The eventual evaluation set $R^{aug}$ can be achieved by traversing all of the $r_{i}\in R$ .

Note that an original low-quality response $r^{\times}$ should be left unchanged by the augmentation process, to preserve the original error generated by $\mathcal{M}$ as part of the evaluation evidence. We collect these responses with a regular expression, as implemented in the code (footnote 4: the code will be released after publication and is available upon request for now).

In this paper, the augmentation is conducted following the above procedure, and the D-UE takes the $R^{aug}$ as the eventual input for uncertainty evaluation.
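The control flow of Eqs. 11 to 13 can be sketched as below. In the paper the claim extraction and the Extender are carried out by Llama-3 with prompts; here they are replaced by simple stand-in functions (`toy_extract`, `toy_extend`) so the sketch is runnable. Those stand-ins and the low-quality pattern are hypothetical, since the actual prompts and regular expression are not given in this section.

```python
import re

def augment_response(response, question, extract_claims, extend_claim,
                     low_quality_pattern=r"^\W*$"):
    """Return r_aug for one response; low-quality responses are kept unchanged."""
    if re.match(low_quality_pattern, response):
        return response  # preserve the original error as evaluation evidence
    claims = extract_claims(response)                        # claim atoms c_1..c_k
    extended = [extend_claim(c, question) for c in claims]   # Eq. 12
    return " ".join(extended)                                # concatenation, Eq. 13

# Stand-ins for the LLM-backed components (hypothetical, for illustration only).
def toy_extract(response):
    return [part.strip() for part in response.split(",") if part.strip()]

def toy_extend(claim, question):
    return f"In answer to '{question}': {claim}."

question = "How many students became heroes?"
R = ["Three students", "", "Andrew Willis, Chris Willis"]
R_aug = [augment_response(r, question, toy_extract, toy_extend) for r in R]
```

Passing the extractor and extender as arguments keeps the pipeline independent of any particular language model or prompt.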

6 Experiment

In this section, we design experiments to empirically demonstrate the effectiveness of our proposal for uncertainty evaluation of LLMs. Unless otherwise stated, D-UE denotes the directional entailment uncertainty on augmented response sets. We investigate the following research questions:

RQ1: Can D-UE improve the uncertainty evaluation when layered on existing methods that do not consider directional entailment logic?

RQ2: Does claim-level augmentation help produce a more robust evaluation?

RQ3: How does each module of our proposed method contribute to the final uncertainty quantification? An ablation study for D-UE.

6.1 Experiment Setups

In this paper, we use Llama3-8b for simple tasks such as the claim extraction introduced in Section 5 and the question-based atom-claim augmentation of Eq. 12. Each experiment that uses the NLI model applies a calibrated temperature of 3. All experiments are run on Ubuntu with a 13th Gen Intel(R) Core(TM) i9-13900KF CPU and an NVIDIA GeForce RTX 4090 GPU.

Datasets

We adopt two classic QA datasets, Coqa Reddy et al. (2019) (7,983 questions) and TriviaQA Joshi et al. (2017) (9,960 questions), and a long-form question answering dataset, NLQuAD Soleimani et al. (2021) (3,024 questions), which is more challenging and includes more claims in one response.

Evaluation Metrics and Process

As discussed in Lin et al. (2023), the commonly adopted AUROC has the limitation that it is very sensitive to imbalanced scenarios (it is likely to give an over-optimistic evaluation). The Area Under the Accuracy-Rejection Curve (AUARC) is an alternative metric that can better reflect evaluation performance; its calculation is shown in Appendix B. We use the two metrics as complementary evaluation indicators.
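For intuition, a discrete approximation of AUARC can be sketched as follows. This follows the common accuracy-rejection definition (answer the most confident questions first and track the accuracy of the retained set) and is an assumption of this sketch; the exact calculation used in the paper is the one in Appendix B, and the function name and toy values are hypothetical.

```python
def auarc(uncertainties, correctness):
    """Discrete approximation of the area under the accuracy-rejection curve.

    Questions are sorted from most to least confident (lowest uncertainty
    first); the accuracy of the k retained questions is averaged over all k.
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    running, accuracies = 0.0, []
    for kept, idx in enumerate(order, start=1):
        running += correctness[idx]
        accuracies.append(running / kept)
    return sum(accuracies) / len(accuracies)

correct = [1, 1, 0, 0]  # hypothetical correctness labels
informative = auarc([0.1, 0.2, 0.8, 0.9], correct)  # low uncertainty on correct answers
misleading = auarc([0.9, 0.8, 0.2, 0.1], correct)   # low uncertainty on wrong answers
```

An uncertainty measure that ranks wrong answers as more uncertain lets the evaluator reject them first, which raises the accuracy of what is kept and therefore the AUARC.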

Measure	Details
$U_{\textit{LexiSim}}$	Lexical similarity, measured as the average ROUGE-L.
$U_{\textit{NumSet}}$	Number of semantic sets, which coincides with the multiplicity of the zero eigenvalue.
$U_{\textit{SE}}$	Semantic entropy, i.e., the entropy over semantic sets.
$U_{\textit{Eigv}}(Dis)$	Spectral eigenvalues on the disagreement matrix.
$U_{\textit{Ecc}}(Dis)$	Average distance from the center in the responses’ disagreement.
$U_{\textit{Degree}}(Dis)$	Degree matrix of the disagreement matrix.
$U_{\textit{Eigv}}(Agre)$	Spectral eigenvalues on the agreement matrix.
$U_{\textit{Ecc}}(Agre)$	Average distance from the center in the responses’ agreement.
$U_{\textit{Degree}}(Agre)$	Degree matrix of the agreement matrix.
$U_{\textit{Eigv}}(Jacc)$	Spectral eigenvalues on the Jaccard similarity matrix.
$U_{\textit{Ecc}}(Jacc)$	Average distance from the center in the responses’ Jaccard measure.
$U_{\textit{Degree}}(Jacc)$	Degree matrix of the Jaccard similarity matrix.

Table 1: The baseline methods and explanations

To evaluate the performance of an ‘evaluator’, we first need to know the correctness of the responses to each question, and then evaluate how well the evaluator’s output uncertainty reflects that correctness (for a given question, the more uncertain the model is, the more likely it is to make mistakes and achieve low accuracy). In this paper, we adopt GPT3.5-turbo to produce a correctness score from 0 to 1; for details, please refer to Appendix C.

Baseline methods

We compare 12 baseline methods: $U_{\textit{LexiSim}}$ , $U_{\textit{NumSet}}$ , $U_{\textit{SE}}$ Kuhn et al. (2023), and the eigenvalue-based, eccentricity-based, and degree-based methods of Lin et al. (2023), each applied to three kinds of similarity matrix: disagreement, agreement, and Jaccard. A detailed explanation is given in Table 1.

6.2 Experiment Result and Analysis

In this section, we will discuss each of the research questions and the analysis of the proposed methods’ performance.

RQ1: We conducted experiments on three datasets across 12 baseline methods and verified that D-UE combined with semantic uncertainty performs consistently better than most of the baseline methods. As shown in Fig. 4, each bar represents the AUROC; for each $x$ label, e.g., NumSet, the blue bar is the baseline method’s performance and the pink bar is the performance after enhancing that baseline semantic uncertainty measure with directional entailment.

The performance evaluated by AUARC is shown side by side in Fig. 3. The left panel shows the baseline methods’ performance. In the right panel, D-UE improves the performance of all methods, which indicates that directional logic information is neglected by previous methods and can be further mined by D-UE. Some methods, such as numset, lexical_sim, and semanticEntropy, improve substantially because they consider only semantic similarity during computation. In contrast, methods such as eigv(Agre) and the degree-based ones show smaller improvements because they also rely on the graph structure to detect connectivity and may already account for degree in the UQ process. The major difference is that D-UE formally defines a directed graph and applies the Random Walk Laplacian, with reasonable theoretical support, which relaxes the symmetry requirement and can therefore be used with more flexibility. We also provide another set of experiments on GPT3.5’s responses on the Coqa dataset; as shown in Table 2, our method outperforms all of the baseline methods.

Coqa (GPT3.5)
Baselines	AUARC (Previous)	AUARC (Ours)	AUROC (Previous)	AUROC (Ours)
NumSet	0.4250	0.5605	0.5095	0.6660
LexiSim	0.5001	0.5467	0.6042	0.6471
Eigv(Dis)	0.5271	0.5574	0.6652	0.6733
Ecc(Dis)	0.4837	0.5603	0.5736	0.6675
Degree(Dis)	0.5320	0.5579	0.6654	0.6736
Eigv(Agre)	0.5355	0.5626	0.6769	0.6807
Ecc(Agre)	0.5295	0.5620	0.6669	0.6766
Degree(Agre)	0.5367	0.5615	0.6764	0.6800
Eigv(Jacc)	0.5179	0.5579	0.6463	0.6692
Ecc(Jacc)	0.5173	0.5550	0.6544	0.6694
Degree(Jacc)	0.5252	0.5560	0.6535	0.6693

Table 2: Performance for Coqa under GPT3.5

RQ2: To understand how claim-level augmentation helps reveal the potential relationships between responses, we conducted a case study on an example response set generated by Llama2-13b.

In this question-answer example, whose answer set consists of responses from Llama2-13b on the Coqa dataset, we found that, depending on the quality and stability of the language model, it may generate abbreviated or vague responses such as 3. ‘These three’ and 5. ‘Three high’. Even though these responses cover the key idea of the true answer, their incompleteness brings challenges to uncertainty evaluation, especially black-box evaluation, which can only build on the consistency among responses. It is hard to identify the relationship between sentences if the claim or meaning is not stated thoroughly. The blue text shows the augmented results produced by our Extender (Eq. 12); the detailed prompt is presented in the Appendix. For sentences 3 and 5, the claims are completed with the intention of the question, so their consistency in meaning is easier to detect. Fig. 5 shows heatmaps of the directional entailment probability $P(X\vdash Y)$ . On the left side, the upper part is $P_{R^{raw}}(X\vdash Y)$ from the original response set $R^{raw}$ and the lower part is $P_{R^{aug}}(X\vdash Y)$ after augmentation.

Comparing the original and augmented probability graphs, we observe that the probability that the incomplete sentences entail other responses is very low, even when these sentences include the correct answer. After the augmentation, the probability increases (e.g., for 3. and 5.), indicating that the potential relationship is discovered. The right side is the residual map, calculated by subtracting the probabilities without augmentation from those with augmentation: red indicates a stronger entailment relationship after augmentation, and blue indicates a reduced entailment probability.
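The residual map described above is an element-wise difference between two entailment-probability matrices. The following is a minimal sketch, assuming both matrices are square arrays whose entry [x, y] holds $P(X\vdash Y)$; the function name `residual_map` and the toy values are hypothetical and are not taken from the experiments.

```python
import numpy as np


def residual_map(p_raw, p_aug):
    """Return P_aug(X |- Y) - P_raw(X |- Y) for every ordered pair (X, Y).

    Positive entries mean the entailment became stronger after augmentation,
    negative entries mean it became weaker.
    """
    p_raw = np.asarray(p_raw, dtype=float)
    p_aug = np.asarray(p_aug, dtype=float)
    if p_raw.ndim != 2 or p_raw.shape[0] != p_raw.shape[1]:
        raise ValueError("p_raw must be a square matrix")
    if p_raw.shape != p_aug.shape:
        raise ValueError("p_raw and p_aug must have the same shape")
    return p_aug - p_raw


# Toy values for illustration only (hypothetical, not results from the paper).
# Response 2 plays the role of a vague answer whose entailment
# probabilities are low before augmentation.
p_raw = np.array([[1.0, 0.8, 0.1],
                  [0.7, 1.0, 0.1],
                  [0.1, 0.2, 1.0]])
p_aug = np.array([[1.0, 0.8, 0.6],
                  [0.7, 1.0, 0.5],
                  [0.6, 0.7, 1.0]])
res = residual_map(p_raw, p_aug)
```

Because the matrices are directional, `res[x, y]` and `res[y, x]` are generally different, which is why the heatmap is not symmetric.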

RQ3: We conducted ablation experiments to understand the contributions of the directional entailment uncertainty measure and the claim-based augmentation. Due to the page limit, three representative baselines on two metrics are shown in Fig. 6. For the NumSet baseline, the basic uncertainty measurement is sensitive to the augmentation: the augmented version (purple bar) improves over the basic version (blue bar) on both AUROC and AUARC. Compared to the improvement brought by claim augmentation, however, the ‘Ours’ method (D-UE) makes a larger contribution to the overall evaluation performance. We attribute this to the directional entailment logic mined with the Random Walk Laplacian on the directed graph. We leave for future work the exploration of how to better combine the semantic uncertainty and the directional entailment-based uncertainty, and we believe there is still room for improvement along the proposed direction.

7 Conclusion

In this paper, we identified two challenges of existing uncertainty quantification methods for LLMs: the omission of directional logic in semantic meanings, and low-quality or vague response sets that make it difficult to uncover the actual correct answers. We proposed two solutions to tackle these challenges: (A) we formally define a directional entailment graph that encapsulates the directional logic, enhance it with text similarity, and propose applying the Random Walk Laplacian to find the eigenvalues that indicate the uncertainty in the response graph structure; (B) we propose a claim-based augmentation method that helps in understanding the ‘real’ faithfulness of a model’s responses. These two methods improve existing UQ methods and provide better insight into how trustworthy a model is. We hope this work raises other researchers’ interest in understanding the uncertainty of Large Language Models from another angle and in comprehending NLG trustworthiness.

8 Limitations

Even though this work proposes a directed graph and an augmentation method for LLM uncertainty quantification, the authors believe it is still important to explore further how to combine the semantic and directional logic uncertainty in a theoretically orthogonal way. This work was only evaluated on Llama2-13b and ChatGPT 3.5; given the fast growth of the Large Language Model family, more models could be tested to understand the uncertainty of their responses to specific questions.

References

Abdar et al. (2021) Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, et al. 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information fusion, 76:243–297.
Aizawa (2003) Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1):45–65.
Balloccu et al. (2024) Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms. arXiv preprint arXiv:2402.03927.
Biever (2023) Celeste Biever. 2023. Chatgpt broke the turing test-the race is on for new ways to assess ai. Nature, 619(7971):686–689.
Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
Chen and Mueller (2023) Jiuhai Chen and Jonas Mueller. 2023. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness.
Choi and Ferrara (2024) Eun Cheol Choi and Emilio Ferrara. 2024. Fact-gpt: Fact-checking augmentation via claim matching with llms. In Companion Proceedings of the ACM on Web Conference 2024, pages 883–886.
Da et al. (2024a) Longchao Da, Minquan Gao, Hao Mei, and Hua Wei. 2024a. Prompt to transfer: Sim-to-real transfer for traffic signal control with prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 82–90.
Da et al. (2024b) Longchao Da, Kuanru Liou, Tiejin Chen, Xuesong Zhou, Xiangyong Luo, Yezhou Yang, and Hua Wei. 2024b. Open-ti: Open traffic intelligence with augmented language model. International Journal of Machine Learning and Cybernetics, pages 1–26.
Dagan and Glickman (2004) Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. Learning Methods for Text Understanding and Mining, 2004(26-29):2–5.
Huang et al. (2024) Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, et al. 2024. A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artificial Intelligence Review, 57(7):175.
Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. Preprint, arXiv:1705.03551.
Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
Kambhampati et al. (2024) Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. Llms can’t plan, but can help planning in llm-modulo frameworks. arXiv preprint arXiv:2402.01817.
Kripke (1959) Saul A Kripke. 1959. Distinguished constituents, semantical analysis of modal logic, and the problem of entailment. The Journal of Symbolic Logic, 24(4):312–326.
Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.
Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
Lin et al. (2023) Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
Liu et al. (2023) Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374.
Mielke et al. (2020) Sabrina J Mielke, Arthur Szlam, Y-Lan Boureau, and Emily Dinan. 2020. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. arXiv preprint arXiv:2012.14983.
Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Soleimani et al. (2021) Amir Soleimani, Christof Monz, and Marcel Worring. 2021. Nlquad: A non-factoid long question answering data set. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1245–1255.
Sun et al. (2024) Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561.
Sun et al. (2019) Lin Sun, Xiaoyu Zhang, Yuhua Qian, Jiucheng Xu, and Shiguang Zhang. 2019. Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Information Sciences, 502:18–41.
Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975.
Valmeekam et al. (2022) Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498.
Wang et al. (2024a) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024a. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345.
Wang et al. (2024b) Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, et al. 2024b. Enhancing visual-language modality alignment in large vision language models via self-improvement. arXiv preprint arXiv:2405.15973.
Wang et al. (2023) Yubo Wang, Xueguang Ma, and Wenhu Chen. 2023. Augmenting black-box llms with medical textbooks for clinical question answering. arXiv preprint arXiv:2309.02233.
Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
Yang et al. (2023) Rui Yang, Ting Fang Tan, Wei Lu, Arun James Thirunavukarasu, Daniel Shu Wei Ting, and Nan Liu. 2023. Large language models in health care: Development, applications, and challenges. Health Care Science, 2(4):255–263.

Appendix A Solving for the Eigenvalues of the Random Walk Laplacian

By definition, the random walk Laplacian is

L_{\text{rw}}=I-D_{\text{out}}^{-1}A

Finding the eigenvalues $\lambda$ and eigenvectors $\mathbf{v}$ of $L_{\text{rw}}$ amounts to solving:

\begin{aligned}
L_{\text{rw}}\mathbf{v}&=\lambda\mathbf{v}\\
(I-D_{\text{out}}^{-1}A)\mathbf{v}&=\lambda\mathbf{v}\\
\mathbf{v}-D_{\text{out}}^{-1}A\mathbf{v}&=\lambda\mathbf{v}\\
\mathbf{v}&=\lambda\mathbf{v}+D_{\text{out}}^{-1}A\mathbf{v}\\
(1-\lambda)\mathbf{v}&=D_{\text{out}}^{-1}A\mathbf{v}\\
(1-\lambda)D_{\text{out}}\mathbf{v}&=A\mathbf{v}\\
(D_{\text{out}}-\lambda D_{\text{out}})\mathbf{v}&=A\mathbf{v}
\end{aligned}

Given this, the eigenvalue problem of $L_{\text{rw}}$ can be expressed in terms of the matrices $A$ and $D_{\text{out}}$ . The eigenvalues themselves are obtained from the standard characteristic equation:

\det(L_{\text{rw}}-\lambda I)=0

By applying the definition of $L_{\text{rw}}$ :

\det(I-D_{\text{out}}^{-1}A-\lambda I)=0

The eigenvalues $\lambda$ are then the roots of this equation.
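In practice the roots are usually found numerically rather than by expanding the determinant. The sketch below is one way this could be done with NumPy; it assumes that `adj[i, j]` holds the weight of the directed edge from response $i$ to response $j$, so that the out-degree is the row sum. The function names and the toy matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def random_walk_laplacian(adj):
    """Build L_rw = I - D_out^{-1} A for a weighted directed graph."""
    adj = np.asarray(adj, dtype=float)
    if adj.ndim != 2 or adj.shape[0] != adj.shape[1]:
        raise ValueError("adj must be a square matrix")
    out_deg = adj.sum(axis=1)  # diagonal of D_out (row sums)
    if np.any(out_deg <= 0):
        raise ValueError("every node needs a positive out-degree")
    # Dividing row i by out_deg[i] is the same as left-multiplying by D_out^{-1}.
    return np.eye(adj.shape[0]) - adj / out_deg[:, None]


def random_walk_laplacian_eigvals(adj):
    """Eigenvalues of L_rw; they can be complex because A is asymmetric."""
    return np.linalg.eigvals(random_walk_laplacian(adj))


# Toy asymmetric adjacency matrix (hypothetical values for illustration).
A = np.array([[0.0, 0.9, 0.2],
              [0.1, 0.0, 0.7],
              [0.3, 0.6, 0.0]])
L_rw = random_walk_laplacian(A)
eigvals = random_walk_laplacian_eigvals(A)
```

Since every row of $D_{\text{out}}^{-1}A$ sums to one, the all-ones vector is an eigenvector of $L_{\text{rw}}$ with eigenvalue 0, which gives a quick sanity check on the output.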

Appendix B The Evaluation Metric

The AUARC is calculated by plotting the accuracy of the accepted predictions against the rejection rate and then computing the area under this curve.

Let $s_{i}$ be the score of the $i$ -th prediction, $a_{i}$ the accuracy of the $i$ -th prediction (1 if correct, 0 if incorrect), and $n$ the total number of predictions. We first sort the scores and the corresponding accuracies: $\{(s_{i},a_{i})\}_{i=1}^{n}\rightarrow\{(s_{(i)},a_{(i)})\}_{i=1}^{n}$

For each $i$ from 0 to $n$ , the rejection rate $R_{i}$ is $R_{i}=\frac{i+1}{n}$ and the accuracy of the accepted predictions $A_{i}$ is $A_{i}=\frac{\sum_{j=i+1}^{n}\mathbb{1}(a_{(j)}\geq\alpha)}{n-(i+1)}$ , where $\mathbb{1}(\cdot)$ is the indicator function and $\alpha$ is the threshold. The area under the curve (AUARC) is calculated by the trapezoidal rule:

\begin{aligned}
\text{AUARC}&=\int_{0}^{1}A(R)\,dR\\
&\approx\sum_{i=0}^{n-1}\frac{A_{i}+A_{i+1}}{2}(R_{i+1}-R_{i})
\end{aligned}

where $A_{i}$ is the accuracy at the $i$ -th step and $R_{i}$ is the rejection rate at the $i$ -th step.
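The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the evaluation code used in the experiments: the score is treated as an uncertainty, so the most uncertain predictions are rejected first, and the rejection rates are taken as $i/n$ for $i=0,\dots,n-1$ so that at least one prediction is always accepted. The function name `auarc` is hypothetical.

```python
import numpy as np


def auarc(uncertainty, correct):
    """Area under the accuracy-rejection curve via the trapezoidal rule.

    uncertainty: one score per prediction (higher = less trusted).
    correct:     1 if the prediction is correct, 0 otherwise.
    """
    u = np.asarray(uncertainty, dtype=float)
    a = np.asarray(correct, dtype=float)
    if u.ndim != 1 or u.shape != a.shape or u.size == 0:
        raise ValueError("inputs must be non-empty 1-D arrays of equal length")
    n = u.size
    # Sort so that the most confident predictions come first.
    a_sorted = a[np.argsort(u, kind="stable")]
    # keep_acc[k-1] is the accuracy of the k most confident predictions.
    keep_acc = np.cumsum(a_sorted) / np.arange(1, n + 1)
    # acc[i] is the accuracy after rejecting the i most uncertain predictions.
    acc = keep_acc[::-1]
    rej = np.arange(n) / n  # rejection rates 0, 1/n, ..., (n-1)/n
    # Trapezoidal rule over consecutive points of the curve.
    return float(np.sum((acc[:-1] + acc[1:]) / 2.0 * np.diff(rej)))


# Toy check: uncertainty that ranks wrong answers last scores higher
# than uncertainty that ranks them first.
good = auarc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
bad = auarc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Under this convention the curve stops at rejection rate $(n-1)/n$, so the value for a model that is always correct is $(n-1)/n$ rather than exactly 1.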

Appendix C Evaluation and Groundtruth Correctness

Following Lin et al. (2023), responses with a correctness score $>$ 0.7 are taken as correct answers in our evaluation. Human verification was applied to the judgments auto-generated by GPT3.5-turbo, and their accuracy is about 0.95. With the ground-truth correctness obtained by auto-evaluation, we can evaluate a UQ method by either AUROC or AUARC to measure how well the uncertainty quantification aligns with correctness; the area under the curve (the larger, the better) can be seen as the quality of the UQ method.
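As a rough illustration of this pipeline, the sketch below thresholds the correctness scores at 0.7 and computes AUROC through its rank-based (Mann-Whitney) form. It assumes that a higher uncertainty should indicate an incorrect response, so incorrect responses are treated as the positive class; the function name and the toy values are hypothetical.

```python
import numpy as np


def auroc_from_uncertainty(uncertainty, scores, threshold=0.7):
    """AUROC of uncertainty as a detector of incorrect responses.

    A response counts as correct when its correctness score exceeds
    `threshold`. The result equals the probability that a randomly chosen
    incorrect response has higher uncertainty than a randomly chosen
    correct one, with ties counted as one half.
    """
    u = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(scores, dtype=float) > threshold
    if u.shape != correct.shape or u.ndim != 1:
        raise ValueError("inputs must be 1-D arrays of equal length")
    u_wrong, u_right = u[~correct], u[correct]
    if u_wrong.size == 0 or u_right.size == 0:
        raise ValueError("need at least one correct and one incorrect response")
    # Compare every (incorrect, correct) pair of uncertainty values.
    diff = u_wrong[:, None] - u_right[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy values for illustration only (hypothetical, not results from the paper).
scores = [0.9, 0.8, 0.3, 0.1]
aligned = auroc_from_uncertainty([0.1, 0.2, 0.8, 0.9], scores)
reversed_ = auroc_from_uncertainty([0.9, 0.8, 0.2, 0.1], scores)
uninformative = auroc_from_uncertainty([0.5, 0.5, 0.5, 0.5], scores)
```

An uncertainty measure that carries no information about correctness yields 0.5 under this definition, which is the usual chance-level reference for AUROC.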

Appendix D More Experiment Results

This section provides additional experimental results on other datasets and pair-wise comparisons.

D.1 The results for other datasets from RQ1

As shown in Table 3 and Table 4, D-UE consistently improves over most of the baseline methods. This indicates that our proposed solution is broadly applicable to both white-box and black-box uncertainty evaluations that previously relied on semantic information. Our method is transferable and can be applied to any existing method that lacks directional entailment.

Trivia (Llama2)					NLQUAD (Llama2)
	AUARC		AUROC			AUARC		AUROC
Baselines	Previous	Ours	Previous	Ours	Baselines	Previous	Ours	Previous	Ours
NumSet	0.7459	0.8362	0.8481	0.9422	NumSet	0.3525	0.4675	0.5561	0.9123
LexiSim	0.7819	0.8324	0.8174	0.9369	LexiSim	0.5359	0.5277	0.8872	0.9346
Eigv(Dis)	0.8363	0.8457	0.9423	0.9593	Eigv(Dis)	0.4336	0.4862	0.8322	0.9669
Ecc(Dis)	0.8185	0.8368	0.9160	0.9441	Ecc(Dis)	0.3918	0.4521	0.6640	0.9410
Degree(Dis)	0.8456	0.8491	0.9614	0.9663	Degree(Dis)	0.4527	0.4970	0.8518	0.9679
Eigv(Agre)	0.8452	0.8454	0.9606	0.9589	Eigv(Agre)	0.4495	0.4995	0.9674	0.9842
Ecc(Agre)	0.8401	0.8435	0.9523	0.9556	Ecc(Agre)	0.4807	0.5117	0.9728	0.9844
Degree(Agre)	0.8516	0.8488	0.9727	0.9654	Degree(Agre)	0.4555	0.5001	0.9656	0.9816
Eigv(Jacc)	0.8326	0.8422	0.9390	0.9537	Eigv(Jacc)	0.5268	0.5180	0.9490	0.9643
Ecc(Jacc)	0.8303	0.8399	0.9325	0.9487	Ecc(Jacc)	0.4569	0.5082	0.9513	0.9784
Degree(Jacc)	0.8430	0.8455	0.9531	0.9583	Degree(Jacc)	0.5371	0.5247	0.9973	0.9835

Table 3: Performance comparison of different baselines on TriviaQA and NLQUAD datasets using Llama2.

D.2 The results for pair-wise comparison on Coqa dataset

This subsection provides more detailed pairwise comparisons between D-UE and traditional semantic uncertainty on the AUROC and AUARC metrics; see Fig. 7 and Fig. 8.

Coqa (GPT3.5)
	AUARC		AUROC
Baselines	Previous	Ours	Previous	Ours
NumSet	0.4250	0.5605	0.5095	0.6660
LexiSim	0.5001	0.5467	0.6042	0.6471
Eigv(Dis)	0.5271	0.5574	0.6652	0.6733
Ecc(Dis)	0.4837	0.5603	0.5736	0.6675
Degree(Dis)	0.5320	0.5579	0.6654	0.6736
Eigv(Agre)	0.5355	0.5626	0.6769	0.6807
Ecc(Agre)	0.5295	0.5620	0.6669	0.6766
Degree(Agre)	0.5367	0.5615	0.6764	0.6800
Eigv(Jacc)	0.5179	0.5579	0.6463	0.6692
Ecc(Jacc)	0.5173	0.5550	0.6544	0.6694
Degree(Jacc)	0.5252	0.5560	0.6535	0.6693

Table 4: Performance for Coqa under GPT3.5

Appendix E Prompt Design

In this section, we describe the prompt designs for the tasks that involve LLMs, to ensure the reproducibility of the work.

E.1 The Prompt used for Claim Extraction

Firstly, we define the instructions as below:

Then, the necessary constraints and format restrictions are applied (these may vary across LLM backbones; please adjust them based on empirical exploration):

Additionally, more examples are provided for few-shot learning at inference time:

Given the above prompt information, we can request task completion to obtain the claims from a response (as described in the preparation step before Eq. 11):

E.2 The Prompt Used for Response Evaluation

In this section, we provide the details of the prompt used for response evaluation, which yields the correctness scores of the model’s responses. These scores are used to judge whether the uncertainty evaluation result matches the corresponding correctness.

Then we also have the value range description and reactions:

Then we can apply few-shot learning examples to enhance the tool-LLM’s understanding of its task. We provide some guidance here, and readers can specify their own demonstrations by defining a variable few_shots that contains examples, each a triplet of elements (Question, Reference, Answer):
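For readers who want a starting point, one possible layout of the few_shots variable is sketched below. The placeholder strings, the `FewShot` alias, and the `render_few_shots` helper are hypothetical and only illustrate the (Question, Reference, Answer) triplet structure; they are not the prompt text used in this work.

```python
from typing import List, Tuple

# Each demonstration is a (Question, Reference, Answer) triplet.
FewShot = Tuple[str, str, str]

# Placeholders only: replace them with demonstrations for your own task.
few_shots: List[FewShot] = [
    ("<question 1>", "<reference answer 1>", "<model answer 1>"),
    ("<question 2>", "<reference answer 2>", "<model answer 2>"),
]


def render_few_shots(examples: List[FewShot]) -> str:
    """Format the triplets as plain-text blocks separated by blank lines."""
    blocks = []
    for question, reference, answer in examples:
        blocks.append(
            f"Question: {question}\n"
            f"Reference: {reference}\n"
            f"Answer: {answer}"
        )
    return "\n\n".join(blocks)


few_shot_text = render_few_shots(few_shots)
```

The rendered text can then be appended to the instruction and value-range segments of the evaluation prompt before the query itself.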

E.3 The Prompt Used for Response Claim Augmentation

This section introduces the prompt template that augments the claims extracted from a response. By reflecting on the question being asked, the LLM should complete a claim if any part is missing, or resolve the vagueness if any sentence is unclear.

Similarly, add the response constraint if your LLM backbone does not perform stably. After that, the agent can be asked to finish the task given the question and the claim to augment.

Please note that prompt performance may vary across LLMs. This prompt was tested on LLM3-8b; practitioners may tune the prompt segments when applying other backbones.