ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models

Jio Oh¹†, Soyeon Kim¹†, Junseok Seo¹, **dong Wang², Ruochen Xu², Xing Xie², Steven E. Whang¹‡

¹KAIST  ²Microsoft Research Asia

†Equal contribution. ‡Corresponding author ⟨[email protected]⟩.

Abstract

Large language models (LLMs) have achieved unprecedented performance in various applications, yet their evaluation remains a critical issue. Existing hallucination benchmarks are either static or lack adjustable complexity for thorough analysis. We contend that utilizing existing relational databases is a promising approach for constructing benchmarks due to their accurate knowledge description via functional dependencies. We propose ERBench to automatically convert any relational database into a benchmark based on the entity-relationship (ER) model. Our key idea is to construct questions using the database schema, records, and functional dependencies such that they can be automatically verified. In addition, we use foreign key constraints to join relations and construct multi-hop questions, which can be arbitrarily complex and used to debug the intermediate answers of LLMs. Finally, ERBench supports continuous evaluation, multimodal questions, and various prompt engineering techniques. In our experiments, we construct an LLM benchmark using databases of multiple domains and make an extensive comparison of contemporary LLMs. We observe that better LLMs like GPT-4 can handle a larger variety of question types, but are by no means perfect. Also, correct answers do not necessarily imply correct rationales, which is an important evaluation that ERBench does better than other benchmarks for various question types. Code is available at https://github.com/DILAB-KAIST/ERBench.

1 Introduction

Large Language Models (LLMs) (OpenAI, 2023; Team and Anil, 2023) have become prevalent in a wide range of applications, including natural language processing, chatbots, content generation, and information retrieval, to name a few. However, a fundamental issue of LLMs is hallucination (Zhang et al., 2023a; Huang et al., 2023a; Ye et al., 2023; Rawte et al., 2023), which refers to the phenomenon that LLMs generate fake, unverified, or non-existent information, especially for knowledge-related and safety-critical applications. Hallucination remains one of the most severe issues that should be addressed.

To address the issue of hallucination, it is necessary to develop benchmarks that are comprehensive, intricate, automatically verifiable, and efficiently scalable. One approach is to have human annotators construct benchmarks manually (Lin et al., 2022; Srivastava et al., 2022; Suzgun et al., 2023; Li et al., 2023), which is expensive and not scalable. Another approach is to construct evaluation samples automatically, either by converting knowledge base triples to simple factual questions (Sun et al., 2023) or by using existing QA datasets (Yu et al., 2023a). Although these benchmarks may scale as the answers can be verified automatically and updated with more knowledge, their questions are still simplistic or unmodifiable, and thus cannot evaluate LLMs on intricate tasks.

Figure 1: The ERBench workflow. Starting from an ER diagram, database schema, functional dependencies, and foreign key constraints, we construct questions and evaluate the answers and rationales of LLMs.

We contend that utilizing existing relational databases is a promising approach to construct a benchmark that has both merits: it scales and supports intricate questions. Until now, many LLM benchmarks have been constructed based on knowledge bases. Although these benchmarks can scale, the main limitation is that the questions tend to be simplistic as they are based on triples. In comparison, relational databases contain structured data with schema information and follow the entity-relationship (ER) model, which supports various integrity constraints that make sure the data is well formed. A schema can be designed using traditional ER diagrams or more recent notations like UML. By using a database’s schema, records, and integrity constraints, it is possible to construct arbitrarily long multi-hop questions based on multiple relations that also have clear, automatically verifiable answers based on the integrity constraints. Using databases thus opens up opportunities to extensively evaluate LLMs on a vast amount of knowledge in a principled fashion.

In this paper, we propose ERBench, an LLM benchmark based on the ER model. ERBench supports complex questions that are automatically verifiable (see Figure 1). The questions can be automatically constructed using ER diagrams. For example, if a movie’s length is determined by its title and year, and the ER diagram shows the entity movie with three attributes title, year, and length, then one can ask an LLM: Does the movie titled “Star Wars” produced in 1977 run for more than an hour? We use two popular integrity constraints, functional dependencies (FDs) and foreign key constraints, to make the questions verifiable. FDs are used to infer an attribute’s value based on other attribute values. In our example, if the FD title, year $\rightarrow$ director, length holds for movies, then the director and length of Star Wars are determined (George Lucas and 121 minutes, respectively). We can construct both binary and multiple-choice questions asking for the inferred values. A foreign key is a set of attributes in one relation that refers to the primary key of another relation, and a foreign key constraint (FK) ensures that the foreign key values actually exist in the other relation. Using FKs, ERBench can support questions with increasing complexity by generating multi-hop questions via joining multiple relations that have foreign key constraints and inferring longer FDs that span them.

ERBench is also extensible in terms of data, modality, and prompting. First, ERBench can be easily updated as its underlying database changes and thus support continuous evaluation. Second, ERBench supports multimodal questions where one can replace attribute text values with other data types like images. Third, we can further diversify the questions using recent techniques like chain-of-thought, few-shot QA, and knowledge augmentation.

We conducted extensive experiments using $5$ public databases and evaluated several popular LLMs, namely, GPT-3.5 (OpenAI, 2022), GPT-4 (OpenAI, 2023), Llama2-70B-Chat (Touvron et al., 2023), Gemini-Pro (Team and Anil, 2023), and Mistral-7B-Instruct (Jiang et al., 2023). We perform comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single, multi-hop, and multimodal questions and also perform prompt engineering and fine-tuning. Overall, GPT-4 performs the best, but all LLMs can be improved significantly. More importantly, ERBench is effective in comprehensively showing which aspects need improvement.

Summary of Contributions. (1) We propose ERBench, an LLM benchmark that utilizes relational databases to construct complex questions where the model reasoning can be automatically verified. (2) We show how any database can be converted to a benchmark using its schema, records, and integrity constraints. (3) We extensively evaluate contemporary LLMs using ERBench and demonstrate how ERBench itself can be extended and remain relevant over time.

2 Preliminaries

2.1 LLM Hallucination

We would like to evaluate LLMs in terms of their hallucination levels. We define hallucination as generated content that is nonsensical or unfaithful to the provided source content Huang et al. (2023b). Hallucination may occur because the data sources are flawed in various ways or the data is utilized imperfectly and cannot be recalled properly Huang et al. (2023b). In any case, we are interested in whether an LLM not only gives correct answers, but also has the correct thought process and thus consider hallucination on two levels: (1) The LLM gives a wrong answer. For example, if the question asks whether Firenze and Florence are the same city, the answer is Yes (Firenze is the Italian name of Florence), and No is considered as hallucination. (2) The LLM gives a correct answer, but with a wrong rationale. In the above example, if the answer is Yes, but the reasoning is that both cities are in the United States, then this is considered as hallucination as well. We note that recent hallucination benchmarks like Head-to-Tail Sun et al. (2023) are good at evaluating (1), but are not designed to evaluate (2). Our key idea is to utilize relational databases to generate questions that can be used to evaluate both (1) and (2).

Table 1: Sample questions constructed with FDs.

#	Question
$q_{1}$	For the movie with the title $\langle$ Harry Potter and the Philosopher’s Stone $\rangle$ produced in $\langle$ 2001 $\rangle$ , is the length larger than $\langle$ 100 $\rangle$ minutes?
$q_{2}$	Is there a movie, released in $\langle$ 2001 $\rangle$ , starring $\langle$ Emma Watson $\rangle$ where $\langle$ Chris Columbus $\rangle$ is the director?
$q_{3}$	Was the director who directed the movie $\langle$ Harry Potter and the Philosopher’s Stone $\rangle$ that was released in $\langle$ 2001 $\rangle$ born in the $\langle$ 1950s $\rangle$ ?

2.2 Relational Databases

We utilize relational databases, which are based on the ER model. A relation consists of a schema containing attributes and records containing the attribute values. The relation may also have integrity constraints including functional dependencies and foreign key constraints. When designing a schema, a typical approach is to start with an intuitive ER diagram or UML to determine which entities and relationships are needed and how they are connected with each other. These are well known notions in databases, which we would like to utilize for question generation.

An ER diagram represents relationships among entities in a database. For example, the movie ER diagram in Fig. 1 has three entity sets Movie, Star, and Director. In addition, there could be relationships, e.g., Stars-in between Movie and Star and Directed-by between Movie and Director. The resulting schema of the database could then be Movie(title, year, director, length), StarsIn(name, age), and Director(name, birth year). ER diagrams can be used to pose natural language questions.

A functional dependency (FD) is a relationship between two sets of attributes $X$ and $Y$ where the $X$ values determine the $Y$ values. For example, Fig. 1 shows the FD title, year $\rightarrow$ director, length because a movie’s title and year can be used to identify the movie, which in turn determines its director and length. Likewise, there is another FD name $\rightarrow$ birth year where the director name determines his or her birth year. We denote an FD as $X\rightarrow Y$ and use it to evaluate the LLM’s reasoning as we explain later.
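As a minimal sketch of the definition above (not the authors' code), an FD $X\rightarrow Y$ holds over a set of records when no two records agree on the $X$ attributes but differ on the $Y$ attributes. The function name `fd_holds` and the dictionary-based record layout are assumptions made for illustration; the two records use the values quoted in the text.

```python
from typing import Iterable, Mapping, Sequence


def fd_holds(records: Iterable[Mapping[str, object]],
             lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """Return True if the FD lhs -> rhs holds over the given records."""
    seen = {}  # maps each X value to the Y value first observed with it
    for rec in records:
        x = tuple(rec[a] for a in lhs)
        y = tuple(rec[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X values but different Y values: FD violated
    return True


# Toy Movie relation (hypothetical layout, values taken from the text).
movies = [
    {"title": "Star Wars", "year": 1977,
     "director": "George Lucas", "length": 121},
    {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
     "director": "Chris Columbus", "length": 152},
]

print(fd_holds(movies, ["title", "year"], ["director", "length"]))
```

A check of this kind only confirms that the stored records are consistent with the FD; whether the FD should hold in general remains a design decision of the database owner.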

A foreign key is a set of attributes in one relation that refers to the key attributes in another relation, which can be used to identify records, and a foreign key constraint (FK) ensures that the foreign key values actually exist in the other relation. For example, Fig. 1 shows that the director attribute of Movie is a foreign key to the name attribute of Director. Using a foreign key, we can also join two relations and construct FDs that span them. In Fig. 1, the FD title, year $\rightarrow$ birth year is a result of joining the Movie and Director relations and combining the first two Movie and Director FDs. This multi-relation FD construction enables us to construct questions with arbitrary complexity as we explain later.

The integrity constraints thus determine the correctness of records, and our idea is to utilize them to verify the LLM responses as well. A natural question to ask is whether the integrity constraints themselves are always correct. Since the integrity constraints are determined by the database owner, it is the owner’s responsibility to determine if they should hold in general. In addition, we are not aiming to ensure that LLMs get perfect scores in their evaluations, but to make a relative comparison among LLMs.

3 ERBench

We first explain how we utilize functional dependencies to construct questions (Sec. 3.1). We then design two types of questions: binary and multiple-choice questions (Sec. 3.2), and discuss how to verify the questions automatically (Sec. 3.3). We also show how to increase the complexity of questions by joining relations with foreign key constraints together and expanding the functional dependencies to construct multi-hop questions (Sec. 3.4). Finally, we explain how to extend ERBench with diverse data types, modalities, and prompting techniques (Sec. 3.5).

3.1 Question Construction using FDs

We first explain how to utilize an FD to construct a question. Given a relation $R$ and an FD $X\rightarrow Y$ , we can specify the $R.X$ values and ask a question involving the $R.Y$ values, e.g., $q_{1}$ in Table 1. One can also use an LLM to generate a question from the ER diagram where the attribute values are left as placeholders. In order to fill in the question templates with values, we obtain records by either accessing existing databases or constructing ones ourselves. For example, we can either use a public movie database like IMDb or crawl the IMDb website. We then use the ER diagram to generate a question that involves the attributes of the database.
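The template-filling step can be illustrated with a short sketch. It assumes a hypothetical template for the FD title, year $\rightarrow$ length, modeled on $q_{1}$ in Table 1; the names `fill_template`, `TEMPLATE`, and `threshold` are illustrative and not part of ERBench.

```python
def fill_template(template: str, values: dict) -> str:
    """Substitute attribute values into a question template.

    Keys in `values` that the template does not use are ignored.
    """
    return template.format(**values)


# Hypothetical template for the FD  title, year -> length.
TEMPLATE = ('For the movie with the title "{title}" produced in {year}, '
            "is the length larger than {threshold} minutes?")

record = {"title": "Harry Potter and the Philosopher's Stone",
          "year": 2001, "length": 152}
threshold = 100

question = fill_template(TEMPLATE, {**record, "threshold": threshold})
# The expected answer follows from the Y value (length) stored in the record.
expected_answer = "Yes" if record["length"] > threshold else "No"

print(question)
print(expected_answer)
```

Because the $Y$ value never appears in the question itself, it can later serve as the inferred value to look for in the LLM's rationale.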

3.2 Binary and Multiple-choice QA

We construct two types of questions in a binary and multiple-choice answer format for a comprehensive evaluation.

For binary QA, we generate questions whose answers are within $\{{\it Yes},{\it No},{\it Unsure}\}$ . The Unsure option is necessary as it can significantly reduce hallucination rates (Sun et al., 2023) when an LLM is not confident about its answers. Hence, similar to Head-to-Tail (Sun et al., 2023), we prompt the LLM with “Answer the following question in yes or no, and then explain why. Say unsure if you don’t know and then explain why.” Note that we ask for further explanation to proceed with the automatic reasoning verification step.

Example. We can generate the binary question $q_{1}$ in Table 1. Here, the answer is Yes, as the movie is 152 minutes.

We can naturally extend the binary QA to multiple-choice QA where we provide several correct options and one incorrect option, which the LLM should figure out. The correct options can be generated using any FD. The incorrect option can be generated by choosing one FD $X\rightarrow Y$ where we know the correct $Y$ value and choosing any other $Y$ value from the relation, which we know is incorrect. To evaluate whether the LLM is not just guessing, we can also add the option None of the above and make it the correct answer for a certain portion of questions.

Example. Continuing from above, suppose we now ask a multiple-choice question of Harry Potter and the Philosopher’s Stone. Some correct options include The director is Chris Columbus and The movie length is 152 minutes. For incorrect options, we can replace Chris Columbus with David Yates (a director for a different Harry Potter film) or 152 minutes with another number in the same relation.
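The option construction in this example could be sketched as follows. This is an illustrative assumption rather than the ERBench implementation: `build_mc_options` and `PHRASINGS` are hypothetical names, the statement templates mirror the two options quoted above, and the falsified attribute must be one of the keys of the phrasing dictionary.

```python
import random

# Hypothetical statement templates, one per attribute on the FD's right-hand side.
PHRASINGS = {
    "director": "The director is {}.",
    "length": "The movie length is {} minutes.",
}


def build_mc_options(record, relation, falsified_attr, phrasings, rng):
    """Return (options, index of the incorrect option) for one record.

    The option for `falsified_attr` borrows a different value of that
    attribute from another record in the same relation; every other option
    states the record's true value.
    """
    wrong_values = sorted(
        {r[falsified_attr] for r in relation} - {record[falsified_attr]}, key=str
    )
    if not wrong_values:
        raise ValueError(f"no alternative value available for {falsified_attr!r}")
    wrong = rng.choice(wrong_values)
    options = [
        (name == falsified_attr,
         template.format(wrong if name == falsified_attr else record[name]))
        for name, template in phrasings.items()
    ]
    rng.shuffle(options)
    incorrect_index = [is_wrong for is_wrong, _ in options].index(True)
    return [text for _, text in options], incorrect_index


movies = [
    {"title": "Star Wars", "year": 1977,
     "director": "George Lucas", "length": 121},
    {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
     "director": "Chris Columbus", "length": 152},
]

options, incorrect = build_mc_options(
    movies[1], movies, "director", PHRASINGS, random.Random(0)
)
print(options, incorrect)
```

The true value of the falsified attribute (here, Chris Columbus) is the value one would later look for in the rationale.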

3.3 Verifying LLM Responses

We verify LLM responses by automatically checking their answers and reasoning. Answer checking is straightforward: for binary questions, we look for Yes and No answers, and for multiple-choice questions, we check whether the model has successfully found the incorrect option among the correct options. When checking the reasoning, we look for the inferred values of the FD applied on the current record. For example, let us assume the FD star, director $\rightarrow$ title. If we ask the binary question (e.g., $q_{2}$ in Table 1), we not only look for the answer Yes, but also the movie title Harry Potter within the reasoning. Similarly, if we ask the multiple-choice question where the previous FD is used to generate the incorrect option (e.g., the option includes Star Wars for the false value), we look for the true value Harry Potter within the model reasoning. We can thus use FDs in both question types to check whether the LLM has the correct internal knowledge to answer the question.
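A minimal sketch of the two checks for binary questions follows, assuming the response is plain text that begins with the answer and continues with the explanation. The function names and the regular expression are illustrative assumptions, not the ERBench parser.

```python
import re


def check_answer(response: str, expected: str) -> bool:
    """True if the first yes/no/unsure token in the response equals `expected`."""
    match = re.search(r"\b(yes|no|unsure)\b", response, flags=re.IGNORECASE)
    return match is not None and match.group(1).lower() == expected.lower()


def check_rationale(response: str, inferred_value: str) -> bool:
    """True if the FD's inferred value is mentioned anywhere in the response."""
    return inferred_value.lower() in response.lower()


response = ("Yes. Emma Watson starred in Harry Potter and the Philosopher's "
            "Stone, which was directed by Chris Columbus.")
print(check_answer(response, "Yes"), check_rationale(response, "Harry Potter"))
```

Keeping the two checks separate makes it possible to count responses that give the right answer for the wrong reason.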

We may run into an entity resolution problem where the LLM mentions an entity that is essentially the same as the inferred one, but is written differently. In the above example, we may be looking for Harry Potter, but the LLM mentions Harry J. Potter, which contains the middle initial of Harry Potter. We perform entity resolution based on heuristics including conventional string matching. Another possible solution is to use an LLM itself for the matching. While ChatGPT has indeed been used to replace human judgement Sun et al. (2023), we also believe this may give an unfair advantage to GPT compared to other LLMs and choose not to use this method.
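The text does not spell out the matching heuristics, so the following is only one possible string-matching sketch. It accepts a mention if the entity's tokens appear in order within a short window of the response (tolerating one inserted token such as a middle initial) or if a window is similar enough at the character level. The function names, the `slack` parameter, and the 0.85 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def is_subsequence(needle, haystack):
    """True if all tokens of `needle` occur in `haystack` in the same order."""
    it = iter(haystack)
    return all(tok in it for tok in needle)


def mentions_entity(response, entity, threshold=0.85, slack=1):
    """Heuristically decide whether `response` mentions `entity`."""
    target = normalize(entity)
    tokens = normalize(response)
    if not target:
        return False
    for size in range(len(target), len(target) + slack + 1):
        for start in range(max(len(tokens) - size, 0) + 1):
            window = tokens[start:start + size]
            if is_subsequence(target, window):
                return True  # exact tokens, possibly with an inserted token
            ratio = SequenceMatcher(None, " ".join(window), " ".join(target)).ratio()
            if ratio >= threshold:
                return True  # close spelling variant
    return False


print(mentions_entity("The wizard Harry J. Potter attends Hogwarts.", "Harry Potter"))
```

A looser threshold or larger `slack` catches more variants but also risks matching unrelated entities, so such parameters would need tuning per dataset.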

Table 2: Functional dependencies (FD) and basic questions of the 5 datasets for binary QA. See Sec. A.2 for multiple-choice QA.

Dataset	FD	Basic Question
Movie	director, star, released year $\rightarrow$ movie title	Is there a movie, released in $\langle year\rangle$ , starring $\langle star\rangle$ where $\langle dir\rangle$ is the director?
Soccer	club, jersey number, nationality $\rightarrow$ player name (2019 year only)	Is there a soccer player from $\langle country\rangle$ who played for $\langle club\rangle$ with uniform number $\langle no\rangle$ in $\langle club\rangle$ in 2019?
Airport	latitude, longitude $\rightarrow$ airport name	Is there an airport located at latitude $\langle latitude\rangle$ and longitude $\langle longitude\rangle$ ?
Music	music title, released year $\rightarrow$ artist name	Is there an artist or group who sang a song titled $\langle title\rangle$ in $\langle year\rangle$ ?
Book	author, published date $\rightarrow$ book title	Is there a book written by $\langle author\rangle$ that was published in $\langle year\rangle$ ?

3.4 Multi-hop Question Construction using FKs

We can increase the complexity of a question by making it a multi-hop question (Talmor and Berant, 2018; Welbl et al., 2018), which can be used to test whether an LLM can think in multiple steps. We use the straightforward extension of joining multiple relations to implement the multiple hops. The relations must have foreign key relationships so that the FDs can span the relations. Given two relations $R(X,Y)$ and $S(Y,Z)$ where $R$ ’s key is $X$ , suppose there is a foreign key constraint from $R.Y$ to $S.Y$ . For any tuple $e$ representing an entity in $R$ , we can ask a question only using its $X$ and $Z$ values to see if the LLM knows the hidden $Y$ value.

Example. If the relations are movies and directors, we can ask the 2-hop question $q_{3}$ in Table 1. Here $e$ is the original Harry Potter and the Philosopher’s Stone movie, and the hidden $Y$ value is Chris Columbus. If the LLM knows Chris Columbus, then it should be able to confirm that his birth year is 1958, while giving Yes as an answer.

Notice that the verification of multi-hop questions is exactly the same as for single-hop questions. Thus, we can construct arbitrarily-complex, but automatically verifiable questions.
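The 2-hop construction can be sketched by joining the two relations on the foreign key and then asking about the $Z$ value while withholding the $Y$ value. The relation layouts and the function names `join_on_fk` and `two_hop_question` are assumptions made for illustration; the record values come from the example above.

```python
def join_on_fk(r_records, s_records, fk_attr, key_attr):
    """Join R with S, where R.fk_attr is a foreign key to the key S.key_attr."""
    s_index = {s[key_attr]: s for s in s_records}
    joined = []
    for r in r_records:
        s = s_index.get(r[fk_attr])
        if s is not None:  # always true when the FK constraint holds
            extra = {k: v for k, v in s.items() if k != key_attr}
            joined.append({**r, **extra})
    return joined


def two_hop_question(row, decade_start):
    """Ask about the director's birth decade without naming the director."""
    question = (f'Was the director who directed the movie "{row["title"]}" '
                f'that was released in {row["year"]} born in the {decade_start}s?')
    answer = "Yes" if decade_start <= row["birth_year"] < decade_start + 10 else "No"
    hidden_value = row["director"]  # the Y value to look for in the rationale
    return question, answer, hidden_value


movies = [{"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
           "director": "Chris Columbus", "length": 152}]
directors = [{"name": "Chris Columbus", "birth_year": 1958}]

rows = join_on_fk(movies, directors, "director", "name")
print(two_hop_question(rows[0], 1950))
```

Longer chains follow the same pattern: each additional join adds one hidden value that the rationale can be checked against.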

3.5 Extensions

We explain how ERBench is extensible in terms of data, modality, and prompting. In Sec. 4, we perform experiments on modality and prompting extensions.

Data. ERBench supports continuous evaluation in the sense that if the underlying databases change, ERBench can be updated automatically by simply reflecting the record updates to the questions as well. For any domain, we expect that new databases will continuously be introduced to evaluate LLMs over time, so being able to incrementally update the benchmark is crucial. Because databases are easy to update, ERBench avoids the limitations of static LLM benchmarks.

Modality. We can make ERBench multimodal by making the underlying database itself multimodal. For example, we can replace a text attribute in a relation with an image and then pose the same question to the LLM.

Prompt Engineering. There are many other prompt engineering techniques that can be used in ERBench, but are also orthogonal to the relative comparison of LLMs. We highlight the ones we use in our experiments: (1) Chain-of-thought Wei et al. (2022) is a step-by-step approach of prompting and is more likely to give an LLM better context for answering the question; (2) Few-shot prompting Brown et al. (2020) provides demonstrations to the LLM to give it more context before answering questions; and (3) Knowledge augmentation Mialon et al. (2023); Zhuang et al. (2023); Vu et al. (2023) utilizes a search engine to augment the generated answer with search results.

4 Experiments

We test the performances of LLMs on questions based on databases that are constructed from public data. We would like to evaluate LLMs based on whether their answers and rationales are both correct.

LLMs Compared. We compare Mistral-7B-Instruct (Jiang et al., 2023), Llama2-70B-Chat (Touvron et al., 2023), GPT-3.5 (OpenAI, 2022), Gemini-Pro (Team and Anil, 2023), and GPT-4 (OpenAI, 2023). For multimodal LLMs, we evaluate Gemini-Pro-Vision (Team and Anil, 2023). All these LLMs are accessible from Hugging Face or can be called through Microsoft Azure AI Studio APIs or Google Gemini API.

Datasets and Functional Dependencies. We perform experiments on $5$ datasets representing different domains and call them Movie Paul (2018), Soccer Leone (2019), Airport Borsetti (2020), Music Moura (2021), and Book Rustamov (2023) (see Sec. A.1 for details). We also use separate Director and Olympic relations that are joined with the Movie and Soccer relations for multi-hop questioning. All data are available on Kaggle or public Github repositories. Table 2 and Table 7 (in Sec. A.2) show the FDs for each dataset, which are used to verify the LLM responses for binary QA and multiple-choice QA, respectively.

Performance Measures. We utilize existing hallucination measures Sun et al. (2023) and newly introduce two measures involving rationale evaluation.

• Answer Accuracy (A) Sun et al. (2023): The portion of LLM answers that are correct (binary).

• Rationale Accuracy (R): The portion of LLM answers whose rationale contains the FD’s inferred value.

• Answer-Rationale Accuracy (AR): The portion of LLM answers that are not only correct (binary), but also contain the FD’s inferred value in the rationale.

• Hallucination Rate (H) Sun et al. (2023): The portion of LLM answers that are at least partially incorrect. Specifically, $\textbf{H}=1-\textbf{A}-\textbf{M}$ , where M denotes the percentage of LLM answers that admit they cannot answer the given question (i.e., missing rate). A lower H value is better.
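The four measures can be computed from per-question judgments as in the sketch below. The `Judged` record and its field names are hypothetical; the sketch also assumes that a response counted as missing is never counted as a correct answer, so that $\textbf{H}=1-\textbf{A}-\textbf{M}$ stays within $[0,1]$.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    answer_correct: bool     # the final answer matches the expected one
    rationale_correct: bool  # the rationale contains the FD's inferred value
    missing: bool            # the LLM admitted it cannot answer (Unsure)


def scores(results):
    """Compute A, R, AR, and H = 1 - A - M over a list of judged responses."""
    n = len(results)
    if n == 0:
        raise ValueError("no responses to score")
    a = sum(r.answer_correct for r in results) / n
    r_acc = sum(r.rationale_correct for r in results) / n
    ar = sum(r.answer_correct and r.rationale_correct for r in results) / n
    m = sum(r.missing for r in results) / n
    return {"A": a, "R": r_acc, "AR": ar, "H": 1 - a - m}


judged = [
    Judged(True, True, False),    # correct answer, correct rationale
    Judged(True, False, False),   # correct answer, wrong rationale
    Judged(False, False, True),   # model said it was unsure
    Judged(False, True, False),   # wrong answer despite mentioning the value
]
print(scores(judged))
```

In this toy input, the second response raises A but not AR, which is exactly the gap between answer accuracy and answer-rationale accuracy that the measures are meant to expose.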

Table 3: LLM performances using the binary (BN) basic and negated and multiple-choice (MC) questions on the 5 datasets. For each LLM, we specify their parameter count (UNK means unknown) and only evaluate with questions with known entities; see Sec. B.3 for more evaluation results. “n/a” means the LLM knows too few entities (less than 10) for the result to be meaningful; see Sec. B.2 for the number of entities known by each LLM. Lower H values are better, whereas higher A, R, and AR values are better.

		GPT-3.5 (175B)				GPT-4 (UNK)				Llama2 (70B)				Gemini-Pro (UNK)				Mistral (7B)
Dataset	Question Type	A	R	AR	H ( $\downarrow$ )	A	R	AR	H ( $\downarrow$ )	A	R	AR	H ( $\downarrow$ )	A	R	AR	H ( $\downarrow$ )	A	R	AR	H ( $\downarrow$ )
Movie	BN (basic)	.85	.81	.80	.15	.65	.81	.64	.35	.05	.53	.05	.95	.28	.66	.26	.40	.48	.54	.31	.52
	BN (negated)	.06	.11	.05	.94	.51	.76	.50	.47	1.0	.92	.92	.00	.00	.27	.00	1.0	1.0	.66	.66	.00
	MC	.97	.96	.96	.03	.97	.97	.97	.03	.93	.97	.91	.07	.92	.97	.92	.08	.72	.75	.64	.28
Soccer	BN (basic)	.35	.28	.24	.64	.47	.70	.41	.38	.02	.18	.02	.98	.01	.06	.00	.17	.54	.10	.06	.46
	BN (negated)	.00	.02	.00	1.0	.19	.49	.17	.04	1.0	.62	.62	.00	.00	.05	.00	.29	.99	.18	.18	.01
	MC	.83	.60	.55	.17	.91	.69	.69	.02	n/a	n/a	n/a	n/a	.89	.55	.51	.11	.44	.21	.19	.56
Airport	BN (basic)	.13	.01	.01	.67	.68	.23	.20	.32	.00	.00	.00	1.0	.00	.00	.00	.00	.71	.00	.00	.29
	BN (negated)	.00	.00	.00	.97	.11	.16	.03	.80	1.0	.02	.02	.00	.00	.00	.00	.42	1.0	.00	.00	.00
	MC	.71	.44	.44	.29	.96	.90	.90	.04	n/a	n/a	n/a	n/a	.76	.33	.33	.24	.13	.09	.05	.87
Music	BN (basic)	.58	.36	.34	.27	.83	.74	.68	.15	.76	.31	.29	.24	.51	.14	.11	.02	.41	.03	.03	.54
	BN (negated)	.20	.18	.14	.76	.63	.56	.54	.01	1.0	.29	.29	.00	.02	.05	.01	.10	.96	.04	.04	.00
	MC	.91	.68	.66	.09	.97	.87	.86	.01	.91	.71	.67	.09	.89	.55	.54	.11	.56	.37	.14	.44
Book	BN (basic)	.77	.13	.12	.05	.41	.21	.19	.08	.02	.02	.00	.98	.00	.00	.00	.01	.16	.01	.00	.83
	BN (negated)	.01	.01	.00	.94	.02	.02	.01	.01	1.0	.05	.05	.00	.00	.00	.00	.01	.98	.01	.01	.02
	MC	.55	.15	.12	.45	.48	.34	.34	.01	n/a	n/a	n/a	n/a	.47	.13	.11	.53	.40	.04	.03	.60

Measuring Performance based on Internal Knowledge. When measuring LLM performances, we also take into account their internal knowledge of the entities within the questions. That is, if an LLM is hallucinating on a question because it simply does not know the entity mentioned (e.g., the entity may not exist in its training data) we may want to skip that question for evaluation. We can assess an LLM’s knowledge by directly prompting if it knows an entity (e.g., Do you know about the movie “Harry Potter”?; see more details in Sec. B.1). For our datasets, the numbers of known entities per LLM are shown in Sec. B.2. For a fair evaluation, we perform three types of evaluations: (1) each LLM is evaluated only with questions with entities that it has knowledge of; (2) all LLMs are evaluated with questions that all the LLMs have knowledge of; and (3) all LLMs are evaluated with all questions, regardless of their knowledge. Due to space constraints, we only present results using (1), and the (2) and (3) results can be found in Sec. B.3.
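The three evaluation regimes amount to simple set operations over the entities each LLM reports knowing. The sketch below uses hypothetical LLM and entity identifiers, and the function name `question_sets` is an assumption for illustration.

```python
def question_sets(all_entities, known_by_llm):
    """Return the entities used to evaluate each LLM under regimes (1)-(3).

    "own":    regime (1), entities this LLM knows
    "common": regime (2), entities every LLM knows
    "all":    regime (3), every entity regardless of knowledge
    """
    universe = set(all_entities)
    common = set.intersection(universe, *map(set, known_by_llm.values()))
    return {
        llm: {"own": set(known) & universe, "common": common, "all": universe}
        for llm, known in known_by_llm.items()
    }


known = {"llm_a": {"e1", "e2", "e3"}, "llm_b": {"e2", "e3", "e4"}}
regimes = question_sets(["e1", "e2", "e3", "e4", "e5"], known)
print(sorted(regimes["llm_a"]["common"]))
```

Regime (1) gives each model its own question set, so scores across models are computed over different questions, whereas regimes (2) and (3) use identical questions for all models.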

4.1 Results for Single-hop Questions

We evaluate the LLM performances using single-hop binary and multiple-choice questions on the 5 datasets. We first explain how we construct the questions and then analyze the LLM performances.

Binary Questions. We use the basic questions in Table 2 and expect a Yes answer. In addition, we construct negated versions of these questions and expect a No answer. For example, we can negate the basic question for the Movie dataset as Is it true that there are no movies released in $\langle year\rangle$ , starring $\langle star\rangle$ where $\langle dir\rangle$ is the director?. We always negate a basic question instead of, say, changing one of its attribute values into an incorrect one so that we can still perform the FD-based verification. If the question cannot utilize FDs anymore, we can no longer just look for inferred values, but need to understand the entire LLM response to see if the rationale is correct. Table 3 shows that GPT-4 and GPT-3.5 tend to have superior performances, especially in terms of answer and rationale accuracies. Gemini-Pro has lower answer and rationale accuracies, but also low hallucination rates. All three LLMs have worse performances for the negated questions. Llama2 and Mistral, on the other hand, tend to give trivial No answers for most questions, which results in high performances on the negated questions, but low performances on the basic ones.
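The pairing of basic and negated questions can be sketched with the Movie templates quoted above. Both variants are built from the same record, so they share the inferred value used for rationale checking and differ only in the expected answer. The record layout and the function name `question_pair` are assumptions for illustration.

```python
BASIC = ("Is there a movie, released in {year}, starring {star} "
         "where {dir} is the director?")
NEGATED = ("Is it true that there are no movies released in {year}, "
           "starring {star} where {dir} is the director?")


def question_pair(record):
    """Return (question, expected answer, inferred value) for both variants."""
    slots = {"year": record["year"], "star": record["star"],
             "dir": record["director"]}
    return [
        (BASIC.format(**slots), "Yes", record["title"]),
        (NEGATED.format(**slots), "No", record["title"]),
    ]


record = {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
          "star": "Emma Watson", "director": "Chris Columbus"}
for question, expected, inferred in question_pair(record):
    print(expected, "|", question)
```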

Table 4: LLM performances using the binary basic and negated multi-hop questions with and without CoT prompting on 2 datasets. We exclude Mistral as it knows too few entities (less than 10), resulting in mostly “n/a” values; see Sec. B.2 for the number of entities known by each LLM.

		GPT-3.5 (175B)				GPT-4 (UNK)				Llama2 (70B)				Gemini-Pro (UNK)
Dataset	Question Type	A	R-ext	AR-ext	H ( $\downarrow$ )	A	R-ext	AR-ext	H ( $\downarrow$ )	A	R-ext	AR-ext	H ( $\downarrow$ )	A	R-ext	AR-ext	H ( $\downarrow$ )
Movie w/ Director	BN (basic)	.74	.92	.73/.70	.21	.59	.98	.59/.59	.41	.02	.95	.02/.02	.98	.19	.42	.19/.19	.02
Movie w/ Director	BN (negated)	.00	.92	.00/.00	.96	.40	.98	.40/.39	.60	1.0	.95	.98/.92	.00	.01	.60	.01/.01	.33
	BN (basic) + CoT	.36	.95	.35/.34	.62	.80	.98	.80/.79	.20	.79	.97	.78/.77	.21	.31	.40	.31/.30	.00
	BN (negated) + CoT	.54	.79	.53/.50	.27	.66	.98	.66/.65	.34	.06	.95	.06/.05	.94	.02	.28	.02/.01	.20
Soccer w/ Olympic	BN (basic)	.81	.08	.00/.05/.03	.17	.46	.54	.28/.42/.41	.25	.88	.22	.00/.00/.00	.12	.00	.01	.00/.00/.00	.00
Soccer w/ Olympic	BN (negated)	.82	.13	.00/.00/.00	.18	.55	.56	.15/.19/.19	.13	.12	.27	.25/.55/.44	.88	.00	.06	.00/.00/.00	.00
	BN (basic) + CoT	.59	.60	.36/.54/.54	.33	.79	.73	.40/.55/.55	.20	.84	.41	.09/.14/.13	.15	.18	.54	.57/.73/.75	.60
	BN (negated) + CoT	.76	.55	.19/.31/.31	.21	.70	.73	.55/.73/.73	.29	.33	.39	.32/.62/.62	.65	.41	.42	.09/.09/.09	.15

Multiple-choice Questions. For the multiple-choice questions, we generate 2–4 choices using the attributes on the right-hand side of the functional dependencies, with one incorrect choice using the construction in Sec. 3.2. For each question, we generate three versions with rephrased choices (e.g., “born in US” can be rephrased to “birthplace is US” and “nationality is US”) and report the averaged LLM performance to account for prompt sensitivity (see Sec. A.2 for more details). Table 3 shows that most LLMs demonstrate improved performances in multiple-choice questions compared to binary questions. GPT-4 and Gemini-Pro are comparable in answer and rationale accuracies, while GPT-4 has better hallucination rates. Llama2 and Mistral show good performances for the Movie and Music datasets. We also extend the multiple-choice questions with a None of the above option and evaluate with GPT-3.5 and GPT-4. Fig. 2 and Fig. 3 (in Sec. B.4) show that the LLM performances tend to decrease when the proportion of this option increases among the questions.

Instead of trying to rank LLMs by performance, we would like to make observations from a benchmark perspective: (1) rationale accuracy tends to be worse than answer accuracy and (2) the LLM performances vary significantly across question types, even when using the same entities. We thus conclude that there is much room for improvement for LLM rationale and that LLM benchmarking should always involve diverse questions for a comprehensive analysis.