License: arXiv.org perpetual non-exclusive license
arXiv:2403.05266v1 [cs.CL] 08 Mar 2024

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models

Jio Oh¹ (equal contribution), Soyeon Kim¹ (equal contribution), Junseok Seo¹, **dong Wang², Ruochen Xu², Xing Xie², Steven E. Whang¹ (corresponding author, ⟨[email protected]⟩)
¹KAIST   ²Microsoft Research Asia
Abstract

Large language models (LLMs) have achieved unprecedented performance in various applications, yet their evaluation remains a critical issue. Existing hallucination benchmarks are either static or lack adjustable complexity for thorough analysis. We contend that utilizing existing relational databases is a promising approach for constructing benchmarks due to their accurate knowledge description via functional dependencies. We propose ERBench to automatically convert any relational database into a benchmark based on the entity-relationship (ER) model. Our key idea is to construct questions using the database schema, records, and functional dependencies such that they can be automatically verified. In addition, we use foreign key constraints to join relations and construct multi-hop questions, which can be arbitrarily complex and used to debug the intermediate answers of LLMs. Finally, ERBench supports continuous evaluation, multimodal questions, and various prompt engineering techniques. In our experiments, we construct an LLM benchmark using databases of multiple domains and make an extensive comparison of contemporary LLMs. We observe that better LLMs like GPT-4 can handle a larger variety of question types, but are by no means perfect. Also, correct answers do not necessarily imply correct rationales, which is an important evaluation that ERBench does better than other benchmarks for various question types. Code is available at https://github.com/DILAB-KAIST/ERBench.

1 Introduction

Large Language Models (LLMs) (OpenAI, 2023; Team and Anil, 2023) have become prevalent and are increasingly popular in a wide range of applications, including natural language processing, chatbots, content generation, and information retrieval, to name a few. However, a fundamental issue of LLMs is hallucination (Zhang et al., 2023a; Huang et al., 2023a; Ye et al., 2023; Rawte et al., 2023), which refers to the phenomenon that LLMs generate fake, unverified, or non-existent information, especially for knowledge-related and safety-critical applications. Hallucination remains one of the most severe issues that should be addressed.

To address the issue of hallucination, it is necessary to develop benchmarks that are comprehensive, intricate, automatically verifiable, and can be scaled efficiently. One approach is to construct manual benchmarks by human annotators Lin et al. (2022); Srivastava et al. (2022); Suzgun et al. (2023); Li et al. (2023), which are expensive and not scalable. Another approach is to automatically construct evaluation samples using knowledge bases Sun et al. (2023) by converting triples to simple factual questions or use existing QA datasets Yu et al. (2023a). Although these benchmarks may scale as the answers can be verified automatically and updated with more knowledge, their questions are still simplistic or unmodifiable, thus lacking the ability to evaluate on intricate tasks.

Figure 1: The ERBench workflow. Starting from an ER diagram, database schema, functional dependencies, and foreign key constraints, we construct questions and evaluate the answers and rationales of LLMs.

We contend that utilizing existing relational databases is a promising approach to construct a benchmark that has both merits. Until now, many LLM benchmarks have been constructed based on knowledge bases. Although these benchmarks can scale, the main limitation is that the questions tend to be simplistic as they are based on triples. In comparison, relational databases contain structured data where they have schema information and follow the entity-relationship (ER) model, which supports various integrity constraints that make sure the data is well formed. A schema can be designed using traditional ER diagrams or more recent notions like UMLs. By using a database’s schema, records, and integrity constraints, it is possible to construct arbitrarily-long multi-hop questions based on multiple relations that also have clear automatically-verifiable answers based on the integrity constraints. Using databases thus opens up opportunities to extensively evaluate LLMs on a vast amount of knowledge in a principled fashion.

In this paper, we propose ERBench, an LLM benchmark based on the ER model. ERBench supports complex questions that are automatically verifiable (see Figure 1). The questions can be automatically constructed using ER diagrams. For example, if a movie’s length is determined by its title and year, and the ER diagram shows the entity movie with three attributes title, year, and length, then one can ask an LLM: Does the movie titled “Star Wars” produced in 1977 run for more than an hour? We use two popular integrity constraints, functional dependencies (FDs) and foreign key constraints, to make the questions verifiable. FDs are used to infer an attribute’s value based on other attribute values. In our example, if the FD title, year → director, length holds for movies, then the director and length of Star Wars are determined (George Lucas and 121 minutes, respectively). We can construct both binary and multiple-choice questions asking for the inferred values. A foreign key is a set of attributes in one relation that refers to the primary key of another relation, and a foreign key constraint (FK) ensures that the foreign key values actually exist in the other relation. Using FKs, ERBench can support questions with increasing complexity by generating multi-hop questions via joining multiple relations that have foreign key constraints and inferring longer FDs that span them.

ERBench is also extensible in terms of data, modality, and prompting. First, ERBench can be easily updated as its underlying database changes and thus support continuous evaluation. Second, ERBench supports multimodal questions where one can replace attribute text values with other data types like images. Third, we can further diversify the questions using recent techniques like chain-of-thought, few-shot QA, and knowledge augmentation.

We conducted extensive experiments using 5 public databases and evaluated several popular LLMs, namely, GPT-3.5 (OpenAI, 2022), GPT-4 (OpenAI, 2023), Llama2-70B-Chat (Touvron et al., 2023), Gemini-Pro (Team and Anil, 2023), and Mistral-7B-Instruct (Jiang et al., 2023). We perform comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single-hop, multi-hop, and multimodal questions and also perform prompt engineering and fine-tuning. Overall, GPT-4 performs the best, but all LLMs can be improved significantly, and ERBench is effective in showing comprehensively which aspects need improvement.

Summary of Contributions. (1) We propose ERBench, an LLM benchmark that utilizes relational databases to construct complex questions where the model reasoning can be automatically verified. (2) We show how any database can be converted to a benchmark using its schema, records, and integrity constraints. (3) We extensively evaluate contemporary LLMs using ERBench and demonstrate how ERBench itself can be extended and remain relevant over time.

2 Preliminaries

2.1 LLM Hallucination

We would like to evaluate LLMs in terms of their hallucination levels. We define hallucination as generated content that is nonsensical or unfaithful to the provided source content Huang et al. (2023b). Hallucination may occur because the data sources are flawed in various ways or the data is utilized imperfectly and cannot be recalled properly Huang et al. (2023b). In any case, we are interested in whether an LLM not only gives correct answers, but also has the correct thought process and thus consider hallucination on two levels: (1) The LLM gives a wrong answer. For example, if the question asks whether Firenze and Florence are the same city, the answer is Yes (Firenze is the Italian name of Florence), and No is considered as hallucination. (2) The LLM gives a correct answer, but with a wrong rationale. In the above example, if the answer is Yes, but the reasoning is that both cities are in the United States, then this is considered as hallucination as well. We note that recent hallucination benchmarks like Head-to-Tail Sun et al. (2023) are good at evaluating (1), but are not designed to evaluate (2). Our key idea is to utilize relational databases to generate questions that can be used to evaluate both (1) and (2).

Table 1: Sample questions constructed with FDs.
  #      Question
  $q_1$  For the movie with the title ⟨Harry Potter and the Philosopher’s Stone⟩ produced in ⟨2001⟩, is the length larger than ⟨100⟩ minutes?
  $q_2$  Is there a movie, released in ⟨2001⟩, starring ⟨Emma Watson⟩ where ⟨Chris Columbus⟩ is the director?
  $q_3$  Was the director who directed the movie ⟨Harry Potter and the Philosopher’s Stone⟩ that was released in ⟨2001⟩ born in the ⟨1950s⟩?

2.2 Relational Databases

We utilize relational databases, which are based on the ER model. A relation consists of a schema containing attributes and records containing the attribute values. The relation may also have integrity constraints including functional dependencies and foreign key constraints. When designing a schema, a typical approach is to start with an intuitive ER diagram or UML to determine which entities and relationships are needed and how they are connected with each other. These are well known notions in databases, which we would like to utilize for question generation.

An ER diagram represents relationships among entities in a database. For example, the movie ER diagram in Fig. 1 has three entity sets Movie, Star, and Director. In addition, there could be relationships, e.g., Stars-in between Movie and Star and Directed-by between Movie and Director. The resulting schema of the database could then be Movie(title, year, director, length), StarsIn(name, age), and Director(name, birth year). ER diagrams can be used to pose natural language questions.

A functional dependency (FD) is a relationship between two sets of attributes $X$ and $Y$ where the $X$ values determine the $Y$ values. For example, Fig. 1 shows the FD title, year → director, length because a movie’s title and year can be used to identify the movie, which in turn determines its director and length. Likewise, there is another FD name → birth year where the director name determines his or her birth year. We denote an FD as $X \rightarrow Y$ and use it to evaluate the LLM’s reasoning as we explain later.
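
As a rough illustration of what it means for an FD to hold, the sketch below checks $X \rightarrow Y$ over a small in-memory relation. The function name `fd_holds` and the list-of-dicts representation are assumptions made for this example and are not part of ERBench; the two records reuse the movie facts quoted in the text.

```python
def fd_holds(records, lhs, rhs):
    """Return True if the FD lhs -> rhs holds over a list of dict records."""
    seen = {}  # maps each X value to the Y value first observed with it
    for rec in records:
        x = tuple(rec[a] for a in lhs)
        y = tuple(rec[a] for a in rhs)
        # setdefault stores y the first time x is seen and returns the stored value
        if seen.setdefault(x, y) != y:
            return False  # same X values but different Y values: FD violated
    return True


# Toy Movie relation (hypothetical storage format, facts taken from the text).
movies = [
    {"title": "Star Wars", "year": 1977,
     "director": "George Lucas", "length": 121},
    {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
     "director": "Chris Columbus", "length": 152},
]

print(fd_holds(movies, ["title", "year"], ["director", "length"]))
```

In a real database the same check would typically be expressed as a grouping query; the dictionary version is only meant to make the definition concrete.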

A foreign key constraint (FK) is a set of attributes in one relation that refers to the key attributes in another relation, which can be used to identify records. For example, Fig. 1 shows that the director attribute of Movie is a foreign key to the name attribute of Director. Using a foreign key, we can also join two relations and construct FDs that span them. In Fig. 1, the FD title, year → birth year is a result of joining the Movie and Director relations and combining the first two Movie and Director FDs. This multi-relation FD construction enables us to construct questions with arbitrary complexity as we explain later.
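
The join described above can be sketched in a few lines. This is a minimal sketch under assumed names: `fk_holds`, `join_on_fk`, the `birth_year` key, and the `director_` prefix are all hypothetical, and relations are again plain lists of dicts rather than real database tables.

```python
def fk_holds(child, fk_attr, parent, key_attr):
    """Check that every foreign key value in child exists as a key in parent."""
    parent_keys = {p[key_attr] for p in parent}
    return all(c[fk_attr] in parent_keys for c in child)


def join_on_fk(child, fk_attr, parent, key_attr, prefix):
    """Join child with parent along the foreign key, prefixing parent attributes."""
    index = {p[key_attr]: p for p in parent}
    joined = []
    for c in child:
        p = index[c[fk_attr]]  # raises KeyError if the FK is violated
        row = dict(c)
        for attr, value in p.items():
            if attr != key_attr:
                row[prefix + attr] = value
        joined.append(row)
    return joined


movies = [
    {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
     "director": "Chris Columbus", "length": 152},
]
directors = [{"name": "Chris Columbus", "birth_year": 1958}]

joined = join_on_fk(movies, "director", directors, "name", prefix="director_")
# The joined row now carries title, year -> director_birth_year.
print(joined[0]["title"], joined[0]["year"], joined[0]["director_birth_year"])
```

After the join, each row carries both the Movie key (title, year) and the director's birth year, which is exactly the spanning FD title, year → birth year.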

The integrity constraints thus determine the correctness of records, and our idea is to utilize them to verify the LLM responses as well. A natural question to ask is whether the integrity constraints themselves are always correct. Since the integrity constraints are determined by the database owner, it is the owner’s responsibility to determine if they should hold in general. In addition, we are not aiming to ensure that LLMs get perfect scores in their evaluations, but to make a relative comparison among LLMs.

3 ERBench

We first explain how we utilize functional dependencies to construct questions (Sec. 3.1). We then design two types of questions: binary and multiple-choice questions (Sec. 3.2), and discuss how to verify the questions automatically (Sec. 3.3). We also show how to increase the complexity of questions by joining relations with foreign key constraints together and expanding the functional dependencies to construct multi-hop questions (Sec. 3.4). Finally, we explain how to extend ERBench with diverse data types, modalities, and prompting techniques (Sec. 3.5).

3.1 Question Construction using FDs

We first explain how to utilize an FD to construct a question. Given a relation $R$ and an FD $X \rightarrow Y$, we can specify the $R.X$ values and ask a question involving the $R.Y$ values, e.g., $q_1$ in Table 1. One can also use an LLM to generate a question from the ER diagram where the attribute values are left as placeholders. In order to fill in the question templates with values, we obtain records by either accessing existing databases or constructing ones ourselves. For example, we can either use a public movie database like IMDb or crawl the IMDb website. We then use the ER diagram to generate a question that involves the attributes of the database.
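
To make the template-filling step concrete, here is a minimal sketch that instantiates the $q_1$ template from Table 1 for one record. The template string follows Table 1; the function name `build_binary_question` and the returned dictionary layout are assumptions for illustration only.

```python
Q1_TEMPLATE = ("For the movie with the title {title} produced in {year}, "
               "is the length larger than {threshold} minutes?")


def build_binary_question(record, threshold):
    """Fill the X values (title, year) into the template and derive the
    expected answer from the Y value (length) that the FD determines."""
    question = Q1_TEMPLATE.format(
        title=record["title"], year=record["year"], threshold=threshold)
    answer = "Yes" if record["length"] > threshold else "No"
    # The inferred value is kept so the rationale can be checked later.
    return {"question": question, "answer": answer,
            "inferred_value": record["length"]}


record = {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
          "director": "Chris Columbus", "length": 152}
q = build_binary_question(record, threshold=100)
print(q["question"])
print(q["answer"])
```

Because the answer is computed from the record rather than written by hand, regenerating the questions after a database update requires no manual annotation.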

3.2 Binary and Multiple-choice QA

We construct two types of questions in a binary and multiple-choice answer format for a comprehensive evaluation.

For binary QA, we generate questions whose answers are within $\{\textit{Yes}, \textit{No}, \textit{Unsure}\}$. The Unsure option is necessary as it can significantly reduce hallucination rates (Sun et al., 2023) when the LLM is not confident about its answers. Hence, similar to Head-to-Tail (Sun et al., 2023), we prompt the LLM with “Answer the following question in yes or no, and then explain why. Say unsure if you don’t know and then explain why.” Note that we ask for further explanation to proceed with the automatic reasoning verification step.

Example. We can generate the binary question $q_1$ in Table 1. Here, the answer is Yes, as the movie is 152 minutes.

We can naturally extend the binary QA to multiple-choice QA where we provide several correct options and one incorrect option, which the LLM should figure out. The correct options can be generated using any FD. The incorrect option can be generated by choosing one FD $X \rightarrow Y$ where we know the correct $Y$ value and choosing any other $Y$ value from the relation, which we know is incorrect. To evaluate whether the LLM is not just guessing, we can also add the option None of the above and make it the correct answer for a certain portion of questions.

Example. Continuing from above, suppose we now ask a multiple-choice question about Harry Potter and the Philosopher’s Stone. Some correct options include The director is Chris Columbus and The movie length is 152 minutes. For incorrect options, we can replace Chris Columbus with David Yates (a director for a different Harry Potter film) or 152 minutes with another number in the same relation.
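
The option construction above can be sketched as follows. This is an illustrative sketch, not ERBench's code: `build_mc_question`, the option templates, and the use of a seeded `random.Random` are assumptions, and the sketch assumes the relation contains at least one other value for the corrupted attribute.

```python
import random


def build_mc_question(record, relation, option_templates, rng):
    """Build options from the record's Y values and corrupt exactly one of
    them with a different value of the same attribute from the relation."""
    attrs = list(option_templates)
    wrong_attr = rng.choice(attrs)
    # Any other value of the same attribute is known to be incorrect here.
    other_values = sorted(
        {r[wrong_attr] for r in relation if r[wrong_attr] != record[wrong_attr]},
        key=str,
    )
    wrong_value = rng.choice(other_values)
    options = []
    for attr in attrs:
        is_correct = attr != wrong_attr
        value = record[attr] if is_correct else wrong_value
        options.append((option_templates[attr].format(value=value), is_correct))
    rng.shuffle(options)
    answer_index = next(i for i, (_, ok) in enumerate(options) if not ok)
    return {
        "options": [text for text, _ in options],
        "answer_index": answer_index,      # position of the incorrect option
        "true_value": record[wrong_attr],  # value to look for in the rationale
    }


movies = [
    {"title": "Star Wars", "year": 1977,
     "director": "George Lucas", "length": 121},
    {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
     "director": "Chris Columbus", "length": 152},
]
templates = {
    "director": "The director is {value}.",
    "length": "The movie length is {value} minutes.",
}
mc = build_mc_question(movies[1], movies, templates, random.Random(0))
for i, option in enumerate(mc["options"]):
    print(i, option)
```

A None of the above variant could be obtained by skipping the corruption step for a chosen portion of questions and appending that option as the expected answer.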

3.3 Verifying LLM Responses

We verify LLM responses by automatically checking their answers and reasoning. The answer checking is straightforward where for binary questions, we look for Yes and No answers, and for multiple-choice questions, we check whether the model has successfully found the incorrect option among the correct options. When checking the reasoning, we look for the inferred values of the FD applied on the current record. For example, let us assume the FD star, director → title. If we ask the binary question (e.g., $q_2$ in Table 1), we not only look for the answer Yes, but also the movie title Harry Potter within the reasoning. Similarly, if we ask the multiple-choice question where the previous FD is used to generate the incorrect option (e.g., the option includes Star Wars for the false value), we look for the true value Harry Potter within the model reasoning. We can thus use FDs in both question types to check whether the LLM has the correct internal knowledge to answer the question.

We may run into an entity resolution problem where the LLM mentions an entity that is essentially the same as the inferred one, but is written differently. In the above example, we may be looking for Harry Potter, but the LLM mentions Harry J. Potter, which contains the middle initial of Harry Potter. We perform entity resolution based on heuristics including conventional string matching. Another possible solution is to use an LLM itself for the matching. While ChatGPT has indeed been used to replace human judgement Sun et al. (2023), we also believe this may give an unfair advantage to GPT compared to other LLMs and choose not to use this method.
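
The sketch below shows one way the answer check and a string-matching entity resolution step could fit together. The paper does not specify its heuristics, so everything here is an assumption: the function names, the normalization, the one-extra-token window (to tolerate a middle initial), and the 0.85 similarity threshold are illustrative choices built on Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher


def extract_answer(response):
    """Return 'Yes', 'No', or 'Unsure' if the response starts with one."""
    m = re.match(r"\s*(yes|no|unsure)\b", response, flags=re.IGNORECASE)
    return m.group(1).capitalize() if m else None


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def mentions(response, entity, threshold=0.85):
    """Fuzzy check whether the response mentions the entity. Windows of the
    entity's token length, plus one extra token, are compared to the entity."""
    resp, ent = normalize(response), normalize(entity)
    if not ent:
        return False
    target = " ".join(ent)
    for width in (len(ent), len(ent) + 1):
        for i in range(len(resp) - width + 1):
            window = " ".join(resp[i:i + width])
            if SequenceMatcher(None, window, target).ratio() >= threshold:
                return True
    return False


def verify(response, expected_answer, inferred_value):
    """Check the answer and whether the rationale contains the inferred value."""
    return {
        "answer_correct": extract_answer(response) == expected_answer,
        "rationale_correct": mentions(response, str(inferred_value)),
    }


print(verify("Yes. The movie is Harry J. Potter, directed by Chris Columbus.",
             "Yes", "Harry Potter"))
```

A fixed similarity threshold trades false matches against missed ones, so in practice it would need to be tuned per attribute type (titles, names, numbers).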

Table 2: Functional dependencies (FD) and basic questions of the 5 datasets for binary QA. See Sec. A.2 for multiple-choice QA.
Dataset  FD                                                              Basic Question
Movie    director, star, released year → movie title                     Is there a movie, released in ⟨year⟩, starring ⟨star⟩ where ⟨dir⟩ is the director?
Soccer   club, jersey number, nationality → player name (2019 year only) Is there a soccer player from ⟨country⟩ who played for ⟨club⟩ with uniform number ⟨no⟩ in ⟨club⟩ in 2019?
Airport  latitude, longitude → airport name                              Is there an airport located at latitude ⟨latitude⟩ and longitude ⟨longitude⟩?
Music    music title, released year → artist name                        Is there an artist or group who sang a song titled ⟨title⟩ in ⟨year⟩?
Book     author, published date → book title                             Is there a book written by ⟨author⟩ that was published in ⟨year⟩?

3.4 Multi-hop Question Construction using FKs

We can increase the complexity of a question by making it a multi-hop question (Talmor and Berant, 2018; Welbl et al., 2018), which can be used to test whether an LLM can think in multiple steps. We use the straightforward extension of joining multiple relations to implement the multiple hops. The relations must have foreign key relationships so that the FDs can span the relations. Given two relations $R(X, Y)$ and $S(Y, Z)$ where $R$’s key is $X$, suppose there is a foreign key constraint from $R.Y$ to $S.Y$. For any tuple $e$ representing an entity in $R$, we can ask a question only using its $X$ and $Z$ values to see if the LLM knows the hidden $Y$ value.

Example. If the relations are movies and directors, we can ask the 2-hop question $q_3$ in Table 1. Here $e$ is the original Harry Potter and the Philosopher’s Stone movie, and the hidden $Y$ value is Chris Columbus. If the LLM knows Chris Columbus, then it should be able to confirm that his birth year is 1958, while giving Yes as an answer.

Notice that the verification of multi-hop questions is exactly the same as for single-hop questions. Thus, we can construct arbitrarily-complex, but automatically verifiable questions.
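
A 2-hop question such as $q_3$ can be generated by following the foreign key from the movie to its director. The sketch below is illustrative only: the template wording follows Table 1, while `build_two_hop_question`, the `birth_year` key, and the returned fields are hypothetical names.

```python
TWO_HOP_TEMPLATE = ("Was the director who directed the movie {title} that was "
                    "released in {year} born in the {decade}s?")


def build_two_hop_question(movie, directors, decade):
    """Use the X values (title, year) and a Z predicate (birth decade) while
    leaving the joining Y value (director name) unstated in the question."""
    by_name = {d["name"]: d for d in directors}
    director = by_name[movie["director"]]  # follow the foreign key
    born_in_decade = decade <= director["birth_year"] < decade + 10
    return {
        "question": TWO_HOP_TEMPLATE.format(
            title=movie["title"], year=movie["year"], decade=decade),
        "answer": "Yes" if born_in_decade else "No",
        "hidden_value": director["name"],          # expected in the rationale
        "inferred_value": director["birth_year"],  # also checkable
    }


movie = {"title": "Harry Potter and the Philosopher's Stone", "year": 2001,
         "director": "Chris Columbus", "length": 152}
directors = [{"name": "Chris Columbus", "birth_year": 1958}]

q3 = build_two_hop_question(movie, directors, decade=1950)
print(q3["question"])
print(q3["answer"], q3["hidden_value"])
```

Checking for the hidden value in the rationale is what allows the intermediate hop to be verified, not just the final Yes or No.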

3.5 Extensions

We explain how ERBench is extensible in terms of data, modality, and prompting. In Sec. 4, we perform experiments on modality and prompting extensions.

Data. ERBench supports continuous evaluation in the sense that if the underlying databases change, ERBench can be updated automatically by simply reflecting the record updates to the questions as well. For any domain, we expect that new databases will continuously be introduced to evaluate LLMs over time, so being able to incrementally update the benchmark is crucial. ERBench overcomes the limitations of LLM by incorporating continuous evaluation, leveraging the benefits of databases for easy updates.

Modality. We can make ERBench multimodal by making the underlying database itself multimodal. For example, we can replace a text attribute in a relation with an image and then pose the same question to the LLM.

Prompt Engineering. There are many other prompt engineering techniques that can be used in ERBench, but are also orthogonal to the relative comparison of LLMs. We highlight the ones we use in our experiments: (1) Chain-of-thought Wei et al. (2022) is a step-by-step approach of prompting and is more likely to give an LLM better context for answering the question; (2) Few-shot prompting Brown et al. (2020) provides demonstrations to the LLM to give it more context before answering questions; and (3) Knowledge augmentation Mialon et al. (2023); Zhuang et al. (2023); Vu et al. (2023) utilizes a search engine to augment the generated answer with search results.

4 Experiments

We test the performances of LLMs on questions based on databases that are constructed from public data. We would like to evaluate LLMs based on whether their answers and rationales are both correct.

LLMs Compared. We compare Mistral-7B-Instruct (Jiang et al., 2023), Llama2-70B-Chat (Touvron et al., 2023), GPT-3.5 (OpenAI, 2022), Gemini-Pro (Team and Anil, 2023), and GPT-4 (OpenAI, 2023). For multimodal LLMs, we evaluate Gemini-Pro-Vision (Team and Anil, 2023). All these LLMs are accessible from Hugging Face or can be called through Microsoft Azure AI Studio APIs or Google Gemini API.

Datasets and Functional Dependencies. We perform experiments on 5 datasets representing different domains and call them Movie (Paul, 2018), Soccer (Leone, 2019), Airport (Borsetti, 2020), Music (Moura, 2021), and Book (Rustamov, 2023) (see Sec. A.1 for details). We also use separate Director and Olympic relations that are joined with the Movie and Soccer relations for multi-hop questioning. All data are available on Kaggle or public GitHub repositories. Table 2 and Table 7 (in Sec. A.2) show the FDs for each dataset, which are used to verify the LLM responses for binary QA and multiple-choice QA, respectively.

Performance Measures. We utilize existing hallucination measures Sun et al. (2023) and newly introduce two measures involving rationale evaluation.

  • Answer Accuracy (A) (Sun et al., 2023): The portion of LLM answers that are correct (binary).

  • Rationale Accuracy (R): The portion of LLM answers whose rationale contains the FD’s inferred value.

  • Answer-Rationale Accuracy (AR): The portion of LLM answers that are not only correct (binary), but also contain the FD’s inferred value in the rationale.

  • Hallucination Rate (H) (Sun et al., 2023): The portion of LLM answers that are at least partially incorrect. Specifically, $\mathbf{H} = 1 - \mathbf{A} - \mathbf{M}$, where M denotes the percentage of LLM answers that admit they cannot answer the given question (i.e., missing rate). A lower H value is better.
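
The four measures above can be computed from per-question verification results as in the following sketch. The record layout (`answer`, `expected`, `rationale_ok`) and the function name `score` are assumptions for illustration; an answer of Unsure is counted toward the missing rate M.

```python
def score(results):
    """Compute A, R, AR, M, and H = 1 - A - M over a list of result dicts."""
    n = len(results)
    correct = [r["answer"] == r["expected"] for r in results]
    a = sum(correct) / n
    r_acc = sum(r["rationale_ok"] for r in results) / n
    ar = sum(ok and r["rationale_ok"] for ok, r in zip(correct, results)) / n
    m = sum(r["answer"] == "Unsure" for r in results) / n
    return {"A": a, "R": r_acc, "AR": ar, "M": m, "H": 1 - a - m}


# Four made-up results, purely to exercise the formulas.
results = [
    {"answer": "Yes", "expected": "Yes", "rationale_ok": True},
    {"answer": "Yes", "expected": "Yes", "rationale_ok": False},
    {"answer": "No", "expected": "Yes", "rationale_ok": False},
    {"answer": "Unsure", "expected": "Yes", "rationale_ok": False},
]
print(score(results))
```

With these made-up inputs, two of four answers are correct, one is Unsure, and one is wrong, so H counts only the single wrong answer.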

Table 3: LLM performances using the binary (BN) basic and negated and multiple-choice (MC) questions on the 5 datasets. For each LLM, we specify their parameter count (UNK means unknown) and only evaluate with questions with known entities; see Sec. B.3 for more evaluation results. “n/a” means the LLM knows too few entities (less than 10) for the result to be meaningful; see Sec. B.2 for the number of entities known by each LLM. Lower H values are better, whereas higher A, R, and AR values are better.
                        GPT-3.5 (175B)        GPT-4 (UNK)           Llama2 (70B)          Gemini-Pro (UNK)      Mistral (7B)
Dataset  Question Type  A    R    AR   H (↓)  A    R    AR   H (↓)  A    R    AR   H (↓)  A    R    AR   H (↓)  A    R    AR   H (↓)
Movie    BN (basic)     .85  .81  .80  .15    .65  .81  .64  .35    .05  .53  .05  .95    .28  .66  .26  .40    .48  .54  .31  .52
         BN (negated)   .06  .11  .05  .94    .51  .76  .50  .47    1.0  .92  .92  .00    .00  .27  .00  1.0    1.0  .66  .66  .00
         MC             .97  .96  .96  .03    .97  .97  .97  .03    .93  .97  .91  .07    .92  .97  .92  .08    .72  .75  .64  .28
Soccer   BN (basic)     .35  .28  .24  .64    .47  .70  .41  .38    .02  .18  .02  .98    .01  .06  .00  .17    .54  .10  .06  .46
         BN (negated)   .00  .02  .00  1.0    .19  .49  .17  .04    1.0  .62  .62  .00    .00  .05  .00  .29    .99  .18  .18  .01
         MC             .83  .60  .55  .17    .91  .69  .69  .02    n/a  n/a  n/a  n/a    .89  .55  .51  .11    .44  .21  .19  .56
Airport  BN (basic)     .13  .01  .01  .67    .68  .23  .20  .32    .00  .00  .00  1.0    .00  .00  .00  .00    .71  .00  .00  .29
         BN (negated)   .00  .00  .00  .97    .11  .16  .03  .80    1.0  .02  .02  .00    .00  .00  .00  .42    1.0  .00  .00  .00
         MC             .71  .44  .44  .29    .96  .90  .90  .04    n/a  n/a  n/a  n/a    .76  .33  .33  .24    .13  .09  .05  .87
Music    BN (basic)     .58  .36  .34  .27    .83  .74  .68  .15    .76  .31  .29  .24    .51  .14  .11  .02    .41  .03  .03  .54
         BN (negated)   .20  .18  .14  .76    .63  .56  .54  .01    1.0  .29  .29  .00    .02  .05  .01  .10    .96  .04  .04  .00
         MC             .91  .68  .66  .09    .97  .87  .86  .01    .91  .71  .67  .09    .89  .55  .54  .11    .56  .37  .14  .44
Book     BN (basic)     .77  .13  .12  .05    .41  .21  .19  .08    .02  .02  .00  .98    .00  .00  .00  .01    .16  .01  .00  .83
         BN (negated)   .01  .01  .00  .94    .02  .02  .01  .01    1.0  .05  .05  .00    .00  .00  .00  .01    .98  .01  .01  .02
         MC             .55  .15  .12  .45    .48  .34  .34  .01    n/a  n/a  n/a  n/a    .47  .13  .11  .53    .40  .04  .03  .60

Measuring Performance based on Internal Knowledge. When measuring LLM performances, we also take into account their internal knowledge of the entities within the questions. That is, if an LLM is hallucinating on a question because it simply does not know the entity mentioned (e.g., the entity may not exist in its training data) we may want to skip that question for evaluation. We can assess an LLM’s knowledge by directly prompting if it knows an entity (e.g., Do you know about the movie “Harry Potter”?; see more details in Sec. B.1). For our datasets, the numbers of known entities per LLM are shown in Sec. B.2. For a fair evaluation, we perform three types of evaluations: (1) each LLM is evaluated only with questions with entities that it has knowledge of; (2) all LLMs are evaluated with questions that all the LLMs have knowledge of; and (3) all LLMs are evaluated with all questions, regardless of their knowledge. Due to space constraints, we only present results using (1), and the (2) and (3) results can be found in Sec. B.3.
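
The three evaluation settings differ only in which questions each LLM is scored on, which the sketch below makes explicit. All names here (`select_questions`, the mode labels, the `entity` field, and the toy known-entity sets) are hypothetical and introduced only for illustration.

```python
def select_questions(questions, known, mode, model=None):
    """Pick the questions to evaluate on.

    known maps a model name to the set of entities it reported knowing.
    mode is one of:
      'per_model' - setting (1): entities known by the given model
      'common'    - setting (2): entities known by all models
      'all'       - setting (3): every question
    """
    if mode == "per_model":
        keep = known[model]
    elif mode == "common":
        keep = set.intersection(*known.values())
    elif mode == "all":
        return list(questions)
    else:
        raise ValueError("unknown mode: " + mode)
    return [q for q in questions if q["entity"] in keep]


questions = [{"entity": "Star Wars"}, {"entity": "Harry Potter"}]
known = {  # toy knowledge sets, not measured values
    "model_a": {"Star Wars", "Harry Potter"},
    "model_b": {"Star Wars"},
}
print(len(select_questions(questions, known, "per_model", model="model_a")))
print(len(select_questions(questions, known, "common")))
print(len(select_questions(questions, known, "all")))
```

Setting (1) gives each model its own question set, so scores are not computed over identical questions across models, whereas settings (2) and (3) use a shared set.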

4.1 Results for Single-hop Questions

We evaluate the LLM performances using single-hop binary and multiple-choice questions on the 5 datasets. We first explain how we construct the questions and then analyze the LLM performances.

Binary Questions. We use the basic questions in Table 2 and expect a Yes answer. In addition, we construct negated versions of these questions and expect a No answer. For example, we can negate the basic question for the Movie dataset as “Is it true that there are no movies released in ⟨year⟩, starring ⟨star⟩ where ⟨dir⟩ is the director?” The reason we always negate a basic question instead of, say, changing one of its attribute values into an incorrect one is to still perform the FD-based verification. If the question cannot utilize FDs anymore, we can no longer just look for inferred values, but need to understand the entire LLM response to see if the rationale is correct. Table 3 shows that GPT-4 and GPT-3.5 tend to have superior performances, especially in terms of answer and rationale accuracies. Gemini-Pro has lower answer and rationale accuracies, but also low hallucination rates. All three LLMs have worse performances for the negated questions. Llama2 and Mistral, on the other hand, tend to give trivial No answers for most questions, which results in high performances on the negated questions, but low performances on the basic ones.

Figure 2: GPT-3.5 answer accuracy across domains with varying portions of the None of the above option in multiple-choice QA.
Table 4: LLM performances using the binary basic and negated multi-hop questions w/wo CoT prompting on 2 datasets. We exclude Mistral as it knows too few entities (less than 10), resulting in mostly “n/a” values; see Sec. B.2 for the # of entities known by each LLM.
                                       GPT-3.5 (175B)                  GPT-4 (UNK)                     Llama2 (70B)                    Gemini-Pro (UNK)
Dataset            Question Type       A    R-ext  AR-ext       H (↓)  A    R-ext  AR-ext       H (↓)  A    R-ext  AR-ext       H (↓)  A    R-ext  AR-ext       H (↓)
Movie w/ Director  BN (basic)          .74  .92    .73/.70      .21    .59  .98    .59/.59      .41    .02  .95    .02/.02      .98    .19  .42    .19/.19      .02
                   BN (negated)        .00  .92    .00/.00      .96    .40  .98    .40/.39      .60    1.0  .95    .98/.92      .00    .01  .60    .01/.01      .33
                   BN (basic) + CoT    .36  .95    .35/.34      .62    .80  .98    .80/.79      .20    .79  .97    .78/.77      .21    .31  .40    .31/.30      .00
                   BN (negated) + CoT  .54  .79    .53/.50      .27    .66  .98    .66/.65      .34    .06  .95    .06/.05      .94    .02  .28    .02/.01      .20
Soccer w/ Olympic  BN (basic)          .81  .08    .00/.05/.03  .17    .46  .54    .28/.42/.41  .25    .88  .22    .00/.00/.00  .12    .00  .01    .00/.00/.00  .00
                   BN (negated)        .82  .13    .00/.00/.00  .18    .55  .56    .15/.19/.19  .13    .12  .27    .25/.55/.44  .88    .00  .06    .00/.00/.00  .00
                   BN (basic) + CoT    .59  .60    .36/.54/.54  .33    .79  .73    .40/.55/.55  .20    .84  .41    .09/.14/.13  .15    .18  .54    .57/.73/.75  .60
                   BN (negated) + CoT  .76  .55    .19/.31/.31  .21    .70  .73    .55/.73/.73  .29    .33  .39    .32/.62/.62  .65    .41  .42    .09/.09/.09  .15

Multiple-choice Questions. For the multiple-choice questions, we generate 2–4 choices using the attributes on the right-hand side of the functional dependencies, with one incorrect choice using the construction in Sec. 3.2. For each question, we generate three versions with rephrased choices (e.g., “born in US” can be rephrased to “birthplace is US” and “nationality is US”) and report the averaged LLM performance to account for prompt sensitivity (see Sec. A.2 for more details). Table 3 shows that most LLMs demonstrate improved performances in multiple-choice questions compared to binary questions. GPT-4 and Gemini-Pro are comparable in answer and rationale accuracies, while GPT-4 has better hallucination rates. Llama2 and Mistral show good performances for the Movie and Music datasets. We also extend the multiple-choice questions with a None of the above option and evaluate with GPT-3.5 and GPT-4. Fig. 2 and Fig. 3 (in Sec. B.4) show that the LLM performances tend to decrease when the proportion of this option increases among the questions.

Instead of trying to rank LLMs by performance, we would like to make observations from a benchmark perspective: (1) rationale accuracy tends to be worse than answer accuracy and (2) the LLM performances vary significantly across question types, even when using the same entities. We thus conclude that there is much room for improvement for LLM rationale and that LLM benchmarking should always involve diverse questions for a comprehensive analysis.