Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering

Hongda Sun (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China), Yuxuan Liu (Nankai University, Tianjin, China), Chengwei Wu (Beijing Academy of Artificial Intelligence, Beijing, China), Haiyu Yan (Renmin University of China, Beijing, China), Cheng Tai (Moqi, Inc., 20 Emerald Hill Road, Singapore), Xin Gao (King Abdullah University of Science and Technology, Thuwal, Saudi Arabia), Shuo Shang (University of Electronic Science and Technology of China, Chengdu, China), and Rui Yan (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)

(2024)

Abstract.

Open-domain question answering (ODQA) has emerged as a pivotal research topic in information systems. Existing methods follow two main paradigms to collect evidence: (1) the retrieve-then-read paradigm retrieves pertinent documents from an external corpus; and (2) the generate-then-read paradigm employs large language models (LLMs) to generate relevant documents. However, neither can fully address the multifaceted requirements for evidence. To this end, we propose LLMQA, a generalized framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation, combining the strengths of both retrieval-based and generation-based evidence. Since LLMs exhibit excellent capabilities across various tasks, we instruct LLMs to play multiple roles as generators, rerankers, and evaluators within our framework, integrating them to collaborate in the ODQA process. Furthermore, we introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers. Extensive experimental results on widely used benchmarks (NQ, WebQ, and TriviaQA) demonstrate that LLMQA achieves the best performance in terms of both answer accuracy and evidence quality, showcasing its potential for advancing ODQA research and applications.

Keywords: Question answering, Role-playing LLMs, Prompt optimization

Published in: Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore. ISBN: 979-8-4007-0171-9/24/05. DOI: 10.1145/3589334.3645670. Copyright: ACM licensed. CCS Concepts: Computing methodologies, Natural language generation; Information systems, Information systems applications.

1. Introduction

In the interdisciplinary realm of information system applications, open-domain question answering (ODQA) has emerged as a pivotal research spotlight. This task is intrinsically knowledge-intensive and focuses on answering factoid questions, enhancing its utility to transcend the constraints of predefined domains (Chen et al., 2017; Petroni et al., 2020; Min et al., 2021; Gao et al., 2019).

Current ODQA methods follow two main paradigms in preparation for answering questions: (1) The retrieve-then-read paradigm retrieves pertinent evidence documents from an external corpus and generates an answer based on them (Karpukhin et al., 2020; Izacard and Grave, 2021). Since retrieval models often rely on well-curated corpora like Wikipedia, they can provide highly factual and accurate information about the question; (2) The generate-then-read paradigm directly employs language models to generate virtual documents (Yu et al., 2023), diversifying the evidence sources and enhancing answer coverage for the question.

Despite the individual merits of both paradigms, neither can adequately address the multifaceted requirements for evidence. An intuitive solution is to integrate the strengths of these paradigms to collect evidence that combines factual reliability with diversity. To this end, we propose LLMQA, a novel generalized framework that incorporates the strengths of retrieval-based and generation-based evidence. Specifically, we formulate the ODQA process into three fundamental steps: (1) Query expansion involves expanding the question by producing background passages or explanations, serving as generated-based evidence to enrich the context; (2) Document selection integrates retrieval-based evidence by reranking the retrieved documents, increasing their relevance to the answer; (3) Answer generation proceeds to generate the final answer based on comprehension of the question and obtained evidence.

To implement each step of ODQA, previous methods typically train specialized models for individual modules to obtain evidence documents and final answers (Chuang et al., 2023). Limited by the inherent capabilities of these models, jointly optimizing each module to improve overall performance remains challenging. Recent works have showcased the exceptional capabilities of large language models (LLMs) across various tasks (Bubeck et al., 2023). ODQA in particular requires integrating text generation (Qin et al., 2023; Sun et al., 2023a; Yu et al., 2023; Chuang et al., 2023), document ranking (Ma et al., 2023a; Chuang et al., 2023; Ma et al., 2023b), and candidate evaluation (Bohnet et al., 2022; Yue et al., 2023), and LLMs offer each of these capabilities for the corresponding module. Therefore, we aim to instruct LLMs to play three roles within our proposed unified framework: generators, rerankers, and evaluators. By closely coordinating these roles and fostering collaboration among them, we can fully exploit their potential to enhance overall performance. As shown in Figure 1, the generator expands the query and provides comprehensive and pertinent information for answer generation; the reranker prioritizes retrieved documents to distill more valid and relevant documents as evidence; and the evaluator interacts with the generator and reranker, providing evaluative feedback to refine their outputs.

The quality of LLMs in playing their distinct roles hinges on the quality of the prompts used to define tasks and guide their behaviors. Therefore, the precision of obtained evidence is also sensitive to these prompts. To better automatically design prompts, we present a novel prompt optimization algorithm to enhance the performance of LLMs across various roles within our framework. During the ODQA generation process, we treat evidence (e.g., query expansion and selected documents) as latent variables and leverage variational inference to optimize role-playing prompts, guiding LLMs toward producing higher-quality evidence and answers.

We conduct experiments on widely used ODQA benchmarks: NQ, WebQ, and TriviaQA. The results show that LLMQA advances the state of the art in both answer accuracy and evidence quality. Compared with baselines, LLMQA achieves remarkable improvements in EM scores (+4.0 on TriviaQA, +2.7 on WebQ, and +3.1 on NQ), demonstrating the effectiveness of multi-role LLMs for ODQA. The query expansion generator achieves recall scores of 73%, 76%, and 87% for the answer in generated expansions. The reranker increases answer coverage by about 8.1%.

In summary, our main contributions are as follows:

$\bullet$ We propose LLMQA, a generalized framework that formulates the ODQA process and offers a novel paradigm for combining the strengths of retrieval-based and generation-based evidence.

$\bullet$ We effectively instruct LLMs to play three roles (generators, rerankers, and evaluators) and integrate their collaborative interactions under our proposed unified framework.

$\bullet$ We introduce a novel prompt optimization algorithm to guide LLMs in producing higher-quality evidence and answers. Extensive experimental results show that LLMQA achieves the best performance in terms of both answer accuracy and evidence quality.

2. Related Work

2.1. Open-Domain Question Answering

For collecting the related documents as evidence, existing methods can be categorized into the following two main paradigms:

Retrieve-then-read paradigm. Pioneered by Chen et al. (2017), most recent approaches consist of two main modules: the retriever first retrieves documents relevant to the given question from an external knowledge base, and the reader then comprehends the question and retrieved documents and generates the corresponding answer. One branch focuses on improving the retriever. Sparse retrieval with inverted indexes (e.g., TF-IDF or BM25) is generally used in traditional approaches (Robertson et al., 2009). Dense retrieval using language models, such as ORQA (Lee et al., 2019), DPR (Karpukhin et al., 2020), RocketQA (Qu et al., 2020), ColBertQA (Khattab et al., 2021), and ART (Sachan et al., 2023), has since become dominant. The other branch focuses on enhancing the comprehension ability of the reader to generate more accurate answers (Izacard and Grave, 2021; Cheng et al., 2021). With the development of LLMs, most readers are adapted from fine-tuned T5 (Raffel et al., 2020) or InstructGPT (Ouyang et al., 2022).

Generate-then-read paradigm. Previous works have demonstrated that the knowledge preserved in LLMs can serve as a “generative retriever” (Radford et al., 2019; Petroni et al., 2019; Roberts et al., 2020). Although many existing approaches adopt LLMs in ODQA, they cannot fully harness the generation capability of LLMs (Qin et al., 2023; Kamalloo et al., 2023; Wang et al., 2023; Sun et al., 2023a). GenRead is the first to explore the potential of the generation-based evidence for ODQA, which instructs an LLM to generate documents based on clusters of question-document pairs and the given question (Yu et al., 2023). Then these generated documents and the question are fed into LLM together to produce the final answer.

Considering the limitations of a single paradigm, we propose to seamlessly integrate retrieval-based and generation-based evidence, effectively harnessing the capabilities of LLMs.

2.2. Capabilities of LLMs

Recent advancements in model scales (Chowdhery et al., 2022; Ouyang et al., 2022) provide LLMs with impressive capabilities in text generation, ranking, and evaluation.

Generation capability of LLMs. Recent studies have highlighted the superior text generation capability of LLMs in few-shot and zero-shot scenarios (Brown et al., 2020; Chowdhery et al., 2022; Zhou et al., 2023; Bubeck et al., 2023). The knowledge stored in LLMs could be retrieved during inference (Petroni et al., 2019; Roberts et al., 2020). Hence, some studies directly prompt LLMs to generate answers to the question in ODQA (Qin et al., 2023; Kamalloo et al., 2023; Wang et al., 2023; Sun et al., 2023a). Other approaches utilize the generation capability to expand the query or enrich the context (Mao et al., 2020; Yu et al., 2023; Chuang et al., 2023; Nogueira et al., 2019).

Ranking capability of LLMs. Previous works show that compared to few-shot information extraction, LLMs are better at reranking for hard examples. Ma et al. propose a filter-then-rerank paradigm, which utilizes LLMs to rerank the candidates filtered by smaller language models to generate the final response. Chuang et al. apply LLMs to rerank expanded queries for better results. Ma et al. replace pointwise reranking with listwise reranking to reorder the list of documents based on the relevance to the query.

Evaluation capability of LLMs. LLMs are chosen as evaluators due to their robust comprehension and reasoning capabilities (Bubeck et al., 2023; Chiang et al., 2023; Gao et al., 2024; Sun et al., 2023b). Weng et al. leverage the self-verification capability of LLMs for better reasoning. Shinn et al. use self-reflective feedback as a semantic gradient providing a concrete direction to learn from prior mistakes. Madaan et al. present an iterative self-refinement algorithm that alternates between feedback and refinement. Additionally, LLMs are also used to evaluate attribution between generated answers and references (Bohnet et al., 2022; Yue et al., 2023).

In this paper, we aim to effectively integrate multi-role capabilities of LLMs to enhance the overall performance on ODQA.

2.3. Prompt Optimization

Previous works have emphasized that subtle differences in prompts can lead to substantial performance degradation in generated results (Gao et al., 2020; Zhou et al., 2022; Liu et al., 2023). Consequently, prompt optimization has attracted great attention in recent years, with two primary approaches: manual design (Reynolds and McDonell, 2021) and automatic generation (Shin et al., 2020). Gradient-based prompt tuning can optimize prompt embeddings in a continuous space (Liu et al., 2021b, a). In contrast, discrete prompt optimization has been extensively studied, including prompt scoring (Davison et al., 2019), prompt generation (Gao et al., 2020), and prompt paraphrasing (Yuan et al., 2021). Recently, Zhou et al. proposed APE for automatic prompt optimization, which iteratively selects prompt candidates to maximize a score function. DLN (Sordoni et al., 2023) goes a step further by viewing LLMs as language layers and prompts as learnable parameters.

Inspired by these methods, we present a novel prompt optimization algorithm to refine essential prompts for query expansion, document reranking, and answer generation, enabling LLMs to produce better evidence and answers.

3. Method

3.1. Task Formulation

Previous methods collect evidence by retrieving or generating relevant background passages or explanations to facilitate accurate answer identification (Chen et al., 2017; Karpukhin et al., 2020; Ouyang et al., 2022). Expanding upon this concept, we formulate the generation process of ODQA as the following three fundamental steps: (1) Query expansion: We commence with the input question, designated as query $q$ . To enrich the context and improve document selection and answer generation, we utilize knowledge stored in language models to generate additional background information, denoted as query expansion $e$ ; (2) Document selection: Leveraging both the query $q$ and its expansion $e$ , we initially retrieve the top- $n$ documents that are relevant to answering the question as candidates. Subsequently, we compare these candidates to prioritize those documents most likely to contain the answer. Based on this criterion, we rerank these $n$ candidates and retain top- $k$ documents, represented as $d$ , which collectively constitute the evidence in conjunction with query expansion $(e,d)$ ; (3) Answer generation: Based on the query $q$ and the derived evidence $(e,d)$ , we proceed to generate the final answer $a$ in response to the question with a reader model.

Furthermore, this generation process can be effectively formulated using a Bayesian graphical model that aligns closely with the three aforementioned steps, parameterized by the following probability distribution:

(1)		$\displaystyle P(a|q)=\sum_{e}\sum_{d}P(e|q)P(d|q,e)P(a|q,e,d),$

where we consider the evidence $(e,d)$ as latent variables, which need to be optimized by maximizing this marginal likelihood. Consequently, acquiring the most appropriate evidence for question answering becomes a critical aspect of this task. Considering the strong performance of LLMs on various tasks, we harness LLMs in multiple roles that collaborate with each other in the ODQA generation process. The framework overview of LLMQA is shown in Figure 2. In the subsequent sections, we describe in detail how to leverage the multi-role capabilities of LLMs to enhance the ODQA task.
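To make the marginalization in Equation (1) concrete, the following minimal sketch sums over a tiny discrete space of latent evidence. The probability tables and the names `p_e_given_q`, `p_d_given_qe`, and `p_a_given_qed` are hypothetical and chosen only for illustration; in LLMQA these distributions are induced by prompted LLMs rather than stored in lookup tables.

```python
# Toy conditional probability tables (all numbers are invented for illustration).
p_e_given_q = {"e1": 0.6, "e2": 0.4}                        # P(e | q)
p_d_given_qe = {"e1": {"d1": 0.7, "d2": 0.3},               # P(d | q, e)
                "e2": {"d1": 0.2, "d2": 0.8}}
p_a_given_qed = {("e1", "d1"): 0.9, ("e1", "d2"): 0.5,      # P(a | q, e, d)
                 ("e2", "d1"): 0.4, ("e2", "d2"): 0.1}


def marginal_answer_likelihood() -> float:
    """Sum P(e|q) * P(d|q,e) * P(a|q,e,d) over every latent pair (e, d)."""
    total = 0.0
    for e, p_e in p_e_given_q.items():
        for d, p_d in p_d_given_qe[e].items():
            total += p_e * p_d * p_a_given_qed[(e, d)]
    return total


likelihood = marginal_answer_likelihood()
print(f"P(a | q) = {likelihood:.3f}")
```

Each term in the sum is the contribution of one evidence pair, so evidence that is both likely under the prompts and supportive of the answer dominates the likelihood.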

Figure 1. Collaborative interactions of multiple LLM roles.

3.2. Query Expansion

Generally, the questions posed in ODQA datasets are brief and concise, indicating that relying solely on the question itself as a query can lead to a substantial challenge: inadequate query context makes it difficult to support accurate document selection and answer generation. To address this challenge, we add a pivotal step known as query expansion that aims to enrich the original question with a broader context. The generated expansions are mainly used to analyze the key points required to answer a given question and provide sufficient background information for subsequent steps. In this process, we instruct an LLM to play the role of generator leveraging its powerful context understanding and text generation capabilities. Specifically, we employ an LLM-based expansion generator $G_{e}$ to facilitate the query expansion step. Given a question $q$ , its query expansion $e$ can be generated by

(2)		$\displaystyle e=G_{e}(q;\theta_{e}),$

where $\theta_{e}$ represents the prompt to instruct the query expansion.

3.3. Document Selection

In addition to the query expansion, relevant documents are more commonly used as evidence to include accurate answers to the question. To identify the most appropriate documents, we divide this document selection process into two distinct stages:

(1) Coarse-grained retrieval of top- $n$ documents: we first retrieve a set of top- $n$ documents that are potentially relevant to the given question by employing established information retrieval techniques such as DPR (Karpukhin et al., 2020) or BM25 (Chen et al., 2017). These retrieval methods provide an initial score for each candidate to describe the relevance between documents and questions. However, such methods may not always capture nuanced semantic relationships between the query and documents, leading to false positives or irrelevant documents in the initial set.

(2) Fine-grained reranking of top- $k$ documents from $n$ candidates: we then rerank the documents so that those more likely to contain the answer are prioritized. This stage involves comparing the documents to determine which ones exhibit higher quality and relevance to the query. Inspired by LLM-based ranking approaches (Ma et al., 2023a; Chuang et al., 2023; Ma et al., 2023b), we instruct an LLM to play the role of document reranker $R_{d}$ , which selects the top- $k$ ( $k<n$ ) documents from the initial pool of $n$ candidates. Because LLMs limit the number of input tokens, we rerank a subset of documents at a time and cover all candidates with a sliding window. Specifically, we set the window size to $w$ and the step size to $l$ , and start from the last position of the initially sorted documents. In each iteration, we compare the $w$ documents within the sliding window and reorder them by their likelihood of containing the answer. The window then moves forward by $l$ positions: the top $w-l$ reranked documents from the previous window are kept, $l$ new documents are added, and the resulting $w$ documents are reordered. This iterative process continues until the sliding window reaches the front, and we take the first $k=w-l$ documents as the final evidence documents $d$ . Overall, this document selection process can be simplified as:

(3)		$\displaystyle d=R_{d}(q,e;\theta_{d}),$

where $\theta_{d}$ denotes the prompt for $R_{d}$ to ensure that documents are ranked in alignment with the desired relevance and quality.
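The sliding-window procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rank_window` is a hypothetical callable standing in for the LLM reranker $R_{d}$ (it receives the documents in one window and returns them best-first), and the function name `sliding_window_rerank` is ours.

```python
from typing import Callable, List, Sequence, TypeVar

Doc = TypeVar("Doc")


def sliding_window_rerank(
    docs: Sequence[Doc],
    rank_window: Callable[[List[Doc]], List[Doc]],
    w: int,
    l: int,
    k: int,
) -> List[Doc]:
    """Rerank `docs` back to front with window size `w` and step `l`, then keep the top `k`."""
    if not 0 < l < w:
        raise ValueError("step l must satisfy 0 < l < w so consecutive windows overlap")
    ranked = list(docs)
    end = len(ranked)
    while True:
        start = max(0, end - w)
        # Reorder only the documents inside the current window (one reranker call).
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        # Move toward the front: the top w - l documents stay in the next window
        # and l not-yet-seen documents join them.
        end -= l
    return ranked[:k]


# Toy stand-in for the LLM: each document is an integer relevance, higher is better.
toy_ranker = lambda window: sorted(window, reverse=True)
top_docs = sliding_window_rerank(list(range(100)), toy_ranker, w=20, l=10, k=10)
print(top_docs)
```

With $n=100$ , $w=20$ , and $l=10$ , the loop issues nine reranker calls. A document that the reranker consistently prefers can travel from the last position to the first, because every window keeps its top $w-l$ documents for the next comparison.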

3.4. Answer Generation

Based on the query $q$ , and the evidence $(e,d)$ , the final step in ODQA is to generate the final answer with the integration and comprehension of pertinent information within the evidence. The evidence can encompass essential information that directly provides the answer to the question, or it may comprise an analysis and explanation necessary for formulating the answer. Consequently, the central objective of answer generation is to employ a reader model for the systematic extraction and comprehension of valuable insights from the evidence context. We utilize an LLM-based reader $G_{a}$ to generate a precise and dependable response as the predicted answer to the question, and formulate this process as:

(4)		$\displaystyle a=G_{a}(q,e,d;\theta_{a}),$

where $\theta_{a}$ indicates the prompt for answer generation to ensure that the generated answer can align with the context and requirements of the original question and its evidence.
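Equations (2) to (4) compose into a single forward pass. The sketch below wires the three steps together with caller-supplied functions; the names `Prompts`, `answer_question`, `generate_expansion`, `retrieve`, `rerank`, and `read` are hypothetical placeholders for the LLM generator, the retriever, the LLM reranker, and the reader, none of which are specified at this level of detail in the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prompts:
    theta_e: str  # prompt for query expansion
    theta_d: str  # prompt for document reranking
    theta_a: str  # prompt for answer generation


def answer_question(
    q: str,
    prompts: Prompts,
    generate_expansion: Callable[[str, str], str],
    retrieve: Callable[[str, str, int], List[str]],
    rerank: Callable[[str, str, List[str], str], List[str]],
    read: Callable[[str, str, List[str], str], str],
    n: int = 100,
    k: int = 10,
) -> Tuple[str, List[str], str]:
    """Run query expansion, document selection, and answer generation in order."""
    e = generate_expansion(q, prompts.theta_e)           # Eq. (2): e = G_e(q; theta_e)
    candidates = retrieve(q, e, n)                       # coarse-grained top-n retrieval
    d = rerank(q, e, candidates, prompts.theta_d)[:k]    # Eq. (3): d = R_d(q, e; theta_d)
    a = read(q, e, d, prompts.theta_a)                   # Eq. (4): a = G_a(q, e, d; theta_a)
    return e, d, a
```

Keeping the prompts in one small structure mirrors the later treatment of $\theta_{e}$ , $\theta_{d}$ , and $\theta_{a}$ as the only learnable parameters.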

3.5. Evaluators for Generation and Reranking

As shown in Figure 1, evaluators also play a crucial role in query expansion and document reranking, engaging in a dynamic interaction with both the generator and reranker. Leveraging the advanced capabilities of LLMs to evaluate text quality under specific standards, we instruct LLMs to play the role of evaluators that assess the performance of the generator and the reranker. The primary objective of evaluators is to assign quality scores to multiple candidates generated by the generator and reranker. These scores reflect the likelihood that each candidate is appropriate and accurate for specific conditions or requirements. For a given question, we employ the expansion evaluator $S_{e}$ to score each candidate expansion individually on a scale from 0 to 1, assessing its relevance and logical consistency. Similarly, we use the reranking evaluator $S_{r}$ to score different top- $k$ reranking candidates generated by the reranker $R_{d}$ , assessing the contribution of each ranking result to answering the question. The scoring process of evaluators $S_{e}$ and $S_{r}$ can be formulated as:

(5)		$\displaystyle s_{e_{j}}$	$\displaystyle=S_{e}(e_{j};G_{e}(q)),$
(6)		$\displaystyle s_{d_{j}}$	$\displaystyle=S_{r}(d_{j};R_{d}(q,e)),$

where $s_{e_{j}}$ and $s_{d_{j}}$ represent the scores assigned to the $j$ -th candidate of generated query expansion and reranked documents. These scores serve as critical metrics for evaluating the performance of the generator and reranker and further promoting overall generation and ranking capabilities of LLMs.
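As a rough illustration of how such scores are consumed, the sketch below scores several candidates, clips each score to $[0,1]$ , and sorts them best-first. The scorer `jaccard_overlap` is purely a stand-in assumption so that the example runs without an LLM; it is not the evaluator used in the paper, and `score_candidates` is a hypothetical helper name.

```python
from typing import Callable, List, Tuple


def jaccard_overlap(question: str, candidate: str) -> float:
    """Toy relevance score: token overlap between question and candidate (stand-in for an LLM evaluator)."""
    q_tokens = set(question.lower().split())
    c_tokens = set(candidate.lower().split())
    if not q_tokens or not c_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens | c_tokens)


def score_candidates(
    question: str,
    candidates: List[str],
    scorer: Callable[[str, str], float] = jaccard_overlap,
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted best-first, with scores clipped to [0, 1]."""
    scored = [(c, min(1.0, max(0.0, scorer(question, c)))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = score_candidates(
    "who wrote hamlet",
    ["hamlet was written by shakespeare", "the weather is nice"],
)
best_candidate, best_score = ranked[0]
```

Any scorer with the same signature can be swapped in, which is the point of keeping the evaluator behind a single callable.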

Algorithm 1 Training Process for Prompt Optimization.

Input: Training data: $X_{tra}$ , LLM-based roles: $G_{e}$ , $R_{d}$ , $G_{a}$ , $S_{e}$ , $S_{r}$ , $S_{a}$ , backward updating functions: $U_{e}$ , $U_{d}$ , $U_{a}$ .

Parameters: $\theta_{e}$ , $\theta_{d}$ , $\theta_{a}$ .

1: for $(q,a)\in X_{tra}$ do

2: Generate the prior $\widetilde{e}=G_{e}(q;\theta_{e})$ by Expansion Generator

3: Generate the prior $\widetilde{d}=R_{d}(q,\widetilde{e};\theta_{d})$ by Document Reranker

4: Generate the prior $\widetilde{a}=G_{a}(q,\widetilde{e},\widetilde{d};\theta_{a})$ by Answer Generator

5: Sample $n$ posterior $\widehat{d}_{1},\widehat{d}_{2},\cdots,\widehat{d}_{n}$ from $R_{d}(q,\widetilde{e},\widetilde{d},a;\phi_{d})$

6: Score $s_{d_{i}}=S_{r}(\widehat{d}_{i};R_{d}(q,\widetilde{e}))$ , $s_{a_{i}}=S_{a}(a;G_{a}(q,\widetilde{e},\widehat{d}_{i}))$ for $\widehat{d}_{i}$

7: Calculate posterior score $v_{d_{i}}=s_{d_{i}}*s_{a_{i}}$

8: Select the best posterior $\widehat{d}_{*}=\text{argmax}_{i}\{v_{d_{i}}\}$

9: Sample $m$ posterior $\widehat{e}_{1},\widehat{e}_{2},\cdots,\widehat{e}_{m}$ from $G_{e}(q,\widetilde{e},\widehat{d}_{*},a;\phi_{e})$

10: Score $s_{e_{j}}=S_{e}(\widehat{e}_{j};G_{e}(q))$ , $s_{d_{j}}=S_{r}(\widehat{d}_{*};R_{d}(q,\widehat{e}_{j}))$ , and $s_{a_{j}}=S_{a}(a;G_{a}(q,\widehat{e}_{j},\widehat{d}_{*}))$ for each $\widehat{e}_{j}$

11: Calculate posterior score $v_{e_{j}}=s_{e_{j}}*s_{d_{j}}*s_{a_{j}}$

12: Select the best posterior $\widehat{e}_{*}=\text{argmax}_{j}\{v_{e_{j}}\}$

13: Sample $K$ candidates $\widehat{\theta}_{a_{k}}=U_{a}(q,\widetilde{e},\widetilde{d},\widetilde{a},a)$ for $\theta_{a}$

14: Select $\widehat{\theta}_{a_{*}}=\operatorname*{argmax}\limits_{k}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}v_{e_{j}}v_{d_{i}}\log S_{a}(a;G_{a}(q,\widehat{e}_{j},\widehat{d}_{i};\widehat{\theta}_{a_{k}}))$

15: Sample $K$ candidates $\widehat{\theta}_{d_{k}}=U_{d}(q,\widetilde{e},\widetilde{d},\widehat{d}_{*})$ for $\theta_{d}$

16: Select $\widehat{\theta}_{d_{*}}=\operatorname*{argmax}\limits_{k}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}v_{e_{j}}v_{d_{i}}\log S_{r}(\widehat{d}_{i};R_{d}(q,\widehat{e}_{j};\widehat{\theta}_{d_{k}}))$

17: Sample $K$ candidates $\widehat{\theta}_{e_{k}}=U_{e}(q,\widetilde{e},\widehat{e}_{*})$ for $\theta_{e}$

18: Select $\widehat{\theta}_{e_{*}}=\operatorname*{argmax}\limits_{k}\sum\limits_{j=1}^{m}v_{e_{j}}\log S_{e}(\widehat{e}_{j};G_{e}(q;\widehat{\theta}_{e_{k}}))$

19: Update parameters: $\theta_{a}\leftarrow\widehat{\theta}_{a_{*}}$ , $\theta_{d}\leftarrow\widehat{\theta}_{d_{*}}$ , $\theta_{e}\leftarrow\widehat{\theta}_{e_{*}}$

20: end for

21: return $\theta_{e}$ , $\theta_{d}$ , $\theta_{a}$

3.6. Prompt Optimization

The role-play performance of the generator and reranker still relies heavily on the prompt design at each step of the ODQA generation process. Therefore, we explore how to design better role-play prompts for expansion generation $\theta_{e}$ , document reranking $\theta_{d}$ , and answer generation $\theta_{a}$ to fully exploit the potential of LLMs. We propose a novel algorithm to enable prompt optimization under the unique graphical model structure of ODQA. Throughout the ODQA generation process, we do not require access to the LLM parameters, but instead treat the three natural language prompts as learnable parameters. In Equation (1), the distributions of latent variables $e$ and $d$ are determined by these prompts and need to be approximated by probabilistic inference techniques. To ensure consistency with the graphical model, we propose to use variational inference to learn the hidden distributions and optimize prompts. We denote the prior distribution as $P_{\theta}$ and the posterior distribution as $P_{\phi}$ , so that the original log-likelihood is bounded by the following ELBO:

		$\displaystyle\log P(a|q)$
(7)		$\displaystyle\geq$	$\displaystyle\sum_{e}\sum_{d}P_{\phi_{e}}(e|q,a)P_{\phi_{d}}(d|q,e,a)\log\frac{P_{\theta_{e}}(e|q)P_{\theta_{d}}(d|q,e)P_{\theta_{a}}(a|q,e,d)}{P_{\phi_{e}}(e|q,a)P_{\phi_{d}}(d|q,e,a)}.$
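For readers who want the intermediate step, the bound in Equation (7) follows from applying Jensen's inequality to the marginal in Equation (1). In the sketch below, $Q(e,d)$ is shorthand introduced here (not notation from the paper) for the product of the two posteriors, and it is assumed to be positive wherever the numerator is positive.

```latex
\begin{aligned}
\log P(a|q)
&= \log \sum_{e}\sum_{d} P_{\theta_{e}}(e|q)\,P_{\theta_{d}}(d|q,e)\,P_{\theta_{a}}(a|q,e,d) \\
&= \log \sum_{e}\sum_{d} Q(e,d)\,
   \frac{P_{\theta_{e}}(e|q)\,P_{\theta_{d}}(d|q,e)\,P_{\theta_{a}}(a|q,e,d)}{Q(e,d)} \\
&\geq \sum_{e}\sum_{d} Q(e,d)\,
   \log \frac{P_{\theta_{e}}(e|q)\,P_{\theta_{d}}(d|q,e)\,P_{\theta_{a}}(a|q,e,d)}{Q(e,d)},
\qquad Q(e,d) = P_{\phi_{e}}(e|q,a)\,P_{\phi_{d}}(d|q,e,a).
\end{aligned}
```

The inequality holds because the logarithm is concave and $Q(e,d)$ sums to one over all $(e,d)$ , so the log of an expectation is at least the expectation of the log.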

As shown in Algorithm 1, for the question $q$ , we use predefined $G_{e}$ , $R_{d}$ and $G_{a}$ to sequentially simulate the priors $P_{\theta_{e}}$ , $P_{\theta_{d}}$ , and $P_{\theta_{a}}$ , and generate the query expansion $\widetilde{e}$ , the reranked documents $\widetilde{d}$ and the predicted answer $\widetilde{a}$ during forward inference. Next, to approximate the posteriors $P_{\phi_{e}}$ and $P_{\phi_{d}}$ , we consider the following two aspects: (1) We add the ground-truth target as an additional condition to estimate the posteriors; (2) We sample several posterior candidates near the prior to ensure low Kullback–Leibler (KL) divergence between them in the space of discrete texts. Denoting the prior reranked documents as $\widetilde{d}=(\widetilde{d}_{1},\widetilde{d}_{2},\cdots,\widetilde{d}_{k-1},\widetilde{d}_{k})$ , the $i$ -th posterior reranked documents can be denoted as $\widehat{d}_{i}=(\widetilde{d}_{1},\widetilde{d}_{2},\cdots,\widetilde{d}_{k-1},\widetilde{d}_{k+i})$ , where only the “last document” in the list is replaced. Then, following step 6 of Algorithm 1, we use evaluators $S_{r}$ and $S_{a}$ to score each posterior candidate for estimating $P_{\phi_{d}}$ . The best posterior $\widehat{d}_{*}$ among these candidates is selected as the current “ground-truth” reranking documents. The posterior query expansions are generated by a minor edit of the prior expansion, and the best posterior $\widehat{e}_{*}$ is then selected by analogy.
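The construction of posterior document lists and the selection of $\widehat{d}_{*}$ (steps 5 to 8 of Algorithm 1) can be sketched as follows. The helper names `posterior_doc_lists` and `select_best_posterior` are hypothetical, and `score_rank` and `score_answer` are caller-supplied stand-ins for the evaluators $S_{r}$ and $S_{a}$ ; the sketch assumes the replacement documents are simply those ranked just below the top $k$ .

```python
from typing import Callable, List, Sequence, Tuple


def posterior_doc_lists(prior_docs: Sequence[str], next_docs: Sequence[str]) -> List[List[str]]:
    """Build candidates that keep the first k-1 prior documents and swap only the last one."""
    head = list(prior_docs[:-1])
    return [head + [replacement] for replacement in next_docs]


def select_best_posterior(
    candidates: List[List[str]],
    score_rank: Callable[[List[str]], float],
    score_answer: Callable[[List[str]], float],
) -> Tuple[List[str], List[float]]:
    """Score each candidate with v = s_d * s_a and return the argmax together with all scores."""
    values = [score_rank(c) * score_answer(c) for c in candidates]
    best_index = max(range(len(candidates)), key=values.__getitem__)
    return candidates[best_index], values


# Toy usage: the answer can only be found in "d5".
candidates = posterior_doc_lists(["d1", "d2", "d3"], ["d4", "d5"])
best, values = select_best_posterior(
    candidates,
    score_rank=lambda docs: 0.5,
    score_answer=lambda docs: 1.0 if "d5" in docs else 0.2,
)
```

Because every candidate shares its first $k-1$ documents with the prior list, the candidates stay close to the prior, which is the discrete-text analogue of keeping the KL divergence small.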

Subsequently, we define a backward process to update prompts. For the answer generation prompt $\theta_{a}$ , we sample $K$ candidates near it using an updating function $U_{a}(q,\widetilde{e},\widetilde{d},\widetilde{a},a)$ , to guide prompts to update in a direction that brings the predicted answer $\widetilde{a}$ closer to the actual answer $a$ . Then all the previous posterior candidates are used to estimate ELBO, and the best $\widehat{\theta}_{a_{*}}$ to maximize ELBO can be selected as the refined prompt for answer generation. Similar processes are introduced to refine prompts $\theta_{d}$ and $\theta_{e}$ , while updating functions $U_{d}$ and $U_{e}$ are used to guide the directions to refine document reranking and query expansion.
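The selection of the refined answer prompt (step 14 of Algorithm 1) amounts to an argmax over a weighted sum of log scores. The sketch below is a minimal version under stated assumptions: `answer_score` is a hypothetical callable returning $S_{a}(a;G_{a}(q,\widehat{e}_{j},\widehat{d}_{i};\widehat{\theta}_{a_{k}}))$ as a number in $[0,1]$ , and the small constant `eps` is our addition to avoid taking the logarithm of zero.

```python
import math
from typing import Callable, List, Sequence


def select_answer_prompt(
    prompt_candidates: Sequence[str],
    expansions: Sequence[str],
    doc_lists: Sequence[List[str]],
    v_e: Sequence[float],
    v_d: Sequence[float],
    answer_score: Callable[[str, str, List[str]], float],
    eps: float = 1e-9,
) -> str:
    """Pick the prompt maximizing sum_i sum_j v_e[j] * v_d[i] * log(answer_score)."""

    def objective(prompt: str) -> float:
        total = 0.0
        for d_hat, weight_d in zip(doc_lists, v_d):
            for e_hat, weight_e in zip(expansions, v_e):
                score = answer_score(prompt, e_hat, d_hat)
                total += weight_e * weight_d * math.log(max(score, eps))
        return total

    return max(prompt_candidates, key=objective)


# Toy usage with two candidate prompts and one posterior sample of each kind.
chosen = select_answer_prompt(
    ["Answer briefly.", "Answer using the evidence."],
    expansions=["e1"],
    doc_lists=[["d1"]],
    v_e=[1.0],
    v_d=[1.0],
    answer_score=lambda p, e, d: 0.9 if "evidence" in p else 0.3,
)
```

The prompts for $\theta_{d}$ and $\theta_{e}$ can be selected with the same pattern by swapping in the corresponding evaluator score.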

4. Experiments

Table 1. Comparison results on TriviaQA, WebQ, and NQ datasets. Our EM scores are given by the mean of 10 rounds of bootstrap sampling, with bold numbers indicating $p$ -values below 0.01 under a significance test. ${\dagger}$ was reported by paper, $*$ was reproduced by ours.

| Method | #Reader parameters | #Documents | TriviaQA | WebQ | NQ |
| --- | --- | --- | --- | --- | --- |
| Baselines without LLMs | | | | | |
| BM25+Bert ${}^{\dagger}$ | 220M | | 47.1 | 21.3 | 26.5 |
| REALM ${}^{\dagger}$ | 330M | | | 40.7 | 40.4 |
| DPR ${}^{\dagger}$ | 110M | 100 | 56.8 | 41.1 | 41.5 |
| RAG ${}^{\dagger}$ | 400M | | 56.1 | 45.2 | 44.5 |
| FiD-l ${}^{\dagger}$ | 770M | | 61.9 | 48.1 | 46.7 |
| FiD-xl ${}^{\dagger}$ | 3B | | 66.3 | 50.8 | 50.1 |
| Baselines employing LLMs as generators | | | | | |
| GenRead (FiD-l) (sampling) ${}^{\dagger}$ | 770M | | 67.8 | 51.5 | 40.3 |
| GenRead (FiD-l) (clustering) ${}^{\dagger}$ | 770M | | 70.2 | 53.5 | 43.5 |
| GenRead (FiD-xl) (sampling) ${}^{\dagger}$ | 3B | | 69.6 | 52.6 | 42.6 |
| GenRead (FiD-xl) (clustering) ${}^{\dagger}$ | 3B | | 71.6 | 54.4 | 45.6 |
| EAR+FiD-l ${}^{\dagger}$ | 770M | 100 | 71.2 | | 51.4 |
| EAR+FiD-xl ${}^{*}$ | 3B | 100 | 72.9 | | 53.8 |
| LLMQA | | | 76.9 | 56.2 | 56.9 |
| LLMQA | | | 76.6 | 57.1 | 57.5 |

4.1. Experimental Setup

Datasets. We select three widely used ODQA benchmarks to evaluate the performance of the baselines and our LLMQA: (1) WebQ (WebQuestions) is a dataset that consists of questions obtained using the Google Suggest API, with the answers being entities from Freebase. (2) NQ (Natural Questions) is a dataset generated from real Google search queries, and the answers are spans within Wikipedia articles. (3) TriviaQA is a collection of trivia questions sourced from trivia and quiz-league websites. Statistics of these three datasets are available in the Appendix.

Baselines. To verify the effectiveness of our method, we compare LLMQA with the following two main types of baselines: (1) Baselines without LLMs: BM25+Bert (Lee et al., 2019) combines sparse retrieval methods with BERT for text representations. REALM (Guu et al., 2020) retrieves relevant documents from a knowledge corpus and incorporates them into the training process of the language model. DPR (Karpukhin et al., 2020) utilizes a dense encoder to encode text passages and questions and retrieves relevant passages based on vector similarity. RAG (Lewis et al., 2020) augments generation with retrieval to enhance ODQA. FiD (Izacard and Grave, 2021) follows the classic retrieve-then-read paradigm with reader sizes of 770M and 3B. (2) Baselines employing LLMs as generators: GenRead (Yu et al., 2023) proposes a clustering-based method that uses LLMs to generate diverse documents. EAR (Chuang et al., 2023) improves evidence quality through query reranking for enhanced expansion.

Evaluation Metrics. Mainstream ODQA methods evaluate answer accuracy using the Exact Match (EM) score (Zhu et al., 2021), which compares the predicted answer $\widetilde{a}$ to each ground-truth answer $a$ in the answer list, to determine if they match. Additionally, the recall score serves as an important metric for assessing the quality of evidence. These two metrics are given by:

(8)		$\displaystyle EM$	$\displaystyle=\frac{\sum_{\widetilde{a},a\in D}exact\_match(\widetilde{a},a)}{|D|}$
(9)		$\displaystyle recall$	$\displaystyle=\frac{\sum_{docs,a\in D}{answer\_hit(docs,a)}}{|D|}$

where $D$ is the dataset, $|D|$ is its size, $\widetilde{a}$ is the predicted answer, $a$ is the ground-truth answer, and $docs$ are the reference documents.
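Equations (8) and (9) can be computed with a few lines of code. The sketch below is illustrative only: the text does not specify how answers are normalized or how a hit is detected, so the `normalize` function (lowercasing and stripping punctuation, articles, and extra whitespace) and the substring test in `answer_hit` are assumptions based on common ODQA evaluation practice.

```python
import re
import string
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answers: Sequence[str]) -> float:
    """1.0 if the prediction equals any ground-truth answer after normalization, else 0.0."""
    return float(any(normalize(prediction) == normalize(a) for a in answers))


def answer_hit(docs: Sequence[str], answers: Sequence[str]) -> float:
    """1.0 if any non-empty normalized answer occurs in any normalized document, else 0.0."""
    norm_docs = [normalize(d) for d in docs]
    norm_answers = [normalize(a) for a in answers if normalize(a)]
    return float(any(a in d for a in norm_answers for d in norm_docs))


def em_score(predictions: Sequence[str], answer_lists: Sequence[Sequence[str]]) -> float:
    return sum(exact_match(p, a) for p, a in zip(predictions, answer_lists)) / len(predictions)


def recall_score(doc_lists: Sequence[List[str]], answer_lists: Sequence[Sequence[str]]) -> float:
    return sum(answer_hit(d, a) for d, a in zip(doc_lists, answer_lists)) / len(doc_lists)


em = em_score(["The Eiffel Tower", "paris"], [["Eiffel Tower"], ["London"]])
recall = recall_score(
    [["Paris is the capital of France."], ["Berlin is in Germany."]],
    [["Paris"], ["Madrid"]],
)
```

Both metrics average a per-question indicator over the dataset, so they differ only in what is compared against the answer list: the predicted answer for EM and the evidence documents for recall.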

Implementation Details. In our proposed approach, we take advantage of the multi-role capability of LLMs. For query expansion, we use gpt-3.5-turbo-16k as the generator by calling the API directly (temperature=0.7, n=10). For document selection, we first retrieve the top-100 documents using DPR (Karpukhin et al., 2020). To select the top-10 reranked documents, we implement sliding window reranking with window size $w=20$ and step $l=10$ . The reranking result within each window is also obtained from gpt-3.5-turbo-16k (temperature=0.7). For answer generation, we follow GenRead (Yu et al., 2023), adopting FiD-xl (3B) as our reader model and fine-tuning it for 10000 steps with the learning rate $lr$ set to 3e-5. For prompt optimization, we refer to P3 (Bach et al., 2022) along with carefully designed role-play instructions to initialize the crucial prompts. For the evaluators used in prompt optimization, we request embeddings from OpenAI's text-embedding-ada-002 to estimate the posterior probability. Our code and data are available at https://github.com/EthanLeo-LYX/LLMQA.

4.2. Overall Performance

Table 1 reports the overall performance. Compared with the baselines without LLMs, LLMQA achieves notable improvements on all three datasets (+10.3 on TriviaQA, +6.3 on WebQ, and +7.4 on NQ), which demonstrates the effectiveness of LLMs for ODQA and suggests that model scale has a substantial effect on the final results. LLMQA surpasses FiD-xl by 8 points on average across the three datasets, even though it uses fewer documents. Thus, role-playing LLMs can compete with previous purpose-built models.

Compared with the baselines employing LLMs as generators, LLMQA also achieves a considerable performance improvement. Both GenRead and EAR+FiD use the generation capability of LLMs to produce documents or query expansions. The gain of our approach stems primarily from the collaboration between multiple role-playing LLMs: in addition to query expansion, we also use LLMs to rerank the retrieved documents. This improvement demonstrates that multiple roles can interact and collaborate with each other and fulfill their tasks well under specific instructions.

4.3. Ablation Study

In this section, we remove the generator, reranker, and evaluator in turn and examine how much each of these three LLM capabilities affects ODQA performance. In addition, we validate the effectiveness of the proposed prompt optimization.

Table 2 shows that, among the three roles played by LLMs, the generator has the most significant impact, which indicates that the query expansion can serve as an auxiliary document. The reranker also contributes to ODQA because the reranked documents are more relevant to the question. The evaluator proves useful as well, since it can estimate evidence quality and select the most suitable candidate. Our experiment on prompt optimization shows that the quality of the prompt design directly affects how well role-playing LLMs perform in the final result, and that prompts in a discrete space can also be optimized.
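As a quick check of this reading of Table 2, the drop caused by removing each component can be averaged over the three datasets. The snippet below only rearranges the numbers from Table 2; averaging across datasets is an illustrative choice, not a metric reported in the paper, and the variable names are hypothetical.

```python
# Scores copied from Table 2 (full model and each ablated variant).
full = {"TriviaQA": 76.62, "WebQ": 57.15, "NQ": 57.56}
ablations = {
    "w/o expansions generator": {"TriviaQA": 68.70, "WebQ": 52.61, "NQ": 53.66},
    "w/o documents reranker": {"TriviaQA": 73.06, "WebQ": 52.71, "NQ": 54.68},
    "w/o candidates evaluator": {"TriviaQA": 73.91, "WebQ": 55.56, "NQ": 57.12},
    "w/o prompt optimization": {"TriviaQA": 73.60, "WebQ": 54.82, "NQ": 56.68},
}


def average_drop(ablated: dict[str, float]) -> float:
    """Mean score lost relative to the full model across the three datasets."""
    return sum(full[name] - ablated[name] for name in full) / len(full)


drops = {name: round(average_drop(scores), 2) for name, scores in ablations.items()}

# Print the variants from the largest to the smallest average drop.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: -{drop:.2f}")
```

Under this averaging, removing the generator costs about 5.45 points, the reranker about 3.63, prompt optimization about 2.08, and the evaluator about 1.58, which is consistent with the generator being the most influential role.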

4.4. Case Study and Error Analysis

Table 2. Ablation results on TriviaQA, WebQ, and NQ datasets.

Method	TriviaQA	WebQ	NQ
LLMQA w/o expansions generator	68.70	52.61	53.66
LLMQA w/o documents reranker	73.06	52.71	54.68
LLMQA w/o candidates evaluator	73.91	55.56	57.12
LLMQA w/o prompt optimization	73.60	54.82	56.68
LLMQA	76.62	57.15	57.56

Table 3. Case study of more relevant evidence than baselines.

Question: when did little polveir win the grand national

Golden Answer: [1989]

LLMQA

Selected Doc. Hit: 10 / 10

Top-3 docs:

”…He won the 1989 Grand National steeplechase…”

”…on 8 April 1989. The race was won in a time…”

”…He is best known…for his performance in the 1989 Derby …”

Generated answer: 1989 (True)

GenRead

Selected Doc. Hit: 1 / 10

Top-3 docs:

”On 6 April 2019, Little Polveir won the Grand …”

”… Little Polveir won the Grand National in 1951.”

”… last time Little Polveir won the Grand National was 1869.”

Generated answer: 1951 (False)

Table 4. Case study of imprecise evidence for hard examples.

Question: who wrote the first declaration of human rights

Golden Answer: [Cyrus]

LLMQA

Selected Doc. Hit: 2 / 10

Top-3 docs:

”… Cyrus the Great… the first human rights document…”

”…first recording of human rights … by Cyrus the Great…”

”…is a human civil rights document … the French Revolution.”

Generated answer: Cyrus the Great (False)

GenRead

Selected Doc. Hit: 1 / 10

Top-3 docs:

”The first declaration of human…written by George Mason”

”…first declaration of human…Virginia Declaration…”

”…first declaration of human rights… Virginia Declaration…”

Generated answer: George Mason (False)

Case study of evidence and answer. In addition to prompt optimization, we examine evidence quality and answer generation during inference. We choose GenRead as a strong baseline for comparison. As shown in Table 3, all of the top-10 evidence documents selected by LLMQA contain the answer and are highly relevant to the given question, resulting in an accurate answer prediction. In contrast, the virtual documents generated by GenRead introduce an inaccurate year, 1951, and the generated answer misses the golden answer. This indicates that reranking retrieved documents can help improve evidence quality and answer accuracy.

Case study of prompt optimization. We analyze how the prompts for query expansion and document reranking differ after optimization. Figure 5 in the Appendix shows that, compared with the initial prompts, the optimized prompts include more details and insights in describing the instruction. For the expansion prompt, a more detailed role-play description and an alternative instruction for solving the task were added. For the reranking prompt, some of the ambiguous content was refined. Consequently, our proposed prompt optimization method produces more detailed, instructive, and explicit prompts.

Error Analysis. Although LLMQA achieves the best results on evidence quality and answer accuracy among the compared methods, some issues remain challenging. These issues include contradictions with facts and world knowledge, which may lead to incorrect predictions or reasoning results. For instance, Table 4 displays a typical failure case in which both LLMQA and GenRead struggle to capture precise evidence, leading to low evidence quality and incorrect answer predictions. Despite these ongoing challenges, LLMQA still achieves the best overall performance on the ODQA task.

4.5. Further Analysis

Analysis of Evidence Quality. We estimate evidence quality using answer recall on the top-$k$ selected documents and compare LLMQA with GenRead (Yu et al., 2023) on the three datasets. Table 6 shows that LLMQA achieves the highest recall in all top-$k$ settings on all three datasets. The results show that relying solely on LLM-generated documents is insufficient, whereas combining generated expansions with retrieved documents yields a substantial increase in answer recall, contributing to the final performance improvement.

Table 5. Effect of location to insert expansion in documents

Location	Expansion@first	Expansion@last	Expansion@random
Example	[ $expansion,doc_{1},\cdots,doc_{n}$ ]	[ $doc_{1},\cdots,doc_{n},expansion$ ]	[ $doc_{1},\cdots,expansion,\cdots,doc_{n}$ ]
EM Score	57.56	56.89	57.14

Table 6. Answer recall for evidence quality.

Dataset	Method	Top-2	Top-4	Top-8
NQ	GenRead-sampling	55.12	62.58	69.64
	GenRead-clustering	55.12	62.58	69.64
	LLMQA	61.99	78.59	82.94
TriviaQA	GenRead-sampling	73.55	77.99	81.55
	GenRead-clustering	76.09	79.65	82.96
	LLMQA	80.67	86.21	87.22
WebQ	GenRead-sampling	58.02	64.67	69.59
	GenRead-clustering	61.17	67.47	72.00
	LLMQA	67.57	77.81	80.07

Table 7. Analysis of strategies in document selection.

Strategies	LLM Sliding	DPR Score	Random
NQ	57.56	54.68	49.19
TriviaQA	76.62	73.04	69.42
WebQ	57.15	52.71	46.82

Quality of Query Expansions. We first evaluate the quality of the query expansions generated by LLMs. In the query expansion procedure, we generate 10 candidates and instruct LLMs to score them according to the specified rules. The left part of Figure 3 shows the recall for the $K$ highest-scored expansions. Most of the generated query expansions already contain the answer, because a large amount of knowledge that may cover the answer is stored in the parameters of LLMs during pre-training.

We also analyze the number of expansions used in our approach. The right part of Figure 3 shows the EM score for different numbers of expansions: the approach without expansions suffers a massive performance drop, while the approach with expansions benefits from increasing their number. This result indicates that, when the retrieved documents are of poor quality, the query expansions generated by LLMs can serve as auxiliary documents that assist in selecting relevant documents and answering the question.

As we treat the expansions as auxiliary documents, the location at which they are inserted may also have an impact. Table 5 shows that, under the context-length constraint, inserting expansions at the beginning of the documents yields the best EM score, suggesting that the reader model may be more sensitive to the beginning of the context.
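The three placements compared in Table 5 can be illustrated with a small helper. This is a sketch only: the function name `insert_expansion` and its parameters are hypothetical, and the random variant here draws a position uniformly over all slots (including both ends), which is an assumption about how the random placement was implemented.

```python
import random
from typing import Optional, Sequence


def insert_expansion(
    docs: Sequence[str],
    expansion: str,
    location: str = "first",
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Return the reader input list with the expansion placed as requested."""
    if location == "first":
        return [expansion, *docs]
    if location == "last":
        return [*docs, expansion]
    if location == "random":
        rng = rng or random.Random()
        # Any slot from 0 (before the first doc) to len(docs) (after the last).
        pos = rng.randint(0, len(docs))
        return [*docs[:pos], expansion, *docs[pos:]]
    raise ValueError(f"unknown location: {location!r}")


docs = ["doc_1", "doc_2", "doc_3"]
print(insert_expansion(docs, "expansion", "first"))
print(insert_expansion(docs, "expansion", "last"))
print(insert_expansion(docs, "expansion", "random", rng=random.Random(0)))
```

If the reader truncates its input to a fixed context length, the first placement is the only one that guarantees the expansion is never cut off, which is one possible reading of the result in Table 5.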

Strategies in Document Reranking. To demonstrate the reranking capability of LLMs, we implement different strategies in the reranking stage of document selection. As Table 7 shows, LLM-based sliding window reranking achieves the best result. Compared with using the DPR score directly, sliding window reranking gains a more comprehensive understanding of the retrieved documents and captures the coarse-grained relevance among the question, expansion, and documents. In contrast, the high-level embedding similarity used in DPR may include considerable noise.

Impact of Document Number. Intuitively, a question is more likely to be answered correctly when more documents are retrieved. We vary the number of documents fed to the reader model, and Figure 4 shows that the EM score rises with the number of documents within a certain range but drops to some extent once the number exceeds 30. This is because simply adding documents may lower the proportion of valid information, making it difficult for the reader model to extract the correct answer from a large number of documents.

5. Conclusion

In this paper, we propose LLMQA, which formulates the ODQA process as three fundamental steps (query expansion, document selection, and answer generation) and combines the strengths of both retrieval-based and generation-based evidence. Since LLMs have showcased remarkable performance on generation, ranking, and evaluation, we use a generalized framework to integrate multi-role LLMs as generator, reranker, and evaluator, which collaboratively contribute to each key step in the ODQA process. Furthermore, we design a novel prompt optimization algorithm to address the limitation of prompt sensitivity, guiding LLMs to produce higher-quality evidence and more accurate answers.

Acknowledgements.

This work was supported by the National Natural Science Foundation of China (NSFC Grant No. 62122089, U2001212, 62032001, and 61932004), Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, and Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China, the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China.

References

Bach et al. (2022) Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279 (2022).
Bohnet et al. (2022) Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037 (2022).
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023).
Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017. Association for Computational Linguistics (ACL), 1870–1879.
Cheng et al. (2021) Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2021. UnitedQA: A Hybrid Approach for Open Domain Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 3080–3090. https://doi.org/10.18653/v1/2021.acl-long.240
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) (2023).
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
Chuang et al. (2023) Yung-Sung Chuang, Wei Fang, Shang-Wen Li, Wen-tau Yih, and James Glass. 2023. Expand, Rerank, and Retrieve: Query Reranking for Open-Domain Question Answering. arXiv preprint arXiv:2305.17080 (2023).
Davison et al. (2019) Joe Davison, Joshua Feldman, and Alexander M Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 1173–1178.
Gao et al. (2019) Shen Gao, Zhaochun Ren, Yihong Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-aware answer generation in e-commerce question-answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 429–437.
Gao et al. (2024) Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, Jun Ma, and Zhaochun Ren. 2024. Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum. In AAAI.
Gao et al. (2020) Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723 (2020).
Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938.
Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised Dense Information Retrieval with Contrastive Learning. https://doi.org/10.48550/ARXIV.2112.09118
Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 874–880. https://doi.org/10.18653/v1/2021.eacl-main.74
Kamalloo et al. (2023) Ehsan Kamalloo, Nouha Dziri, Charles LA Clarke, and Davood Rafiei. 2023. Evaluating Open-Domain Question Answering in the Era of Large Language Models. arXiv preprint arXiv:2305.06984 (2023).
Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
Khattab et al. (2021) Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Relevance-guided supervision for openqa with colbert. Transactions of the association for computational linguistics 9 (2021), 929–944.
Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300 (2019).
Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
Liu et al. (2023) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
Liu et al. (2021a) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2021a. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602 (2021).
Liu et al. (2021b) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385 (2021).
Ma et al. (2023b) Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023b. Zero-Shot Listwise Document Reranking with a Large Language Model. arXiv preprint arXiv:2305.02156 (2023).
Ma et al. (2023a) Yubo Ma, Yixin Cao, YongChing Hong, and Aixin Sun. 2023a. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! arXiv preprint arXiv:2303.08559 (2023).
Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).
Mao et al. (2020) Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. 2020. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553 (2020).
Min et al. (2021) Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, et al. 2021. Neurips 2020 efficientqa competition: Systems, analyses and lessons learned. In NeurIPS 2020 Competition and Demonstration Track. PMLR, 86–111.
Nogueira et al. (2019) Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv preprint arXiv:1904.08375 (2019).
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).
Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066 (2019).
Qin et al. (2023) Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476 (2023).
Qu et al. (2020) Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191 (2020).
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
Reynolds and McDonell (2021) Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 1–7.
Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 5418–5426. https://doi.org/10.18653/v1/2020.emnlp-main.437
Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
Sachan et al. (2023) Devendra Singh Sachan, Mike Lewis, Dani Yogatama, Luke Zettlemoyer, Joelle Pineau, and Manzil Zaheer. 2023. Questions are all you need to train a dense passage retriever. Transactions of the Association for Computational Linguistics 11 (2023), 600–616.
Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4222–4235.
Shinn et al. (2023) Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint arXiv:2303.11366 (2023).
Sordoni et al. (2023) Alessandro Sordoni, Xingdi Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, and Nicolas Le Roux. 2023. Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference. arXiv preprint arXiv:2306.12509 (2023).
Sun et al. (2023a) Hao Sun, Xiao Liu, Yeyun Gong, Yan Zhang, and Nan Duan. 2023a. BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver. arXiv preprint arXiv:2305.14766 (2023).
Sun et al. (2023b) Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. 2023b. From Indeterminacy to Determinacy: Augmenting Logical Reasoning Capabilities with Large Language Models. arXiv preprint arXiv:2310.18659 (2023).
Wang et al. (2023) Cunxiang Wang, Sirui Cheng, Zhikun Xu, Bowen Ding, Yidong Wang, and Yue Zhang. 2023. Evaluating open question answering evaluation. arXiv preprint arXiv:2305.12421 (2023).
Weng et al. (2022) Yixuan Weng, Minjun Zhu, Shizhu He, Kang Liu, and Jun Zhao. 2022. Large language models are reasoners with self-verification. arXiv preprint arXiv:2212.09561 (2022).
Yu et al. (2023) Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2023. Generate rather than retrieve: Large language models are strong context generators. In International Conference for Learning Representation (ICLR).
Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems 34 (2021), 27263–27277.
Yue et al. (2023) Xiang Yue, Boshi Wang, Kai Zhang, Ziru Chen, Yu Su, and Huan Sun. 2023. Automatic evaluation of attribution by large language models. arXiv preprint arXiv:2305.06311 (2023).
Zhou et al. (2023) Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. 2023. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419 (2023).
Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).
Zhu et al. (2021) Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774 (2021).

Appendix A Dataset Statistics

The dataset split and statistics are presented in Table 8.

Table 8. Statistics of ODQA datasets.

Dataset	Train	Dev	Test
WebQ	3,417	361	2,032
TriviaQA	78,785	8,837	11,313
NQ	79,168	8,757	3,610

Appendix B Results of Zero-shot Setting

In principle, our proposed method supports using only LLMs, in different roles, for the ODQA task. However, previous methods (Yu et al., 2023; Chuang et al., 2023) often fine-tune the answer generator in a supervised setting to achieve SOTA performance. To facilitate a fair comparison, we follow this setting to fine-tune our answer generator and obtain the main results in Table 1.

Here, we add a zero-shot setting that directly uses the same LLM (e.g., gpt-3.5-turbo) in all modules, including answer generation. The baseline methods, including BM25 (Chen et al., 2017) / DPR (Karpukhin et al., 2020) / Contriever (Izacard et al., 2021) + InstructGPT (Ouyang et al., 2022), share the same input format as GenRead (Yu et al., 2023). All baseline results are taken from Yu et al. (2023). The results in Table 9 are as expected: model performance in the zero-shot setting is generally worse than in the supervised setting, but LLMQA still outperforms all baselines in the same setting.

Table 9. Zero-shot ODQA performance.

Method	NQ	TriviaQA	WebQ
BM25+InstructGPT	19.7	50.5	15.8
DPR+InstructGPT	29.1	53.8	20.2
Contriever+InstructGPT	18.0	51.3	16.6
GenRead	28.0	59.0	24.6
LLMQA	35.3	63.3	25.5

Appendix C Complexity Analysis

Regarding the number of model parameters to be learned, we compare methods from two aspects: evidence collection and answer generation. Some previous methods require specialized models to collect evidence (e.g., for document retrieval and expansion generation), which introduces a training cost for these specialized models. Our framework is instead based on guiding LLMs to play different roles, so evidence collection involves only inference. Since the inherent capabilities of the reader (answer generator) have an important impact on ODQA performance, state-of-the-art ODQA performance comes from fine-tuning the reader. Following Yu et al. (2023), we employ T5-xl (3B) as the backbone of the answer generator, whose training cost is comparable to that of the baseline.

Appendix D Details of Prompt Optimization

The comparison results before and after prompt optimization for query expansion and document reranking are depicted in Figure 5.