Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Chuan Meng (ORCID 0000-0002-1434-7596), University of Amsterdam, Amsterdam, The Netherlands; Negar Arabzadeh (ORCID 0000-0002-4411-7089), University of Waterloo, Waterloo, Canada; Arian Askari (ORCID 0000-0003-4712-832X), Leiden University, Leiden, The Netherlands; Mohammad Aliannejadi (ORCID 0000-0002-9447-4172), University of Amsterdam, Amsterdam, The Netherlands; and Maarten de Rijke (ORCID 0000-0002-1086-0202), University of Amsterdam, Amsterdam, The Netherlands

(2024)

Abstract.

Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item’s relevance by using open-source large language models (LLMs) to ensure scientific reproducibility.

We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019–2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.

Query performance prediction, Large language models, Relevance judgments

Copyright: rights retained. Journal: TOIS, 2024, volume 1, number 1, article 1, publication month 6. CCS concepts: Information systems → Evaluation of retrieval results.

1. Introduction

Query performance prediction (QPP), a.k.a. query difficulty prediction, has attracted the attention of the information retrieval (IR) community throughout the years (Arabzadeh et al., 2024; Ganguly et al., 2022; Carmel and Yom-Tov, 2010). QPP aims to estimate the retrieval quality of a search system for a query without using human-labeled relevance judgments (Faggioli et al., 2021b). Effective QPP benefits various tasks (Faggioli et al., 2023b), e.g., query variant selection (Di Nunzio and Faggioli, 2021; Scells et al., 2018; Thomas et al., 2017), IR system configuration selection (Deveaud et al., 2016; Tonellotto et al., 2013), and query-specific pool depth prediction (Ganguly and Yilmaz, 2023) to reduce human relevance judgment costs.

Current limitations

QPP methods can be applied in various domains and scenarios (Faggioli et al., 2023b). We are usually concerned with the predicted retrieval quality w.r.t. various IR measures across different scenarios, e.g., our emphasis might be on precision (Faggioli et al., 2023d, 2021a) for conversational search and on recall for legal search (Tomlinson et al., 2007). However, existing QPP approaches typically predict only a single real-valued score that indicates the retrieval quality for a query (Ganguly et al., 2022) and do not require the predicted score to approximate a specific IR evaluation measure (Arabzadeh et al., 2023; Singh et al., 2023; Tao and Wu, 2014; Shtok et al., 2012; Zhou and Croft, 2007). This results in two key limitations: (i) While predicted performance scores have been shown to correlate with some IR evaluation metrics (Datta et al., 2022c; Ganguly et al., 2022), relying on a single value to represent different IR evaluation measures leads to a “one size fits all” approach, which is problematic because the literature shows that some IR metrics do not correlate well and the agreement varies across scenarios and queries (Gupta et al., 2019; Jones et al., 2015). Although some studies train regression-based QPP models to predict a specific IR evaluation measure (Khodabakhsh and Bagheri, 2023; Chen et al., 2022; Datta et al., 2022a; Arabzadeh et al., 2021b; Hashemi et al., 2019), they require training separate models to predict different measures, leading to lots of storage and running costs. (ii) A single-score prediction limits the interpretability of QPP. It is insufficient to explain QPP outputs or to analyze and fix inaccurate QPP results based solely on a single score. We argue that more in-depth and interpretable insights into QPP outputs are required.

A novel QPP framework

We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), in which we decompose QPP into independent subtasks of automatically predicting the relevance of each item in a ranked list for a given query. QPP-GenRE comes with various advantages: (i) it allows us to directly predict any desired IR evaluation measure at no additional cost, using automatically generated relevance judgments as pseudo-labels; and (ii) the generated relevance judgments provide an explanation beyond simply gauging how difficult or easy a query is by offering information about why the query is predicted as being difficult or easy; moreover, we can translate the “QPP errors” into easily observable “relevance judgment errors,” e.g., false positives or negatives, informing potential ways of improving QPP quality by fixing observed relevance judgment errors.

QPP-GenRE can be integrated with various approaches for judging relevance. Obviously, for QPP-GenRE to excel in QPP, it is crucial to equip QPP-GenRE with an approach capable of accurately generating relevance judgments. Recent studies (Rahmani et al., 2024; Ma et al., 2024; Thomas et al., 2023; Faggioli et al., 2023a; MacAvaney and Soldaini, 2023; Gilardi et al., 2023) have shown the potential effectiveness of using LLMs to generate relevance judgments. We build on them to equip QPP-GenRE with LLMs for judging relevance. However, those studies have certain limitations: several authors have prompted commercial LLMs (e.g., ChatGPT, GPT-3.5/4, GPT-4o) to generate relevance judgments (e.g., Upadhyay et al., 2024b; Rahmani et al., 2024; Upadhyay et al., 2024a; Zhang et al., 2024; Ma et al., 2024; Thomas et al., 2023; Faggioli et al., 2023a); commercial LLMs come with limitations like non-reproducibility, non-deterministic outputs and potential data leakage between pretraining and evaluation data, impeding their value in scientific research (Pradeep et al., 2023b; Zhang et al., 2023c; Pradeep et al., 2023a). Although MacAvaney and Soldaini (2023) prompt small-scale open-source language models (e.g., Flan-T5 (Chung et al., 2022) with 3B parameters) for generating relevance judgments, they focus on a setting wherein the model is already given one relevant item for each query, which does not apply to QPP as we typically do not know any relevant item for a query in advance. In this paper, we focus on the use of open-source LLMs for generating relevance judgments in a realistic setting where we lack prior knowledge of any relevant items for a query. There are only few studies (Zhang et al., 2024; Salemi and Zamani, 2024; Upadhyay et al., 2024a; Khramtsova et al., 2024) attempting to prompt open-source LLMs in this setting.

Challenges

We face two challenges when using QPP-GenRE for QPP: (i) predicting IR metrics considering recall, ideally, entails identifying all relevant items in the entire corpus for a query; however, using an LLM to judge the entire corpus per query is impractical due to the significant computational overhead; (ii) our experiments reveal that directly prompting open-source LLMs in a zero-/few-shot manner yields limited effectiveness in predicting relevance, resulting in limited QPP quality; this aligns with recent findings indicating limited success in prompting open-source LLMs for specific tasks (Qin et al., 2023).

Solutions

To address the challenges listed above, (i) we devise an approximation strategy to predict IR measures considering recall by only judging a few items in the ranked list for a query and using them to estimate the metric, hence avoiding the cost of traversing the entire corpus to identify all relevant items for a query; the approximation strategy also enables us to investigate the impact of various judging depths in the ranked list on QPP quality; and (ii) we enhance an open-source LLM’s ability to generate relevance judgments by training it with parameter-efficient fine-tuning (PEFT) (Dettmers et al., 2023) on human-labeled relevance judgments; unlike previous supervised QPP methods that need to train separate models for predicting different IR evaluation measures, training LLMs to judge relevance is agnostic to a specific IR metric.

Experiments

We integrate QPP-GenRE with a leading open-source LLM, LLaMA (Touvron et al., 2023b, a) for generating relevance judgments; LLaMA has shown competitive performance on many natural language processing (NLP) and IR tasks (Touvron et al., 2023b, a). Experiments on datasets from the TREC 2019–2022 deep learning (TREC-DL) tracks (Craswell et al., 2022, 2021, 2020, 2019) show that QPP-GenRE achieves state-of-the-art QPP quality in estimating the retrieval quality of a lexical ranker (BM25) and a neural ranker (ANCE, Xiong et al., 2021) in terms of RR@10, a precision-oriented IR metric, and nDCG@10, an IR metric considering recall. See Sections 6.1 and 6.2.

We also find that using LLMs to directly model QPP, i.e., asking LLMs to directly generate values of IR evaluation metrics, performs much worse than QPP-GenRE. This finding reveals that QPP-GenRE is a more effective way of modeling QPP using LLMs. Furthermore, our experiments demonstrate the effectiveness of our devised approximation strategy in nDCG@10: QPP-GenRE achieves state-of-the-art QPP quality at the shallow judging depth 10, and QPP-GenRE’s QPP quality reaches saturation when it further judges up to 100–200 retrieved items in a ranked list. See Section 7.1.

Moreover, we investigate the impact of fine-tuning and the choice of LLMs on the quality of generated relevance judgments and QPP. Besides LLaMA-7B, we consider the newly-released Llama-3-8B (AI@Meta, 2024), and Llama-3-8B-Instruct (AI@Meta, 2024), and compare their fine-tuned versions against their few-shot prompted counterparts. We find that fine-tuning markedly improves the quality of relevance judgment generation and QPP for all LLMs; the performance of fine-tuned LLMs in terms of judging relevance exceeds that of a commercial LLM (GPT-3.5) (Faggioli et al., 2023a). Also, we observe that the newly released and instruction-tuned LLMs generally deliver better performance. This implies that with a more effective LLM, QPP-GenRE has the potential to be more effective at QPP. See Section 7.2.

Additionally, to show QPP-GenRE’s compatibility with other types of relevance prediction methods, we adapt a state-of-the-art pointwise LLM-based re-ranker, RankLLaMA (Ma et al., 2023a), into a relevance judgment generator by applying a threshold to its re-ranking scores. Our results indicate that QPP-GenRE integrated with RankLLaMA achieves high QPP quality, at the cost of tuning a proper threshold; this demonstrates that QPP-GenRE is not tied to a single way of predicting relevance. See Section 7.3.

We also analyze QPP errors based on automatically generated relevance judgments, demonstrating QPP-GenRE’s interpretability. See Section 7.4.

Finally, our computational cost analysis shows that QPP-GenRE has lower latency than some supervised QPP baselines when predicting multiple measures, because multiple measures can be derived from the same set of relevance judgments. Although QPP-GenRE has higher latency than other QPP baselines when predicting only one metric, QPP-GenRE’s latency is still 20 times lower than that of the state-of-the-art GPT-4-based listwise re-ranker (Sun et al., 2023b). See Section 7.5.

Application scenarios

Given QPP-GenRE’s high QPP quality and interpretability, it is well-suited for some knowledge-intensive professional search scenarios, e.g., legal (Tomlinson et al., 2007) or patent search (Lupu et al., 2013), where accurate QPP is prioritized, interpretable QPP results are preferred, and users may have a higher tolerance level for latency than users in web search. Plus, QPP-GenRE can be used to analyze how well a search system performs in offline settings (Faggioli et al., 2023d), where latency is not necessarily an issue.

One might argue, if QPP-GenRE needs to be integrated with an LLM to predict ranking quality, why not directly use the LLM for re-ranking? However, we reveal that QPP-GenRE integrated with LLaMA-7B already achieves high QPP quality and remains significantly more efficient than costly state-of-the-art LLM-based re-rankers (e.g., the GPT-4-based listwise re-ranker (Sun et al., 2023b)). Calling those expensive LLM-based re-rankers is often unnecessary, as many initial rankings are good enough and either do not require re-ranking or only need very shallow re-ranking depths (Meng et al., 2024). Therefore, sufficiently accurate QPP for initial rankings is needed to guide the decision on whether to use the expensive re-ranker, or to determine the optimal re-ranking depth that does not waste computational resources. Given QPP-GenRE’s substantial improvements in QPP quality over previous QPP methods and significantly lower latency compared to those expensive re-rankers, it is valuable to make QPP-GenRE work with state-of-the-art, yet much more costly, LLM-based re-rankers (Sun et al., 2023b) to achieve a better balance between effectiveness and efficiency in re-ranking.

Reproducibility

To facilitate future research, we release our data, scripts for fine-tuning/inference, sampled demonstration examples for few-shot prompting, and fine-tuned checkpoints for LLaMA-7B, Llama-3-8B, and Llama-3-8B-Instruct at https://github.com/ChuanMeng/QPP-GenRE.

Contributions

Our main contributions are as follows:

• We propose a novel QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to the query, and predicts different IR evaluation measures based on the relevance predictions.

• We devise an approximation strategy to predict IR measures considering recall, avoiding the cost of traversing the entire corpus to identify all relevant items for a query.

• We fine-tune leading open-source LLMs, LLaMA-7B, Llama-3-8B, and Llama-3-8B-Instruct for the task of automatically generating relevance judgments.

• We conduct experiments on four datasets, showing that QPP-GenRE outperforms the state-of-the-art QPP baselines on the TREC-DL 19–22 datasets in predicting RR@10 and nDCG@10 in terms of Pearson’s $\rho$ and Kendall’s $\tau$.

2. Related Work

Our work is relevant to four strands of research: query performance prediction (QPP) (Section 2.1), zero/few-shot prompting and parameter-efficient fine-tuning (PEFT) for LLMs (Section 2.2), LLMs for generating relevance judgments (Section 2.3), and LLMs for re-ranking (Section 2.4).

2.1. Query performance prediction

Query performance prediction (QPP) has attracted considerable attention in the IR and NLP communities and has been widely studied in ad-hoc search (Singh et al., 2023; Faggioli et al., 2023e, f; Datta et al., 2022c), conversational search (Faggioli et al., 2023c, d; Abbasiantaeb et al., 2023; Meng et al., 2023c, a; Sun et al., 2021; Meng, 2024; Meng et al., 2023b), question answering (Samadi and Rafiei, 2023; Hashemi et al., 2019), and image retrieval (Poesina et al., 2023). This paper focuses on QPP for ad-hoc search.

Typically, QPP methods are divided into two categories: pre- and post-retrieval methods (Carmel and Yom-Tov, 2010). The former predicts the difficulty of a given query by using features of the query and corpus, while the latter further uses features of a ranked list returned by a ranker for the query (Carmel and Yom-Tov, 2010). This paper focuses on post-retrieval QPP methods.

A large number of unsupervised and supervised post-retrieval QPP methods have been proposed (Carmel and Yom-Tov, 2010) for predicting the performance of lexical rankers, such as query likelihood (Lafferty and Zhai, 2001) and BM25 (Robertson et al., 2009). Unsupervised QPP methods can be classified into clarity-based (Cronen-Townsend et al., 2002), robustness-based (Zhou and Croft, 2007, 2006; Aslam and Pavlu, 2007), coherence-based (Arabzadeh et al., 2021a; Diaz, 2007), and score-based (Tao and Wu, 2014; Shtok et al., 2012; Cummins et al., 2011; Pérez-Iglesias and Araujo, 2010; Zhou and Croft, 2007). More recently, a set of supervised QPP methods have been proposed (Arabzadeh et al., 2021b; Hashemi et al., 2019; Zamani et al., 2018; Datta et al., 2022a, b; Chen et al., 2022; Khodabakhsh and Bagheri, 2023). NeuralQPP (Zamani et al., 2018) and Deep-QPP (Datta et al., 2022a) are optimized from scratch. NQA-QPP (Hashemi et al., 2019) and BERT-QPP (Arabzadeh et al., 2021b) fine-tune BERT (Devlin et al., 2019) to improve QPP effectiveness. Further, Datta et al. (2022c) propose qppBERT-PL, which considers list-wise-document information, while Chen et al. (2022) propose BERT-groupwise-QPP that considers both cross-query and cross-document information. Khodabakhsh and Bagheri (2023) propose a multi-task query performance prediction framework (M-QPPF), learning document ranking and QPP simultaneously.

Post-retrieval QPP methods designed for lexical rankers struggle to predict the retrieval quality of neural rankers (Faggioli et al., 2023f; Hashemi et al., 2019), motivating several new unsupervised post-retrieval QPP methods designed for neural rankers. Datta et al. (2022b) propose a weighted relative information gain-based model (WRIG), which assesses a neural ranker for a given query by considering the relative difference of predicted performance between the given query and its variants; Zendel et al. (2023) assess a neural re-ranker by measuring the entropy of scores returned by it; Faggioli et al. (2023e) propose neural-ranker-specific ways of calculating regularization terms used by unsupervised post-retrieval QPP methods; Vlachou and Macdonald (2023) propose an unsupervised coherence-based QPP method that employs neural embedding representations to assess dense retrievers; and Singh et al. (2023) propose pairwise rank preference-based QPP (QPP-PRP) for predicting the performance of a neural ranker by measuring the degree to which a pairwise neural re-ranker (e.g., DuoT5 (Pradeep et al., 2021)) agrees with the ranked list returned by the neural ranker.

We present a novel QPP perspective: we start by automatically generating relevance judgments for a ranked list for a query and then proceed to predict IR evaluation measures for the ranked list. To the best of our knowledge, no prior work addresses QPP from this perspective.

Unlike regression-based QPP models (Khodabakhsh and Bagheri, 2023; Chen et al., 2022; Datta et al., 2022a; Arabzadeh et al., 2021b; Hashemi et al., 2019), which require training separate models to predict different IR evaluation measures, the training of LLMs for judging relevance in the QPP-GenRE method that we propose is agnostic to a specific IR evaluation measure, and different measures can be derived from the same set of generated relevance judgments.

We also differ from qppBERT-PL (Datta et al., 2022c), which first predicts the number of relevant items for each chunk in a ranked list and then aggregates those numbers into a general QPP score. However, qppBERT-PL’s output is still presented as a single scalar, which is insufficient to accurately represent different evaluation measures; also, it is infeasible to predict arbitrary IR measures only using the number of relevant items in a ranked list.

The work closest to QPP-GenRE, which is still different, is QPP using effectiveness evaluation without relevance judgments (EEwRJ) (Mizzaro et al., 2018). The goal of EEwRJ methods is to predict search system effectiveness in a TREC-like environment. E.g., a method proposed by Soboroff et al. (2001) randomly samples items from a pool for a query and treats these items as relevant; the intuition is that if an item is ranked highly by many search systems, it is likely to be pooled and therefore considered relevant. Mizzaro et al. (2018) explore applying EEwRJ methods to QPP. However, QPP using EEwRJ suffers from two limitations: (i) EEwRJ requires obtaining ranked lists returned by all search systems in a given TREC edition to predict the difficulty of a query, and (ii) EEwRJ encounters normalization challenges when predicting the ranking quality for a ranked list returned by a specific search system (Mizzaro et al., 2018). QPP-GenRE does not face these limitations.

2.2. Zero/few-shot prompting and parameter-efficient fine-tuning (PEFT) for LLMs

While fine-tuning pre-trained language models has given rise to many state-of-the-art results (Devlin et al., 2019), fully fine-tuning LLMs for a specific task on consumer-level hardware is typically infeasible (Zhu et al., 2023) because of the large number of parameters of LLMs. As a result, there are three prevailing ways to adapt LLMs to a specific task: zero-shot prompting, few-shot prompting, a.k.a. in-context learning (ICL) (Dong et al., 2022; Brown et al., 2020), and parameter-efficient fine-tuning (PEFT) (Dettmers et al., 2023; Liu et al., 2022; Hu et al., 2021).

There is limited success in only prompting open-source LLMs for certain tasks (Qin et al., 2023). Zero-shot prompting instructs an LLM to perform a specific task by inputting a text instruction. To get a promising result, zero-shot prompting is usually based on instruction-tuned LLMs (Zhang et al., 2023b; Qin et al., 2023), such as Flan-T5 (Chung et al., 2022), Flan-UL2 (Tay et al., 2022). However, Sun et al. (2023a) show that the performance of zero-shot prompting degrades considerably if an LLM is fed an instruction that was not observable during its training. ICL inputs a few input-target pairs (a.k.a. demonstrations) to an LLM, which would make an LLM learn from analogy (Dong et al., 2022) without updating its parameters. However, ICL has a high computational cost because it needs to feed input-target pairs to an LLM for each prediction; also, ICL requires substantial manual prompt engineering because an LLM’s performance (Liu et al., 2022) is sensitive to the formatting of the prompt (e.g., the wording and the order of input-target pairs).

PEFT addresses the above limitations; it aims to adapt an LLM to a specific task by training only a small fraction of its parameters. Low-rank adaptation (LoRA), a widely used PEFT method (Gema et al., 2023; Liu and Low, 2023; Zhang et al., 2023a; Santilli and Rodolà, 2023), has been shown to achieve comparable performance to full-model fine-tuning (Dettmers et al., 2023; Lu et al., 2023); LoRA adds learnable low-rank adapters to each network layer of an LLM (Hu et al., 2021) while all original parameters of the LLM are frozen. QLoRA (Dettmers et al., 2023) further reduces the memory usage of LoRA without sacrificing performance; QLoRA first quantizes an LLM to 4 bits before adding and optimizing low-rank adapters. Our work explores the possibility of training open-source LLMs with QLoRA to generate relevance judgments.
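To make the low-rank idea concrete, the following is a minimal sketch of a single LoRA-adapted linear layer, following the formulation of Hu et al. (2021): the frozen weight $W_0\in\mathbb{R}^{d\times k}$ is augmented with a trainable product $BA$, where $B\in\mathbb{R}^{d\times r}$, $A\in\mathbb{R}^{r\times k}$ and $r\ll\min(d,k)$. The function names, the toy matrices and the dimensions below are illustrative assumptions and are not taken from our implementation; the scaling factor used in practice is omitted for brevity.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector) -> Vector:
    """Compute h = W0 x + B (A x); only A and B would be trained."""
    base = matvec(w0, x)               # frozen pre-trained path
    update = matvec(b, matvec(a, x))   # low-rank adapter path
    return [p + q for p, q in zip(base, update)]


def trainable_fraction(d: int, k: int, r: int) -> float:
    """Adapter parameters r*(d+k) relative to the d*k frozen parameters."""
    return (r * (d + k)) / (d * k)


# Toy example (hypothetical values): d=2, k=3, rank r=1.
w0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
a = [[0.5, 0.5, 0.5]]
b_zero = [[0.0], [0.0]]  # B starts at zero, so the adapter is initially a no-op
x = [1.0, 2.0, 3.0]

frozen_output = matvec(w0, x)
adapted_output = lora_forward(w0, a, b_zero, x)
```

With a zero-initialized $B$, the adapted layer reproduces the frozen layer exactly, which is why training can start from the pre-trained behavior. For a hypothetical $4096\times 4096$ projection with $r=8$, `trainable_fraction(4096, 4096, 8)` shows that the adapter holds well under 1% of the layer's parameters.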

2.3. LLMs for generating relevance judgments

Automatically generating relevance judgments is a long-standing goal in IR that has been studied for multiple decades (Makary et al., 2017, 2016; Ravana et al., 2015; Nuray and Can, 2006, 2003; Soboroff et al., 2001). Recent studies have demonstrated promising results of using LLMs for the automatic generation of relevance judgments (Thomas et al., 2023; Faggioli et al., 2023a). In this paper we focus on studies into generating relevance judgments with discrete classes (e.g., “Relevant” or “Irrelevant”), instead of generating continuous relevance labels in real numbers (Yan et al., 2024). We discuss related studies into LLM-based automatic generation of relevance judgments from two dimensions: (i) how LLMs are used to generate relevance judgments, and (ii) their applications.

Recent studies have explored prompting commercial LLMs (e.g., GPT-3.5 and GPT-4) or open-source LLMs in zero- or few-shot manners. Specifically, Faggioli et al. (2023a) use zero- and few-shot prompting to instruct GPT-3.5 to predict the relevance of an item to a query. Thomas et al. (2023) instruct GPT-4 by zero-shot prompting, and add to the prompt a detailed query description and consider chain-of-thought (Wei et al., 2022). Similarly, Rahmani et al. (2024) follow Thomas et al. (2023)’s prompt to instruct GPT-4. Ma et al. (2024) instruct GPT-3.5 to generate relevance judgments for a domain-specific scenario, i.e., legal case retrieval (Ma et al., 2021); they use prompts specifically designed for this scenario. More recently, Upadhyay et al. (2024b) prompt GPT-4o in a zero-shot manner. Besides using commercial LLMs, only a few studies (Zhang et al., 2024; Salemi and Zamani, 2024; Upadhyay et al., 2024a; Khramtsova et al., 2024) explore prompting open-source LLMs to generate relevance judgments. E.g., Khramtsova et al. (2024), Upadhyay et al. (2024a) and Salemi and Zamani (2024) prompt Flan-T5 (Chung et al., 2022), Vicuña-7B (Zheng et al., 2024) and Mistral (Jiang et al., 2023), respectively, in either zero-shot or few-shot manners. MacAvaney and Soldaini (2023) focus on a special scenario where a relevant item for a given query is already known and use Flan-T5 (Chung et al., 2022) to estimate the relevance of another item to the query given the known relevant item.

Recent studies have explored using LLM-generated relevance judgments to benefit (i) search system evaluation (Upadhyay et al., 2024a; Rahmani et al., 2024; Thomas et al., 2023; MacAvaney and Soldaini, 2023; Faggioli et al., 2023a), (ii) ranker selection (Khramtsova et al., 2024), (iii) item selection and retrieval quality evaluation in retrieval-augmented generation (RAG) (Zhang et al., 2024; Salemi and Zamani, 2024) and (iv) retriever fine-tuning (Ma et al., 2024). Concerning (i), recent studies (Rahmani et al., 2024; Upadhyay et al., 2024a; Abbasiantaeb et al., 2024; Faggioli et al., 2023a; Thomas et al., 2023; MacAvaney and Soldaini, 2023) explore evaluating search systems either entirely using LLM-generated relevance judgments or partially using LLM-generated relevance judgments (a.k.a. filling holes). They have demonstrated a high correlation between search system rankings based on LLM- and human-labelled relevance judgments. As to (ii), given a pool of dense retrievers, Khramtsova et al. (2024) select a suitable one for a target corpus by estimating their performance using LLM-generated queries and relevance judgments specific to the target corpus. For (iii), for item selection, Zhang et al. (2024) prompt LLMs to generate relevance judgments for retrieved candidate items in RAG; the items that are predicted as “relevant” are used for text generation. Zhang et al. (2024) observe that items selected via relevance prediction resulted in sub-optimal text generation quality. For retrieval quality evaluation, Salemi and Zamani (2024) generate relevance judgments for retrieved candidate items and aggregate those judgments into a score. However, Salemi and Zamani (2024) found that the aggregated score based on the LLM-generated relevance judgments achieves a low correlation with the text generation quality of RAG. Concerning (iv), Ma et al. (2024) fine-tune a legal case retriever on a training set augmented with LLM-generated relevance judgments. 
They show that fine-tuning a legal case retriever using the generated relevance judgments results in enhanced performance.

Our work differs from the studies mentioned above: (i) we explore the possibility of fine-tuning open-source LLMs for generating relevance judgments; unlike MacAvaney and Soldaini (2023), we focus on a more practical scenario wherein no relevant item is known in advance for each query; and (ii) we focus on QPP and predict the ranking quality of a ranked list for a query using LLM-generated relevance judgments, which previous studies have not explored.

2.4. LLMs for re-ranking

Research on using LLMs for re-ranking has made remarkable progress (Meng et al., 2024; Askari et al., 2023; Zhuang et al., 2023b; Bommasani et al., 2023; Ma et al., 2023a, b; Zhuang et al., 2023a; Drozdov et al., 2023; Sachan et al., 2022; Zhang et al., 2023c; Pradeep et al., 2023b; Tang et al., 2023; Pradeep et al., 2023a; Sun et al., 2023b; Zhuang et al., 2023c; Hou et al., 2023). There are four paradigms of LLM-based re-ranking: pointwise, pairwise, listwise, and setwise (Zhuang et al., 2023c). Given a query, pointwise re-rankers produce a relevance score for each item independently, and the final ranking is formed by sorting items by relevance score (Ma et al., 2023a; Drozdov et al., 2023; Zhuang et al., 2023b; Sachan et al., 2022). The pairwise paradigm (Qin et al., 2023) eliminates the need for computing relevance scores; given a query and a pair of items, a pairwise re-ranker estimates whether one item is more relevant than the other for the query. Listwise re-rankers (Zhang et al., 2023c; Pradeep et al., 2023b; Tang et al., 2023; Pradeep et al., 2023a; Ma et al., 2023b; Sun et al., 2023b) frame re-ranking as a pure generation task and directly output the reordered ranked list given a query and a ranked list returned by a first-stage retriever. Given the low efficiency of pairwise (multiple inference passes) and listwise (multiple decoding steps) re-rankers, the setwise paradigm (Zhuang et al., 2023c) is meant to improve the efficiency while retaining re-ranking effectiveness. Given a query and a set of items, an LLM is asked which item is the most relevant one to the query; these items are reordered according to the LLM’s output logits of each item being chosen as the most relevant item to the query, which only requires one decoding step of an LLM.

Our work differs from this line of research because we generate explicit relevance judgments with discrete classes (e.g., “Relevant” or “Irrelevant”), whereas studies into LLMs for re-ranking aim to predict the relevance order of items. However, using LLMs for generating relevance judgments and for re-ranking are intrinsically the same task: relevance prediction. Thus, an LLM-based re-ranker has the potential to serve as a relevance judgment generator.

Our main contribution in this paper is the introduction of QPP-GenRE, a novel QPP framework, which, in theory, can be integrated with various relevance prediction approaches. To demonstrate the compatibility of QPP-GenRE with various relevance prediction approaches, we adapt a state-of-the-art pointwise LLM-based re-ranker, RankLLaMA (Ma et al., 2023a), into a relevance judgment generator by applying a threshold to its re-ranking scores; we then integrate QPP-GenRE with this adapted RankLLaMA. Note that adapting other types of LLM-based re-rankers, e.g., pairwise and listwise LLM-based re-rankers, into relevance judgment generators is challenging because they do not produce explicit relevance scores. Exploring their use as relevance judgment generators is beyond the scope of this paper.
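The thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy scores and the selection criterion (agreement with human labels on a held-out set) are hypothetical and do not describe the exact tuning procedure used in our experiments.

```python
from typing import List, Sequence


def scores_to_judgments(scores: Sequence[float], threshold: float) -> List[int]:
    """Map pointwise re-ranking scores to binary relevance judgments (1/0)."""
    return [1 if s >= threshold else 0 for s in scores]


def tune_threshold(
    scores: Sequence[float],
    human_labels: Sequence[int],
    candidates: Sequence[float],
) -> float:
    """Pick the candidate threshold whose judgments agree most with human labels.

    Assumption: accuracy on a held-out set is used as the selection criterion;
    other criteria (e.g., downstream QPP quality) are equally plausible.
    """
    def accuracy(threshold: float) -> float:
        predictions = scores_to_judgments(scores, threshold)
        matches = sum(p == h for p, h in zip(predictions, human_labels))
        return matches / len(human_labels)

    return max(candidates, key=accuracy)


# Toy validation data (hypothetical re-ranker scores and human labels).
validation_scores = [2.1, -0.5, 0.3, 1.7]
validation_labels = [1, 0, 0, 1]
best_threshold = tune_threshold(validation_scores, validation_labels, [0.0, 1.0, 2.0])
judgments = scores_to_judgments(validation_scores, best_threshold)
```

The extra tuning step is the price of reusing a re-ranker: unlike a model trained to emit discrete labels, a score-producing model needs a decision boundary before its output can serve as pseudo-labels.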

3. Task definition

In this paper, we focus on post-retrieval QPP (Carmel and Yom-Tov, 2010). Generally, a post-retrieval QPP method $\psi$ aims to estimate the retrieval quality of a ranked list $L=[d_{1},\dots,d_{i},\dots,d_{|L|}]$ with $|L|$ retrieved items induced by a ranker $M$ over a corpus $C$ in response to query $q$ without human-labeled relevance judgments, formally:

(1)

p=\psi(q,L,C)\in\mathbb{R},

where $p$ indicates the predicted retrieval quality of the ranker $M$ in response to the query $q$ ; typically, $p$ is expected to be correlated with an IR evaluation measure, such as reciprocal rank (RR).

4. Method

4.1. Overview of QPP-GenRE

We propose QPP-GenRE, which consists of two steps: (i) generating relevance judgments using LLMs, and (ii) predicting IR evaluation measures. In (i), we employ an LLM to generate relevance judgments for the top-$n$ retrieved items in the ranked list for a given query; to improve the LLM’s effectiveness in generating relevance judgments, we fine-tune it with parameter-efficient fine-tuning (PEFT) using human-labeled relevance judgments. In (ii), we regard the generated relevance judgments as pseudo-labels to calculate different IR evaluation measures.

4.2. Generating relevance judgments using LLMs

4.2.1. Inference

Given the ranked list $L=[d_{1},\dots,d_{i},\dots,d_{|L|}]$ with $|L|$ items returned by a ranker $M$ for a query $q$ , an LLM is employed to automatically predict the relevance of each item in the top- $n$ positions of the ranked list $L$ to the query $q$ , formally:

(2)

\hat{r}_{i}=\mathrm{LLM}(q,d_{i}),

where $\hat{r}_{i}\in\{1,0\}$ is the predicted relevance value for the item $d_{i}$ at rank $i$; “1” indicates relevant and “0” irrelevant. We leave the prediction of multi-graded labels as future work. After automatically judging the top-$n$ items in the ranked list $L$, we get a list of generated relevance judgments $\hat{\mathcal{R}}_{L_{1:n}}=[\hat{r}_{1},\dots,\hat{r}_{i},\dots,\hat{r}_{n}]$. We design a prompt to instruct an LLM on the task of automatic generation of relevance judgments, as illustrated in Figure 1.
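The judging loop behind Eq. (2) can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: `judge_top_n`, the `llm_judge` callable and the keyword-overlap stand-in `toy_judge` are hypothetical names introduced here; in practice `llm_judge` would wrap a prompted or fine-tuned LLM that answers “Relevant” or “Irrelevant”.

```python
from typing import Callable, List


def judge_top_n(
    query: str,
    ranked_list: List[str],
    llm_judge: Callable[[str, str], str],
    n: int,
) -> List[int]:
    """Return binary judgments [r_1, ..., r_n] for the top-n items of a ranked list.

    `llm_judge` is a hypothetical callable mapping (query, item text) to the
    string "Relevant" or "Irrelevant"; every item is judged independently.
    """
    judgments = []
    for item in ranked_list[:n]:
        answer = llm_judge(query, item).strip().lower()
        judgments.append(1 if answer == "relevant" else 0)
    return judgments


def toy_judge(query: str, item: str) -> str:
    """Stand-in for an LLM: 'Relevant' if the item shares a word with the query."""
    overlap = set(query.lower().split()) & set(item.lower().split())
    return "Relevant" if overlap else "Irrelevant"


ranked = ["neural ranking models", "cooking pasta at home", "ranking with BM25"]
print(judge_top_n("ranking models", ranked, toy_judge, n=2))
```

Because each item is judged independently of the others, the calls can be batched or parallelized, and an individual judgment can be inspected when a predicted measure looks wrong.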

4.2.2. Parameter-efficient fine-tuning (PEFT)

To further improve an LLM’s effectiveness in generating relevance judgments, we use human-labeled relevance judgments to train the LLM with an effective PEFT method, QLoRA (Dettmers et al., 2023). Specifically, we first quantize the LLM to 4-bit precision, add learnable low-rank adapters to each network layer of the LLM, and then optimize only the low-rank adapters. Formally, given the query $q$ and an item $d_{i}$ in the ranked list $L$, we optimize the LLM to generate the human-labeled relevance value $r_{i}$ for the item $d_{i}$. See Section 5 for more details.

4.3. Predicting IR evaluation measures

4.3.1. Predicting precision-oriented measures

We compute a precision-oriented measure based on the LLM-generated relevance judgments $\hat{\mathcal{R}}_{L_{1:n}}$ for the top-$n$ items in the ranked list $L$. The following is an example of computing RR@k:

(3)

RR@k=1/\min_{i}\{\hat{r}_{i}>0\},

where $0<i\leqslant k$. $RR@k$ equals 0 if no top-$k$ item is predicted as relevant to the query $q$. For this measure, the judging depth equals the cutoff, i.e., $n=k$.
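As a small sketch of Eq. (3), the function below computes RR@k from a list of generated binary judgments given in rank order. The function name `reciprocal_rank_at_k` is ours and is introduced only for illustration.

```python
from typing import Sequence


def reciprocal_rank_at_k(judgments: Sequence[int], k: int) -> float:
    """RR@k from binary judgments ordered by rank (index 0 is rank 1).

    Returns 1 / (rank of the first item judged relevant within the top k),
    or 0.0 if no top-k item is judged relevant.
    """
    for rank, rel in enumerate(judgments[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


print(reciprocal_rank_at_k([0, 0, 1, 0, 1], k=10))  # first relevant item at rank 3
print(reciprocal_rank_at_k([0, 0, 0], k=10))        # no item judged relevant
```

Only the first $k$ judgments are ever read, which is why judging the top $k$ items suffices for this measure.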

Figure 1. Prompt used by LLMs for automatic generation of relevance judgments.

4.3.2. An approximation strategy to predict measures considering recall

As the computation of a measure considering recall requires knowledge of all relevant items in the corpus $C$ for a given query $q$, we would need to automatically assess every item in $C$, which is infeasible due to the high computational cost. To address this issue, we devise an approximation strategy for predicting an IR measure considering recall: we judge only the top-$n$ items in the ranked list $L$ and use the items predicted as relevant to approximate all relevant items in the corpus, which avoids the cost of judging the entire corpus. Fröbe et al. (2023); Moffat (2017); Lu et al. (2016) define normalized discounted cumulative gain (nDCG) (Järvelin and Kekäläinen, 2002) at a cutoff $k$ as a recall-oriented IR evaluation metric because it is normalized by a recall-oriented “best possible” ranking. [Footnote 1: In this paper, we employ nDCG@10 and regard it as a metric considering recall: Figure 3 illustrates that to reach saturation in predicting nDCG@10 values for ANCE and BM25, judgments up to the top 100 and 200 retrieved items are needed, respectively. If it were a precision-based metric, saturation could be achieved by judging around 10 items. nDCG@10 is also the primary official IR evaluation metric in TREC-DL 19–22 (Craswell et al., 2022, 2021, 2020, 2019).] Thus, here we show an example of predicting nDCG@$k$ (Järvelin and Kekäläinen, 2002), formally:

(4)

nDCG@k=DCG@k/IDCG@k,

where $DCG@k$ can be computed easily using the generated relevance judgments for the top-$k$ items in the ranked list $L$, namely: [Footnote 2: Note that we consider the definition of $DCG@k$ for binary relevance labels.]

(5)

DCG@k=\hat{r}_{1}+\sum_{i=2}^{k}\hat{r}_{i}/\log_{2}i.

$IDCG@k$ is the $DCG@k$ of the ideal ranked list with $k$ items, which requires knowing all the relevant items in the corpus $C$. We approximate all relevant items in the corpus by considering the items that are predicted as relevant at the top-$n$ ranks in the ranked list $L$, and compute $IDCG@k$ based on that. First, we reorder the LLM-generated relevance judgments $\hat{\mathcal{R}}_{L_{1:n}}=[\hat{r}_{1},\dots,\hat{r}_{i},\dots,\hat{r}_{n}]$ for the ranked list $L$ into $\hat{\mathcal{R}}_{iL_{1:n}}=[\hat{ir}_{1},\dots,\hat{ir}_{i},\dots,\hat{ir}_{n}]$ in descending order of predicted relevance; then, we compute $IDCG@k$ based on $\hat{\mathcal{R}}_{iL_{1:n}}$, namely:

(6)

IDCG@k=\hat{ir}_{1}+\sum_{i=2}^{k}\hat{ir}_{i}/\log_{2}i.
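Eqs. (4) to (6) can be sketched in a few lines. The function names are ours, and returning 0 when no judged item is predicted relevant (so that the approximated IDCG@k is 0) is an assumption of this sketch; that case is not specified above.

```python
import math
from typing import Sequence


def dcg_at_k(judgments: Sequence[int], k: int) -> float:
    """DCG@k as in Eq. (5): r_1 + sum_{i=2..k} r_i / log2(i), binary labels."""
    total = 0.0
    for i, rel in enumerate(judgments[:k], start=1):
        total += rel if i == 1 else rel / math.log2(i)
    return total


def predicted_ndcg_at_k(judgments: Sequence[int], k: int) -> float:
    """nDCG@k where IDCG@k is approximated from the n judged items only.

    `judgments` holds the n generated labels in rank order.
    """
    ideal = sorted(judgments, reverse=True)  # Eq. (6): sort by predicted relevance
    idcg = dcg_at_k(ideal, k)
    if idcg == 0.0:
        return 0.0  # assumption: score 0 when nothing is predicted relevant
    return dcg_at_k(judgments, k) / idcg


labels = [0, 1, 0, 1]                       # judgments for ranks 1..4
print(predicted_ndcg_at_k(labels[:2], k=2))  # judging depth n=2
print(predicted_ndcg_at_k(labels, k=2))      # judging depth n=4
```

With depth $n=2$ only one relevant item is known, so the approximated ideal list is `[1, 0]` and the prediction is 1.0; with $n=4$ a second relevant item is found at rank 4, the ideal list becomes `[1, 1, 0, 0]`, and the prediction drops to 0.5. This illustrates why a measure normalized by an ideal ranking can need a judging depth larger than its cutoff.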

5. Experimental setup

5.1. Research questions

In this section, we study the following research questions:

RQ1: To what extent does QPP-GenRE improve QPP quality for lexical and neural rankers in terms of RR@10, a precision-oriented IR metric, compared to state-of-the-art baselines?
RQ2: To what extent does QPP-GenRE improve QPP quality for lexical and neural rankers in terms of nDCG@10, an IR metric considering recall, compared to state-of-the-art baselines?
RQ3: How does judging depth in a ranked list affect the prediction of nDCG@10, an IR metric that considers recall?
RQ4: To what extent do fine-tuning and the choice of LLMs affect the quality of generated relevance judgments and QPP?

5.2. Datasets

We experiment with four widely used IR datasets from the TREC 2019–2022 deep learning (TREC-DL) tracks (Craswell et al., 2022, 2021, 2020, 2019). These datasets provide multi-graded relevance judgments per query. TREC-DL 19, 20, 21 and 22 have 43, 54, 53 and 76 queries, respectively. TREC-DL 19/20 and TREC-DL 21/22 are based on the MS MARCO V1 and MS MARCO V2 passage ranking collections, respectively. The V1 corpus comprises 8.8 million passages, while the V2 corpus has over 138 million passages.

5.3. Retrieval approaches

We consider BM25 (Robertson et al., 2009) as a lexical ranker, and ANCE (Xiong et al., 2021) as a neural-based dense retriever. To increase the comparability and reproducibility of our paper, we obtain the retrieval results of both rankers using the publicly available resources from Pyserini (Lin et al., 2021). We obtain BM25’s retrieval results with the top-1000 retrieved items per query on the TREC-DL 19–22 datasets using the default parameters ($k_1=0.9$, $b=0.4$). BM25’s actual nDCG@10 values are 0.506, 0.480, 0.446 and 0.269 on TREC-DL 19, 20, 21 and 22, respectively. We obtain ANCE’s retrieval results with the top-1000 retrieved items per query on TREC-DL 19–20, using the publicly available dense vector index of ANCE on MS MARCO V1. ANCE’s actual nDCG@10 values are 0.645 and 0.646 on TREC-DL 19 and 20, respectively. We do not report ANCE results on TREC-DL 21 and 22 because, at the time of writing, there is no publicly available dense vector index of ANCE on MS MARCO V2. [Footnote 3: Building the dense vector index on MS MARCO V2 with over 138 million passages is resource-intensive and out of the scope of our work.]

5.4. QPP baselines

We consider three groups of baselines: unsupervised post-retrieval QPP methods, supervised post-retrieval QPP methods, and the LLM-based QPP methods. Specifically, we consider the following unsupervised QPP approaches that already showed high correlation with actual retrieval performance in previous work:

• Clarity (Cronen-Townsend et al., 2002) computes the KL divergence between language models (Lavrenko and Croft, 2001) induced from the top-$k$ items in a ranked list and the corpus.
• Weighted information gain (WIG) (Zhou and Croft, 2007) calculates the difference between retrieval scores of the top-$k$ items in a ranked list and the retrieval score of the entire corpus.
• Normalized query commitment (NQC) (Shtok et al., 2012) calculates the standard deviation of retrieval scores of the top-$k$ items in a ranked list to a query; the standard deviation is normalized by the retrieval score of the entire corpus to the query.
• $\sigma_{max}$ (Pérez-Iglesias and Araujo, 2010) computes the standard deviation of retrieval scores from the first item to each point in a ranked list and outputs the maximum standard deviation.
• n($\sigma_{x\%}$) (Cummins et al., 2011) calculates the standard deviation for each query by considering the items whose retrieval scores are at least $x\%$ of the top retrieval score in a ranked list.
• Score magnitude and variance (SMV) (Tao and Wu, 2014) considers both the magnitude of retrieval scores (WIG) and their variance (NQC).
• UEF(NQC) (Shtok et al., 2010) uses a pseudo-effective reference list to improve QPP quality; we follow (Arabzadeh et al., 2023; Datta et al., 2022c; Arabzadeh et al., 2021b) to use NQC as a base predictor.
• RLS(NQC) (Roitman, 2017) generates and selects both pseudo-effective and pseudo-ineffective reference lists; we use NQC as a base predictor because Roitman (2017) shows that RLS works better with NQC.
• QPP-PRP (Singh et al., 2023) measures the degree to which a pairwise neural re-ranker (DuoT5 (Pradeep et al., 2021)) agrees with the ranked list for the query.
• Dense-QPP (Arabzadeh et al., 2023) is robustness-based and designed for dense retrievers only: it injects noise into the neural representation of the given query, and then measures the similarity between the ranked lists for the original and the perturbed query representations. Because it is designed for neural-based retrievers, it cannot predict the ranking quality of BM25.

Since studies show that BERT-based post-retrieval supervised QPP methods (Hashemi et al., 2019; Arabzadeh et al., 2021b; Datta et al., 2022c; Chen et al., 2022) perform better than their neural-based counterparts, we only consider BERT-based supervised QPP approaches:

• NQA-QPP (Hashemi et al., 2019) is a regression-based method, which predicts a QPP score by using BERT representations for the query and query-item pairs, and the standard deviation of retrieval scores.
• BERTQPP (Arabzadeh et al., 2021b) is a regression-based method, which predicts a QPP score by using BERT representations for the query and the top-ranked item. We use the cross-encoder version of BERTQPP because of its promising results.
• qppBERT-PL (Datta et al., 2022c) first splits the ranked list into chunks, predicts the number of relevant items in each chunk, and calculates a weighted average of the number of relevant items in all chunks.
• M-QPPF (Khodabakhsh and Bagheri, 2023) is also regression-based and models QPP and document ranking jointly, by adopting a shared BERT layer to learn representations for query-document pairs, and using two layers to model QPP and document ranking, respectively.

Figure 2. Prompt used by QPP-LLM.

While to the best of our knowledge there is no LLM-based QPP method yet, to have a fair comparison with LLM-based approaches, we propose two LLM-based QPP baselines. Research on using LLMs for arithmetic tasks shows that LLaMA treats numbers as distinct tokens and can understand and generate numerical values (Liu and Low, 2023). Inspired by this, we prompt LLaMA-7B to directly generate a numerical score given a query and the ranked list with $k$ passages for the query; the prompt is shown in Figure 2. We consider two variants:

• QPP-LLM (few-shot) uses in-context learning (ICL) and inserts several demonstration examples after the instruction in the prompt; each example is composed of a query, $k$ passages and the actual performance in terms of an IR evaluation measure.
• QPP-LLM (fine-tuned) fine-tunes LLaMA-7B to directly generate numerical values of an IR metric, similar to the way other regression-based supervised QPP methods are trained.

5.5. QPP evaluation and target IR evaluation measures

We follow established best practices (Datta et al., 2022c; Zamani et al., 2018; Carmel and Yom-Tov, 2010; Hauff et al., 2008; Cronen-Townsend et al., 2002) to evaluate QPP by measuring linear correlation with Pearson’s $\rho$ and rank-based correlation with Kendall’s $\tau$ between the actual and predicted performance over a query set. [Footnote 4: We also consider scaled Mean Absolute Ranking Error (sMARE) (Faggioli et al., 2022, 2021b) and draw the same conclusions as with Pearson’s $\rho$ and Kendall’s $\tau$; we include the results in our repository at https://github.com/ChuanMeng/QPP-GenRE.] As for target IR metrics, we consider the two primary official IR metrics used in TREC DL 19–22 (Craswell et al., 2022, 2021, 2020, 2019), RR@10 (precision-oriented) and nDCG@10 (considering recall); recent QPP studies (Arabzadeh et al., 2023; Faggioli et al., 2023e; Khodabakhsh and Bagheri, 2023) consider either or both of these metrics as their target metrics. Following (Datta et al., 2022c), we treat relevance grades $\geq$ 2 as positive to compute actual binary IR measures (e.g., RR). When calculating correlation for nDCG@10, the actual values of nDCG@10 are calculated from human-labeled, multi-graded relevance judgments, while the nDCG@10 values predicted by QPP-GenRE are based on its generated binary judgments.
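The two correlation coefficients can be sketched in plain Python as below. This is an illustrative sketch, not the evaluation script used in the paper: the function names and the toy per-query values are ours, and the tie-corrected tau-b variant of Kendall’s $\tau$ is an assumption (the text does not say which variant is used). Both functions assume that neither input is constant; otherwise the denominator is zero.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between per-query actual and predicted values."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: pairwise rank agreement with a correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (a1, b1), (a2, b2) in combinations(zip(x, y), 2):
        dx, dy = a1 - a2, b1 - b2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_x = concordant + discordant + tied_y_only  # pairs not tied in x
    untied_y = concordant + discordant + tied_x_only  # pairs not tied in y
    return (concordant - discordant) / math.sqrt(untied_x * untied_y)


actual = [0.2, 0.5, 0.9, 0.4]     # toy per-query actual nDCG@10 values
predicted = [0.1, 0.4, 0.8, 0.6]  # toy per-query predicted values
print(round(pearson(actual, predicted), 3), round(kendall_tau_b(actual, predicted), 3))
```

Ties matter in this setting because measures computed from binary judgments over a short list take few distinct values, so many queries receive the same predicted score.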

5.6. Implementation details

For all unsupervised QPP baselines, we tune the hyper-parameters for predicting the ranking quality of a ranker (either BM25 or ANCE) on TREC-DL 19 (TREC-DL 21) based on Pearson’s $\rho$ correlation for predicting the ranking quality of the same ranker on TREC-DL 20 (TREC-DL 22), and vice versa. We select the cut-off value $k$ for Clarity, NQC, WIG, SMV and so on from $\{5,10,15,20,25,50,100,300,500,1000\}$. n($\sigma_{x\%}$) has a hyper-parameter $x$, which we choose from the set $\{0.25,0.4,0.5,0.6,0.75,0.9\}$.

To predict the performance of a certain ranker (either BM25 or ANCE), we train all supervised QPP baselines based on the ranked list returned by the target ranker. To predict a certain IR evaluation measure, regression-based methods (Khodabakhsh and Bagheri, 2023; Arabzadeh et al., 2021b; Hashemi et al., 2019) are trained to output the target evaluation measure. However, our preliminary results show that training supervised QPP baselines, especially regression-based supervised methods (Khodabakhsh and Bagheri, 2023; Arabzadeh et al., 2021b; Hashemi et al., 2019), on the training set of MS MARCO V1 leads to inferior QPP quality for predicting the performance of ANCE. We hypothesize that this is because ANCE was originally trained on the training set of MS MARCO V1 (Xiong et al., 2021), and so the ranked lists returned by ANCE on the training set of MS MARCO V1 would have higher quality than those returned by ANCE on the evaluation sets; therefore, supervised QPP methods that share the same training set as ANCE tend to predict inflated performance on the evaluation sets, leading to degraded QPP quality. To solve this issue and keep the setup consistent, we train all supervised QPP methods (including QPP-GenRE) on the development set of MS MARCO V1 (6,980 queries) for predicting the performance of either BM25 or ANCE. We train all supervised QPP methods for 5 epochs and pick the best checkpoint for predicting the performance of a ranker on TREC-DL 19 (TREC-DL 21) based on Pearson’s $\rho$ correlation for predicting the performance of the same ranker on TREC-DL 20 (TREC-DL 22), and vice versa. All supervised QPP baselines use bert-base-uncased [Footnote 5: https://github.com/huggingface/transformers], a constant learning rate (0.00002), and the Adam optimizer (Kingma and Ba, 2015).

For QPP-LLM, we prompt LLaMA-7B with the top-$k$ retrieved items, where $k$ is set to 10. For QPP-LLM (few-shot), we randomly sample demonstration examples from the development set of MS MARCO V1; our preliminary experiments show that sampling 2 demonstrations works best. For QPP-LLM (fine-tuned), we fine-tune LLaMA-7B with the same PEFT method that QPP-GenRE uses to fine-tune LLMs.

We equip QPP-GenRE with LLaMA-7B for judging relevance. We use a recent PEFT method, 4-bit QLoRA (Dettmers et al., 2023), to fine-tune LLaMA-7B. The training of judging relevance needs positive and negative items per query. For positive items, we use the items annotated as relevant in qrels per query; we randomly sample one negative item from the ranked list (1,000 items) returned by BM25 per query. We fine-tune LLaMA-7B for 5 epochs on the development set of MS MARCO V1, taking about an hour and a half on an NVIDIA A100 GPU (40GB).
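The construction of training pairs described above can be sketched as follows. The function name, the dictionary-based input format and the fixed seed are hypothetical choices made for this illustration; excluding annotated positives from the pool of negative candidates is also an assumption, since the text only states that one negative is sampled from BM25’s ranked list.

```python
import random
from typing import Dict, List, Set, Tuple


def build_training_pairs(
    qrels: Dict[str, Set[str]],
    bm25_runs: Dict[str, List[str]],
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Return (query_id, item_id, label) triples for fine-tuning.

    Per query: every item annotated as relevant (label 1) plus one item
    sampled at random from the BM25 ranked list (label 0).
    """
    rng = random.Random(seed)
    examples: List[Tuple[str, str, int]] = []
    for qid, positives in qrels.items():
        examples.extend((qid, pid, 1) for pid in sorted(positives))
        # Assumption: a negative is any retrieved item not annotated as relevant.
        candidates = [pid for pid in bm25_runs.get(qid, []) if pid not in positives]
        if candidates:
            examples.append((qid, rng.choice(candidates), 0))
    return examples


toy_qrels = {"q1": {"p2"}}
toy_runs = {"q1": ["p1", "p2", "p3"]}
print(build_training_pairs(toy_qrels, toy_runs))
```

Each triple would then be turned into a prompt (query and passage) with the label as the target output.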

Table 1. Correlation coefficients (Pearson’s $\rho$ and Kendall’s $\tau$) between actual retrieval quality, in terms of RR@10, of BM25 and performance predicted by QPP-GenRE/baselines, on TREC-DL 19–22. ^∗ indicates statistically significant correlation coefficients ($p$-value $<0.05$). ^† indicates the statistically significant improvement of QPP-GenRE compared to all the baselines (paired $t$-test; $p$-value $<0.001$ with Bonferroni correction for multiple testing). The best value in each column is marked in bold. $n$ denotes QPP-GenRE’s judgment depth in a ranked list.

QPP method	Ranker: BM25
	TREC-DL 19		TREC-DL 20		TREC-DL 21		TREC-DL 22
	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$
Clarity	0.135	0.028	0.050	0.021	0.183	0.161	0.253^∗	0.099
WIG	0.113	0.164	0.286^∗	0.218^∗	0.237	0.206^∗	0.029	0.082
NQC	0.194	0.117	0.152	0.191	0.227	0.195	0.223	0.048
$\sigma_{max}$	0.195	0.164	0.200	0.211^∗	0.278^∗	0.174	0.038	0.048
n( $\sigma_{x\%}$ )	0.144	0.181	0.187	0.123	0.127	0.140	0.169	0.113
SMV	0.141	0.097	0.126	0.193	0.240	0.189	0.227^∗	0.094
UEF(NQC)	0.235	0.256^∗	0.270^∗	0.211^∗	0.231	0.111	0.216	0.065
RLS(NQC)	0.272	0.122	0.290^∗	0.193	0.234	0.195	0.224	0.095
QPP-PRP	0.292	0.189	0.163	0.184	-0.080	-0.017	0.122	0.091
NQA-QPP	0.181	0.122	0.062	0.069	0.161	0.163	0.224	0.177^∗
BERTQPP	0.281	0.136	0.237	0.155	0.206	0.134	0.148	0.122
qppBERT-PL	0.145	0.138	0.166	0.152	0.339^∗	0.244^∗	0.131	0.206^∗
M-QPPF	0.317^∗	0.208	0.335^∗	0.273^∗	0.282^∗	0.209^∗	0.161	0.187^∗
QPP-LLM (few-shot)	0.008	0.003	-0.081	-0.129	-0.053	-0.053	-0.241	-0.155
QPP-LLM (fine-tuned)	0.171	0.158	0.228	0.206	0.030	0.099	-0.038	0.009
QPP-GenRE ( $n=10$ )	0.538^†^∗	0.486^†^∗	0.560^†^∗	0.475^†^∗	0.524^†^∗	0.435^†^∗	0.350^†^∗	0.262^†^∗

Table 2. Correlation coefficients (Pearson’s $\rho$ and Kendall’s $\tau$) between actual retrieval quality, in terms of RR@10, of ANCE and performance predicted by QPP-GenRE/baselines, on TREC-DL 19 and 20. ^∗ indicates statistically significant correlation coefficients ($p$-value $<0.05$). ^† indicates the statistically significant improvement of QPP-GenRE compared to all the baselines (paired $t$-test; $p$-value $<0.001$ with Bonferroni correction for multiple testing). The best value in each column is marked in bold. $n$ denotes QPP-GenRE’s judgment depth in a ranked list.

QPP method	Ranker: ANCE
	TREC-DL 19		TREC-DL 20
	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$
Clarity	-0.078	-0.012	-0.074	-0.048
WIG	0.313^∗	0.228	0.059	0.048
NQC	0.350^∗	0.200	0.145	0.112
$\sigma_{max}$	0.384^∗	0.287^∗	0.171	0.118
n( $\sigma_{x\%}$ )	0.200	0.176	-0.008	0.022
SMV	0.352^∗	0.256^∗	0.182	0.161
UEF(NQC)	0.340^∗	0.260^∗	0.131	0.108
RLS(NQC)	0.359^∗	0.273^∗	0.178	0.139
QPP-PRP	0.259	0.246	0.100	-0.008
Dense-QPP	0.452^∗	0.280^∗	0.209	0.139
NQA-QPP	-0.026	-0.009	-0.059	-0.080
BERTQPP	0.330^∗	0.214	0.046	-0.012
qppBERT-PL	0.092	0.025	-0.224	-0.218
M-QPPF	0.292	0.200	0.068	0.038
QPP-LLM (few-shot)	-0.008	0.005	-0.226	-0.207
QPP-LLM (fine-tuned)	-0.073	0.011	-0.022	0.069
QPP-GenRE ( $n=10$ )	0.567^†^∗	0.440^†^∗	0.293^†^∗	0.257^†^∗

6. Results

6.1. Predicting a precision-oriented IR measure

To answer RQ1, we compare QPP-GenRE and all baselines in predicting the performance of BM25 and ANCE w.r.t. a widely-used precision-oriented metric, RR@10; see Tables 1 and 2. We have three main observations.

First, our proposed method, QPP-GenRE, outperforms all baselines in terms of both correlation coefficients on all datasets when predicting the performance of both rankers. In particular, QPP-GenRE outperforms QPP-PRP (Singh et al., 2023), a recently proposed baseline, by 84% (0.538 vs. 0.292) in terms of Pearson’s $\rho$ when predicting RR@10 for BM25 on TREC-DL 19.

Second, QPP-LLM (few-shot) gets the worst results in most cases. While QPP-LLM (fine-tuned) performs slightly better than QPP-LLM (few-shot), its performance is still limited in most cases. This indicates that directly prompting an LLM to predict a score is an ineffective way to model QPP.

Third, there is no clear winner among the baselines, and the performance of baselines shows a bigger variance than QPP-GenRE across different datasets and rankers. E.g., the unsupervised method WIG achieves a good result among baselines for assessing BM25 on TREC-DL 20, while it gets nearly zero correlation coefficients on TREC-DL 22 when assessing BM25. Conversely, QPP-GenRE consistently achieves the best performance across datasets and rankers, thus showing robust performance.

Table 3. Correlation coefficients (Pearson’s $\rho$ and Kendall’s $\tau$) between actual retrieval quality, in terms of nDCG@10, of BM25 and performance predicted by QPP-GenRE/baselines, on TREC-DL 19–22. $n$ denotes QPP-GenRE’s judgment depth in a ranked list. ^∗ indicates statistically significant correlation coefficients ($p$-value $<0.05$). ^† indicates the statistically significant improvement of QPP-GenRE ($n=200$) compared to all the baselines (paired $t$-test; $p$-value $<0.001$ with Bonferroni correction for multiple testing). The best value in each column is marked in bold.

QPP method	TREC-DL 19		TREC-DL 20		TREC-DL 21		TREC-DL 22
QPP method	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$
Clarity	0.091	0.056	0.358^∗	0.250^∗	0.137	0.078	0.202	0.090
WIG	0.520^∗	0.331^∗	0.615^∗	0.423^∗	0.311^∗	0.281^∗	0.350^∗	0.249^∗
NQC	0.468^∗	0.300^∗	0.508^∗	0.401^∗	0.134	0.221^∗	0.360^∗	0.156^∗
$\sigma_{max}$	0.478^∗	0.327^∗	0.529^∗	0.440^∗	0.298^∗	0.258^∗	0.142^∗	0.196^∗
n( $\sigma_{x\%}$ )	0.532^∗	0.311^∗	0.622^∗	0.443^∗	0.328^∗	0.234^∗	0.336^∗	0.228^∗
SMV	0.376^∗	0.271^∗	0.463^∗	0.383^∗	0.327^∗	0.236^∗	0.338^∗	0.155^∗
UEF(NQC)	0.499^∗	0.322^∗	0.517^∗	0.356^∗	0.153	0.232^∗	0.311^∗	0.145
RLS(NQC)	0.469^∗	0.169	0.522^∗	0.376^∗	0.272^∗	0.223^∗	0.337^∗	0.157^∗
QPP-PRP	0.321	0.181	0.189	0.157	0.027	0.004	0.077	0.012
NQA-QPP	0.210	0.147	0.244	0.210^∗	0.286^∗	0.201^∗	0.312^∗	0.194^∗
BERTQPP	0.458^∗	0.207	0.426^∗	0.300^∗	0.351^∗	0.223^∗	0.369^∗	0.229^∗
qppBERT-PL	0.171	0.175	0.410^∗	0.279^∗	0.277^∗	0.182	0.300^∗	0.242^∗
M-QPPF	0.404^∗	0.254^∗	0.435^∗	0.297^∗	0.265	0.226^∗	0.345^∗	0.204^∗
QPP-LLM (few-shot)	-0.024	-0.031	0.167	0.138	0.238	0.201	-0.073	-0.077
QPP-LLM (fine-tuned)	0.313^∗	0.215	0.309^∗	0.254^∗	0.264	0.198	-0.075	-0.009
QPP-GenRE ( $n=200$ )	0.724^†^∗	0.474^†^∗	0.638^†^∗	0.469^†^∗	0.546^†^∗	0.435^†^∗	0.388^∗	0.251^∗
QPP-GenRE ( $n=10$ )	0.605^∗	0.482^∗	0.490^∗	0.323^∗	0.462^∗	0.350^∗	0.316^∗	0.245^∗
QPP-GenRE ( $n=100$ )	0.712^∗	0.472^∗	0.609^∗	0.457^∗	0.545^∗	0.427^∗	0.332^∗	0.246^∗
QPP-GenRE ( $n=1,000$ )	0.715^∗	0.477^∗	0.627^∗	0.459^∗	0.547^∗	0.436^∗	0.388^∗	0.251^∗

Table 4. Correlation coefficients (Pearson’s $\rho$ and Kendall’s $\tau$) between actual retrieval quality, in terms of nDCG@10, of ANCE and performance predicted by QPP-GenRE/baselines, on TREC-DL 19 and 20. $n$ denotes QPP-GenRE’s judgment depth in a ranked list. ^∗ indicates statistically significant correlation coefficients ($p$-value $<0.05$). ^† indicates the statistically significant improvement of QPP-GenRE ($n=200$) compared to all the baselines (paired $t$-test; $p$-value $<0.001$ with Bonferroni correction for multiple testing). The best value in each column is marked in bold.

QPP method	TREC-DL 19		TREC-DL 20
QPP method	P- $\rho$	K- $\tau$	P- $\rho$	K- $\tau$
Clarity	-0.088	-0.062	-0.091	-0.045
WIG	0.515^∗	0.368^∗	0.218	0.150
NQC	0.548^∗	0.372^∗	0.411^∗	0.290^∗
$\sigma_{max}$	0.455^∗	0.339^∗	0.403^∗	0.288^∗
n( $\sigma_{x\%}$ )	0.388^∗	0.315^∗	0.103	0.075
SMV	0.496^∗	0.359	0.380^∗	0.283^∗
UEF(NQC)	0.548^∗	0.372^∗	0.413^∗	0.290^∗
RLS(NQC)	0.466^∗	0.346^∗	0.333^∗	0.271^∗
QPP-PRP	0.129	0.049	0.216	0.121
Dense-QPP	0.565^∗	0.389^∗	0.419^∗	0.318^∗
NQA-QPP	0.089	-0.038	0.186	0.113
BERTQPP	0.222	0.117	0.137	0.089
qppBERT-PL	0.116	0.098	-0.119	-0.046
M-QPPF	0.287	0.160	0.225	0.177
QPP-LLM (few-shot)	0.136	0.120	-0.130	-0.094
QPP-LLM (fine-tuned)	0.203	0.117	0.081	0.097
QPP-GenRE ( $n=200$ )	0.712^†^∗	0.483^†^∗	0.457^†^∗	0.343^†^∗
QPP-GenRE ( $n=10$ )	0.624^∗	0.406^∗	0.306^∗	0.238^∗
QPP-GenRE ( $n=100$ )	0.719^∗	0.489^∗	0.456^∗	0.355^∗
QPP-GenRE ( $n=1,000$ )	0.719^∗	0.492^∗	0.447^∗	0.321^∗

6.2. Predicting an IR measure considering recall

To answer RQ2, Tables 3 and 4 list the performance of QPP-GenRE along with all the baselines on assessing BM25 and ANCE, respectively, in terms of nDCG@10. For QPP-GenRE, we universally set the judging depth $n$ to 200 for all evaluation sets. The results reveal that by judging only 200 items per query, we can achieve state-of-the-art QPP quality in terms of nDCG@10 for all rankers on all evaluation sets; we investigate the impact of judging depth on QPP-GenRE’s performance in the next section. Also, QPP-LLM (few-shot) and QPP-LLM (fine-tuned) are among the worst-performing baselines, showing that LLMs struggle to generate numerical scores. Compared with the results for RQ1, most QPP methods perform better when predicting nDCG@10 than RR@10, which suggests that predicting RR@10 is the more challenging task.

7. Analysis

[Figure panel (a): BM25 on TREC-DL 19]

7.1. Judging depth analysis

To answer RQ3, recall from Section 4.3.2 that, for predicting IDCG, we devise an approximation strategy that uses the items in the top $n$ ranks of the ranked list $L$ that are predicted as relevant by QPP-GenRE to approximate all the relevant items for a query in the corpus. To investigate the impact of $n$ on prediction quality, we study the relationship between the QPP quality of predicting nDCG@10 and the judging depth, and ask: what judging depth $n$ do we need to get satisfactory performance for predicting nDCG@10? In Figure 3, we plot the correlation coefficients between actual nDCG@10 values and nDCG@10 values predicted by QPP-GenRE for different judging depths in {10, 25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000} on TREC-DL 19 and 20. We also show exact QPP results with depths of 10, 100 and 1,000 in Tables 3 and 4.

Tables 3 and 4 reveal that, by judging only 10 items in the ranked list of BM25 and ANCE, we can already outperform all the baselines and achieve state-of-the-art QPP quality on half of the evaluation sets we used, namely when assessing BM25 on TREC-DL 19 and 21, and when assessing ANCE on TREC-DL 19. While judging deeper in the ranked list is essential for predicting recall-oriented measures, satisfactory QPP quality is still attainable with a relatively shallow depth. Moreover, Figure 3 illustrates that judging the top 200 items in a ranked list already reaches the saturation point for assessing BM25, i.e., there is no significant improvement from judging more items, while judging fewer than 100 top items reaches the saturation point for ANCE. We speculate that this is because ANCE has better retrieval quality than BM25, so more relevant items appear earlier in the ranked list of ANCE than in that of BM25; therefore, a shallower judging depth suffices to approximate all relevant items in the corpus. This emphasizes the need to consider retrieval quality when determining the optimal judging depth for various rankers.

Table 5. Relevance judgment agreement (Cohen’s $\kappa$) between TREC assessors and each LLM, and Pearson’s $\rho$ correlation coefficients between BM25’s actual nDCG@10 values and those predicted by QPP-GenRE integrated with each LLM on TREC-DL 19–22. The best value in each column is marked in bold.

LLM	TREC-DL 19		TREC-DL 20		TREC-DL 21		TREC-DL 22
LLM	$\kappa$	P- $\rho$	$\kappa$	P- $\rho$	$\kappa$	P- $\rho$	$\kappa$	P- $\rho$
GPT-3.5 (text-davinci-003) (Faggioli et al., 2023a)	-	-	-	-	0.260	-	-	-
LLaMA-7B (few-shot)	-0.001	-0.062	-0.003	0.087	0.003	-0.002	-0.010	0.214
Llama-3-8B (few-shot)	0.018	0.042	0.027	0.087	0.021	0.180	-0.035	0.087
Llama-3-8B-Instruct (few-shot)	0.315	0.510	0.227	0.372	0.238	0.462	0.049	0.388
LLaMA-7B (fine-tuned)	0.258	0.715	0.238	0.627	0.333	0.547	0.038	0.388
Llama-3-8B (fine-tuned)	0.381	0.544	0.342	0.681	0.347	0.612	0.082	0.568
Llama-3-8B-Instruct (fine-tuned)	0.397	0.647	0.316	0.743	0.418	0.699	0.066	0.573

7.2. Impact of fine-tuning and the choice of LLMs

To answer RQ4, we analyze the impact of fine-tuning and the choice of LLMs on the quality of generated relevance judgments and QPP. We consider LLaMA-7B (Touvron et al., 2023a), Llama-3-8B (AI@Meta, 2024), and Llama-3-8B-Instruct (AI@Meta, 2024) under two settings: (i) trained with PEFT on human relevance labels (following the same fine-tuning setup as in Section 5.6), and (ii) few-shot prompted (in-context learning). [Footnote 6: We randomly sample human-labeled demonstration examples from the same set used for fine-tuning LLMs; each example is a triplet (query, passage, relevant/irrelevant); our experiments show that two examples work best, one with a relevant passage and one with an irrelevant passage.] We do not report the results for a zero-shot setting because our preliminary experiments show that prompting these LLMs in a zero-shot way yields poor performance.

To evaluate the performance of judging relevance, we compute Cohen’s $\kappa$ metric to measure the agreement between relevance judgments made by the TREC assessors (i.e., relevance judgments in the qrels) and relevance judgments automatically generated by a fine-tuned or few-shot LLM, on TREC-DL 19–22. Faggioli et al. (2023a) reported the relevance judgment agreement in terms of Cohen’s $\kappa$ between TREC assessors and GPT-3.5 (text-davinci-003) on TREC-DL 21; we also consider their Cohen’s $\kappa$ value for comparison. To evaluate QPP quality, we compute the Pearson’s $\rho$ correlation coefficients between BM25’s actual nDCG@10 values and those predicted by QPP-GenRE using relevance judgments generated by an LLM, on TREC-DL 19–22. [Footnote 7: We do not report the Pearson’s $\rho$ correlation for GPT-3.5 (text-davinci-003) because the relevance judgments generated by Faggioli et al. (2023a) are not available to us.] The judging depth is set to 1,000 in a ranked list. We report the results in Table 5.
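Cohen’s $\kappa$ compares the observed agreement between two sets of labels with the agreement expected by chance. The sketch below assumes that both label lists are already binary and aligned item by item; the function name, the toy labels, and the choice to return 0 when chance agreement is already perfect are ours.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two aligned label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the two annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if both labelled at random with their own label rates
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 0.0  # assumption: both lists use one identical label; kappa undefined
    return (observed - expected) / (1.0 - expected)


human = [1, 1, 0, 0, 1, 0]  # toy assessor labels
llm = [1, 0, 0, 0, 1, 1]    # toy generated labels
print(round(cohens_kappa(human, llm), 3))
```

In the toy example the two lists agree on 4 of 6 items (observed agreement of about 0.667) while chance agreement is 0.5, which gives $\kappa\approx 0.333$. Under the commonly used Landis and Koch bands, values from 0.21 to 0.40 are read as “fair” and values from 0.41 to 0.60 as “moderate” agreement.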

We have two observations. First, fine-tuning improves the quality of relevance judgment generation and QPP for all LLMs on all datasets. Specifically, all fine-tuned LLMs exhibit improved relevance judgment agreement with the TREC assessors on TREC-DL 19–22. After fine-tuning, LLaMA-7B and Llama-3-8B achieve “fair” agreement with the TREC assessors on TREC-DL 19, 20 and 21, and Llama-3-8B-Instruct (fine-tuned) even achieves “moderate” agreement on TREC-DL 21 (a Cohen’s $\kappa$ value of 0.418). [Footnote 8: Note that unlike the qrels files for TREC-DL 19, 20, and 21 which are fully manually annotated, the qrels file for TREC-DL 22 is constructed by first detecting near-duplicate items and manually judging only one representative item from each near-duplicate cluster for a given query (Craswell et al., 2021); this difference may result in variation in Cohen’s $\kappa$ values of LLMs across TREC-DL 19, 20, 21, and TREC-DL 22.] On TREC-DL 21, all fine-tuned LLMs exhibit a higher Cohen’s $\kappa$ value than the commercial LLM, GPT-3.5 (text-davinci-003). All fine-tuned LLMs surpass their corresponding few-shot counterparts on all datasets in terms of Pearson’s $\rho$. This reveals that fine-tuning is an effective way to improve the quality of LLMs in generating relevance judgments, which in turn translates into better QPP quality.

Second, newly released and instruction-tuned LLMs generally perform better. Llama-3-8B shows improved relevance judgment agreement and QPP quality compared to LLaMA-7B in most cases. Llama-3-8B-Instruct further improves relevance judgment generation and QPP quality over both Llama-3-8B and LLaMA-7B in most cases. Notably, Llama-3-8B-Instruct (few-shot) even performs as well as or better than LLaMA-7B (fine-tuned) on TREC-DL 19 and 22. This finding implies that, with a more effective LLM, QPP-GenRE has the potential to achieve improved QPP performance.

7.3. Integrating QPP-GenRE with an LLM-based re-ranker

To show QPP-GenRE’s compatibility with relevance prediction methods other than directly asking an LLM to generate explicit relevance judgments, we adapt a state-of-the-art pointwise LLM-based re-ranker, RankLLaMA (Ma et al., 2023a), into a relevance judgment generator, and then integrate QPP-GenRE with the adapted RankLLaMA. Specifically, we translate a re-ranking score into a relevance judgment by applying a threshold: an item is deemed “relevant” if its re-ranking score meets or exceeds a given threshold value. We analyze Pearson’s $\rho$ and Kendall’s $\tau$ correlation coefficients between BM25’s actual nDCG@10 values and those predicted by QPP-GenRE integrated with RankLLaMA w.r.t. different threshold values on TREC-DL 19 and 20. We employ RankLLaMA (7B) from Tevatron.⁹ RankLLaMA’s re-ranking scores for BM25 range from -12.93 to 89.90 for TREC-DL 19 and from -14.38 to 8.82 for TREC-DL 20. Thresholds are set at intervals of 0.5. The judging depth is set to 1,000 in a ranked list.

⁹ https://github.com/texttron/tevatron/tree/main/examples/rankllama
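The thresholding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: binary gains, an ideal ranking built only from the judged items, and hypothetical function names and toy scores. It is not the authors' implementation and does not call RankLLaMA.

```python
from math import log2


def scores_to_judgments(scores, threshold):
    """Label an item relevant (1) if its re-ranking score meets or exceeds the threshold."""
    return [1 if score >= threshold else 0 for score in scores]


def ndcg_at_k(judgments, k=10):
    """nDCG@k with binary gains.

    Assumption: the ideal ranking is approximated from the judged items only,
    i.e., the items in `judgments`, not from the whole corpus.
    """
    dcg = sum(rel / log2(rank + 1)
              for rank, rel in enumerate(judgments[:k], start=1))
    ideal = sorted(judgments, reverse=True)[:k]
    idcg = sum(rel / log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical re-ranking scores for the top five items of one ranked list,
# listed in the order produced by the first-stage ranker.
reranker_scores = [3.2, 0.4, 1.0, -2.5, 1.7]
judgments = scores_to_judgments(reranker_scores, threshold=1.0)
predicted_ndcg = ndcg_at_k(judgments, k=10)
```

Sweeping `threshold` over a grid (e.g., in steps of 0.5) and correlating the resulting per-query predictions with the actual values would reproduce the kind of analysis described here.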

We report the results in Figure 4. We find that RankLLaMA achieves the highest QPP quality on both datasets when the threshold is 1. At this threshold, RankLLaMA achieves high Pearson’s $\rho$ values of 0.789 and 0.788 on TREC-DL 19 and 20, respectively. These values exceed those of fine-tuned LLaMA-7B, which achieves Pearson’s $\rho$ values of 0.715 and 0.627 on TREC-DL 19 and 20, respectively, as well as those of Llama-3-8B-Instruct, which achieves Pearson’s $\rho$ values of 0.647 and 0.743 on TREC-DL 19 and 20, respectively (see Table 5).¹⁰ This means that a state-of-the-art pointwise LLM-based re-ranker can be adapted into an effective relevance judgment generator. The high QPP quality achieved by RankLLaMA demonstrates QPP-GenRE’s compatibility with other types of relevance prediction methods besides directly using LLMs as relevance judgment generators (i.e., asking an LLM to generate explicit relevance judgments).

¹⁰ Note that the comparison is not fair because (i) LLaMA-7B, Llama-3-8B-Instruct and all other supervised QPP methods used in this paper are trained on the development set of MS MARCO V1, while RankLLaMA (Ma et al., 2023a) was trained on the training set of MS MARCO V1, which is much larger; and (ii) we employ the official version of MS MARCO V1, while RankLLaMA (Ma et al., 2023a) uses the Tevatron version of MS MARCO V1, where passages are enriched with document titles; Lassance and Clinchant (2023) reveal that incorporating titles leads to enhanced ranking performance.

However, compared to directly regarding an LLM as a relevance judgment generator, adapting an LLM-based re-ranker into a relevance judgment generator requires tuning an appropriate threshold. As demonstrated, re-ranking scores are not normalized and their ranges vary across datasets. Directly using a re-ranker as a relevance judgment generator can cause issues in real-world scenarios. Extra calibration work for re-ranking scores might be necessary.

7.4. QPP-GenRE’s interpretability

As QPP-GenRE computes QPP based on generated relevance judgments, we analyze QPP errors from the perspective of relevance judgment generation. Figure 5 shows the QPP errors of QPP-GenRE integrated with LLaMA-7B in predicting the performance of BM25 and ANCE in terms of RR@10 on TREC-DL 19 and 20; the error is defined as the signed difference between the RR@10 values predicted by QPP-GenRE and the actual RR@10 values, namely “predicted RR@10 minus actual RR@10.” We find that most RR@10 values predicted by QPP-GenRE tend to be smaller than the actual RR@10 values, indicating that QPP-GenRE is less effective at identifying relevant items than irrelevant ones at the top of the ranked list. Table 6 shows the confusion matrices that compare relevance judgments made by TREC assessors (i.e., relevance judgments in the qrels) and by QPP-GenRE integrated with LLaMA-7B on TREC-DL 19 and 20. We find that QPP-GenRE tends to wrongly predict some relevant items as irrelevant (false negatives), which provides a further interpretation of the QPP errors we found above. Therefore, reducing false negatives in generating relevance judgments is a potential way to improve the QPP quality of QPP-GenRE. We leave this exploration for future work.
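To make the error definition concrete, the following minimal sketch computes RR@10 from a list of binary judgments and the signed error “predicted RR@10 minus actual RR@10.” The function names and the toy ranked list are hypothetical; the example only illustrates why a false negative near the top of the list yields a negative error.

```python
def rr_at_k(judgments, k=10):
    """Reciprocal rank of the first relevant item within the top k (0.0 if none)."""
    for rank, rel in enumerate(judgments[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def rr_error(predicted_judgments, actual_judgments, k=10):
    """Signed QPP error: predicted RR@k minus actual RR@k."""
    return rr_at_k(predicted_judgments, k) - rr_at_k(actual_judgments, k)


# Hypothetical ranked list of five items for one query.
actual = [1, 0, 1, 0, 0]      # assessor judgments: first relevant item at rank 1
predicted = [0, 0, 1, 0, 0]   # generated judgments: rank 1 is a false negative
error = rr_error(predicted, actual)  # 1/3 - 1 = -2/3, i.e., an underestimate
```

Because each predicted value is derived from item-level judgments, an error such as this one can be traced back to the specific item that was misjudged.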

Table 6. Confusion matrices comparing relevance judgments made by TREC assessors and QPP-GenRE integrated with LLaMA-7B on TREC-DL 19 and 20.

QPP-GenRE	TREC-DL 19 assessors		TREC-DL 20 assessors	
	Relevant	Irrelevant	Relevant	Irrelevant
Relevant	752	553	486	763
Irrelevant	1,749	6,206	1,180	8,957
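As a small worked example, the counts in Table 6 can be turned into precision, recall, and false negative rate for the “relevant” class. The dictionary layout and function name below are illustrative assumptions; the counts are copied from Table 6.

```python
# Counts from Table 6; rows are QPP-GenRE judgments, columns are assessor judgments.
# tp: both relevant; fp: QPP-GenRE relevant, assessor irrelevant;
# fn: QPP-GenRE irrelevant, assessor relevant; tn: both irrelevant.
TABLE_6 = {
    "TREC-DL 19": {"tp": 752, "fp": 553, "fn": 1749, "tn": 6206},
    "TREC-DL 20": {"tp": 486, "fp": 763, "fn": 1180, "tn": 8957},
}


def relevant_class_rates(counts):
    """Precision, recall, and false negative rate for the 'relevant' class."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
    }


rates = {dataset: relevant_class_rates(counts)
         for dataset, counts in TABLE_6.items()}

for dataset, values in rates.items():
    print(dataset, {name: round(value, 3) for name, value in values.items()})
```

On both datasets, false negatives outnumber false positives and recall on relevant items is roughly 0.3, which is consistent with the tendency to underestimate RR@10 discussed above.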

7.5. Computational cost analysis

Table 7. Inference efficiency of supervised QPP baselines and QPP-GenRE integrated with LLaMA-7B on TREC-DL 19 to predict 1–4 different IR metrics. $n$ denotes QPP-GenRE’s judgment depth in a ranked list. Cases with higher latency than QPP-GenRE ($n=10$) are underlined.

QPP method	Inference latency per query (ms), by number of predicted IR metrics			
	1	2	3	4
NQA-QPP	118.40	236.80	355.20	473.60
BERTQPP	30.29	60.58	90.87	121.16
qppBERT-PL	316.80	316.80	316.80	316.80
M-QPPF	289.27	578.54	867.81	1,157.08
QPP-GenRE ( $n=10$ )	452.60	452.60	452.60	452.60
QPP-GenRE ( $n=100$ )	1,566.25	1,566.25	1,566.25	1,566.25
QPP-GenRE ( $n=200$ )	2,845.43	2,845.43	2,845.43	2,845.43

Table 7 shows the online QPP latency of QPP-GenRE integrated with LLaMA-7B and of BERT-based supervised QPP baselines, on TREC-DL 19, on a single NVIDIA A100 GPU. We compute the inference latency when queries are processed individually. For QPP-GenRE, we consider judging depths of 10, 100, and 200; QPP-GenRE can use batch acceleration when judging items for the same query because the items in a ranked list are judged independently of each other.¹¹ Although QPP-GenRE is more expensive than all baselines when predicting one measure, due to the much larger parameter size of LLaMA-7B compared to BERT, QPP-GenRE has lower latency than some baselines when predicting multiple IR evaluation measures, because multiple measures can be derived from the same set of relevance judgments at no additional cost. In contrast, regression-based QPP baselines (NQA-QPP, BERTQPP and M-QPPF) need to train and run a separate model for each IR evaluation metric. E.g., while QPP-GenRE is 56% more expensive than M-QPPF for predicting one measure, it becomes more efficient than M-QPPF when predicting two or more metrics. Nevertheless, we acknowledge that QPP-GenRE has higher computational costs than supervised QPP methods when predicting a single measure. Although qppBERT-PL is not optimized to output one specific IR evaluation measure, it does not achieve promising QPP quality (see Tables 2 and 4).

¹¹ qppBERT-PL first splits a ranked list with 100 items into 25 chunks and then predicts the number of relevant items in each chunk. For a fair comparison, we put the 25 chunks into one batch for acceleration.
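The latency argument can be checked with a short calculation over the single-metric latencies in Table 7. The sketch below assumes, as the table suggests, that a regression-based baseline's latency grows linearly with the number of metrics while QPP-GenRE's latency at $n=10$ stays constant. The variable and function names are illustrative.

```python
# Single-metric inference latency per query (ms), taken from Table 7.
SINGLE_METRIC_LATENCY_MS = {
    "NQA-QPP": 118.40,
    "BERTQPP": 30.29,
    "M-QPPF": 289.27,
}
QPP_GENRE_N10_MS = 452.60  # constant, regardless of how many metrics are predicted


def regression_latency(method, num_metrics):
    """Latency of a regression-based baseline that runs one model per metric."""
    return SINGLE_METRIC_LATENCY_MS[method] * num_metrics


def break_even_metrics(method):
    """Smallest number of metrics at which the baseline is slower than QPP-GenRE (n=10)."""
    num_metrics = 1
    while regression_latency(method, num_metrics) <= QPP_GENRE_N10_MS:
        num_metrics += 1
    return num_metrics


# Relative overhead of QPP-GenRE over M-QPPF for a single metric, in percent.
overhead_vs_mqppf = (QPP_GENRE_N10_MS / SINGLE_METRIC_LATENCY_MS["M-QPPF"] - 1) * 100

break_even = {method: break_even_metrics(method)
              for method in SINGLE_METRIC_LATENCY_MS}
```

Under this linear model, M-QPPF becomes slower than QPP-GenRE ($n=10$) from two metrics onward and NQA-QPP from four metrics onward, which matches the entries of Table 7 that exceed 452.60 ms.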

We argue that QPP-GenRE’s latency is still much lower than that of some high-performing LLM-based re-rankers. E.g., Sun et al. (2023b) show that a GPT-4-based listwise re-ranker needs 10 API calls (one call takes 3,200ms) to re-rank 100 items for a query, resulting in 32,000ms in total, which is around 20 times higher than QPP-GenRE’s latency with a judging depth of 100. QPP-GenRE fits knowledge-intensive professional search scenarios where QPP quality is prioritized or users may have a higher tolerance for latency, such as patent search (Lupu et al., 2013) and legal search (Tomlinson et al., 2007). Besides being used online, QPP can also be used to analyze a search system’s performance in an offline setting (Faggioli et al., 2023d).

8. Conclusions & Future Work

We have proposed a new QPP framework, QPP-GenRE, which models QPP from the perspective of predicting IR evaluation measures based on automatically generated relevance judgments. We have devised an approximation strategy for predicting an IR evaluation measure considering recall, which only judges a limited number of items in a given ranked list for a query, to avoid the cost of traversing the entire corpus to find all relevant items; the approximation strategy also enables us to study the impact of various judging depths on QPP quality. We have explored using open-source LLMs for generating relevance judgments, to ensure scientific reproducibility. In addition, we have examined training open-source LLMs with parameter-efficient fine-tuning (PEFT) on human-labeled relevance judgments, to improve the quality of relevance judgment generation and QPP.

Main findings

Experiments on datasets from the TREC-DL 19–22 tracks demonstrate that QPP-GenRE significantly surpasses existing QPP approaches, achieving state-of-the-art QPP quality in assessing lexical and neural rankers for either a precision-oriented IR metric or an IR metric considering recall. Moreover, we have shown that QPP-GenRE has the potential to conduct QPP more accurately when integrated with a more effective LLM, is compatible with other types of relevance prediction methods (e.g., an LLM-based re-ranker), and exhibits good interpretability.

Broader implications

QPP-GenRE has the potential to facilitate the practical use of QPP. The limited accuracy and interpretability of current QPP methods make them difficult to use in practical applications (Arabzadeh et al., 2024). In contrast, QPP-GenRE demonstrates significantly improved QPP accuracy and better interpretability, which enhances the reliability of QPP results. In particular, QPP-GenRE has the potential to benefit knowledge-intensive professional search scenarios, e.g., legal (Tomlinson et al., 2007) or patent search (Lupu et al., 2013). In such scenarios, accurate QPP is prioritized, interpretable QPP results are needed, and users may have a higher tolerance for latency. QPP-GenRE also has the potential for practical application in commercial search engines: commercial search engines receive many frequent and repeated queries, and QPP-GenRE can improve QPP efficiency by reusing stored relevance judgments for repeated query-item pairs and only generating relevance judgments for new query-item pairs. Moreover, QPP-GenRE can be used to analyze the ranking quality of a search system in a purely offline setting (Faggioli et al., 2023d), where latency is not necessarily an issue.
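The idea of reusing stored relevance judgments for repeated query-item pairs can be sketched with a simple in-memory cache. Everything below is a hypothetical illustration: `CachedJudge` is not part of QPP-GenRE, and `keyword_judge` is a trivial stand-in for a call to an LLM-based relevance judgment generator.

```python
class CachedJudge:
    """Wraps a relevance judge and stores judgments per (query, item id) pair."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn   # callable: (query, item_text) -> 0 or 1
        self._cache = {}            # (query, item_id) -> stored judgment
        self.judge_calls = 0        # number of times the underlying judge ran

    def judge(self, query, item_id, item_text):
        key = (query, item_id)
        if key not in self._cache:
            # Only new query-item pairs trigger the (expensive) judge.
            self._cache[key] = self._judge_fn(query, item_text)
            self.judge_calls += 1
        return self._cache[key]


def keyword_judge(query, item_text):
    """Placeholder judge (assumption): relevant if any query term occurs in the item."""
    terms = query.lower().split()
    return 1 if any(term in item_text.lower() for term in terms) else 0


cached = CachedJudge(keyword_judge)
first = cached.judge("patent search", "d1", "A survey of patent retrieval.")
second = cached.judge("patent search", "d1", "A survey of patent retrieval.")
third = cached.judge("patent search", "d2", "Cooking with cast iron.")
```

In this toy run the repeated pair is served from the cache, so the underlying judge is invoked only twice for three requests. A production setting would additionally need a persistent store and a policy for invalidating judgments when items change.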

Limitations and future work

First, we only consider predicting the ranking quality of widely used lexical and dense retrievers, and have not investigated QPP-GenRE’s bias towards LLM-based rankers (Ma et al., 2023a). Given that QPP-GenRE is based on LLM-based relevance predictors, it would be particularly interesting to explore QPP-GenRE’s potential biases when it predicts the ranking quality of LLM-based rankers.

Second, QPP-GenRE is a QPP framework that can be integrated with various relevance prediction approaches. We show the success of QPP-GenRE equipped with LLaMA-7B, Llama-3-8B, and Llama-3-8B-Instruct, as well as a state-of-the-art pointwise LLM-based re-ranker, RankLLaMA (Ma et al., 2023a). Exploring various LLMs to find the optimal one for relevance prediction is beyond the scope of our work. However, we believe it would be valuable in future work to investigate QPP-GenRE’s performance when integrated with other open-source LLMs (e.g., Mistral (Jiang et al., 2023)) as relevance judgment generators. It would also be interesting to adapt pairwise or listwise LLM-based re-rankers into relevance judgment generators and integrate QPP-GenRE with them.

Third, we only show QPP-GenRE’s high effectiveness in predicting the two primary metrics (RR@10 and nDCG@10) used at TREC-DL 19–22 (Craswell et al., 2022, 2021, 2020, 2019). It is worthwhile to consider other metrics at various cutoffs in future work, e.g., nDCG@20 and MAP@100.

Fourth, while QPP-GenRE exhibits promising QPP quality and can be used in scenarios where QPP quality is prioritized and users have a higher tolerance for latency, e.g., patent search or post-hoc analysis, it is worth improving QPP-GenRE’s efficiency in the future to widen its scope of applications. We plan to investigate (i) the use of multiple GPUs, because the items in a ranked list can be judged independently of each other, (ii) distilling knowledge from LLMs into smaller language models (Gu et al., 2023), and (iii) compressing LLMs by using lower-bit (e.g., 2-bit) quantization (Chee et al., 2023) or low-rank factorization (Xu et al., 2023).

Acknowledgements.

This research was partially supported by the China Scholarship Council (CSC) under grant number 202106220041, the Hybrid Intelligence Center, a 10-year program funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, https://hybrid-intelligence-centre.nl, project LESSEN with project number NWA.1389.20.183 of the research program NWA ORC 2020/21, which is (partly) financed by the Dutch Research Council (NWO), project ROBUST with project number KICH3.LTP.20.006, which is (partly) financed by the Dutch Research Council (NWO), DPG Media, RTL, and the Dutch Ministry of Economic Affairs and Climate Policy (EZK) under the program LTP KIC 2020-2023, and the FINDHR (Fairness and Intersectional Non-Discrimination in Human Recommendation) project that received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070212. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

Abbasiantaeb et al. (2024) Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi. 2024. Can We Use Large Language Models to Fill Relevance Judgment Holes? arXiv preprint arXiv:2405.05600 (2024).
Abbasiantaeb et al. (2023) Zahra Abbasiantaeb, Chuan Meng, David Rau, Antonis Krasakis, Hossein A. Rahmani, and Mohammad Aliannejadi. 2023. LLM-based Retrieval and Generation Pipelines for TREC Interactive Knowledge Assistance Track (iKAT) 2023. In TREC.
AI@Meta (2024) AI@Meta. 2024. Llama 3 Model Card. (2024). https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
Arabzadeh et al. (2021a) Negar Arabzadeh, Amin Bigdeli, Morteza Zihayat, and Ebrahim Bagheri. 2021a. Query Performance Prediction Through Retrieval Coherency. In ECIR. Springer, 193–200.
Arabzadeh et al. (2023) Negar Arabzadeh, Radin Hamidi Rad, Maryam Khodabakhsh, and Ebrahim Bagheri. 2023. Noisy Perturbations for Estimating Query Difficulty in Dense Retrievers. In CIKM. 3722–3727.
Arabzadeh et al. (2021b) Negar Arabzadeh, Maryam Khodabakhsh, and Ebrahim Bagheri. 2021b. BERT-QPP: Contextualized Pre-trained Transformers for Query Performance Prediction. In CIKM. 2857–2861.
Arabzadeh et al. (2024) Negar Arabzadeh, Chuan Meng, Mohammad Aliannejadi, and Ebrahim Bagheri. 2024. Query Performance Prediction: From Fundamentals to Advanced Techniques. In ECIR. Springer, 381–388.
Askari et al. (2023) Arian Askari, Mohammad Aliannejadi, Chuan Meng, Evangelos Kanoulas, and Suzan Verberne. 2023. Expand, Highlight, Generate: RL-driven Document Generation for Passage Reranking. In EMNLP. 10087–10099.
Aslam and Pavlu (2007) Javed A Aslam and Virgil Pavlu. 2007. Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions. In ECIR. Springer, 198–209.
Bommasani et al. (2023) Rishi Bommasani, Percy Liang, and Tony Lee. 2023. Holistic Evaluation of Language Models. Annals of the New York Academy of Sciences (2023).
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. In NeurIPS. 1877–1901.
Carmel and Yom-Tov (2010) David Carmel and Elad Yom-Tov. 2010. Estimating the Query Difficulty for Information Retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services 2, 1 (2010), 1–89.
Chee et al. (2023) Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-Bit Quantization of Large Language Models With Guarantees. arXiv preprint arXiv:2307.13304 (2023).
Chen et al. (2022) Xiaoyang Chen, Ben He, and Le Sun. 2022. Groupwise Query Performance Prediction with Bert. In ECIR. Springer, 64–74.
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling Instruction-finetuned Language Models. arXiv preprint arXiv:2210.11416 (2022).
Craswell et al. (2020) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 Deep Learning Track. In TREC.
Craswell et al. (2019) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2019. Overview of the TREC 2019 Deep Learning Track. In TREC.
Craswell et al. (2021) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Jimmy Lin. 2021. Overview of the TREC 2021 Deep Learning Track. In TREC.
Craswell et al. (2022) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. In TREC.
Cronen-Townsend et al. (2002) Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. 2002. Predicting Query Performance. In SIGIR. 299–306.
Cummins et al. (2011) Ronan Cummins, Joemon Jose, and Colm O’Riordan. 2011. Improved Query Performance Prediction Using Standard Deviation. In SIGIR. 1089–1090.
Datta et al. (2022a) Suchana Datta, Debasis Ganguly, Derek Greene, and Mandar Mitra. 2022a. Deep-QPP: A Pairwise Interaction-based Deep Learning Model for Supervised Query Performance Prediction. In WSDM. 201–209.
Datta et al. (2022b) Suchana Datta, Debasis Ganguly, Mandar Mitra, and Derek Greene. 2022b. A Relative Information Gain-based Query Performance Prediction Framework with Generated Query Variants. TOIS (2022).
Datta et al. (2022c) Suchana Datta, Sean MacAvaney, Debasis Ganguly, and Derek Greene. 2022c. A ‘Pointwise-Query, Listwise-Document’ based Query Performance Prediction Approach. In SIGIR. 2148–2153.
Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314 (2023).
Deveaud et al. (2016) Romain Deveaud, Josiane Mothe, and Jian-Yun Nie. 2016. Learning to Rank System Configurations. In CIKM. 2001–2004.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171–4186.
Di Nunzio and Faggioli (2021) Giorgio Maria Di Nunzio and Guglielmo Faggioli. 2021. A Study of a Gain Based Approach for Query Aspects in Recall Oriented Tasks. Applied Sciences 11, 19 (2021), 9075.
Diaz (2007) Fernando Diaz. 2007. Performance Prediction Using Spatial Autocorrelation. In SIGIR. 583–590.
Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A Survey for In-Context Learning. arXiv preprint arXiv:2301.00234 (2022).
Drozdov et al. (2023) Andrew Drozdov, Honglei Zhuang, Zhuyun Dai, Zhen Qin, Razieh Rahimi, Xuanhui Wang, Dana Alon, Mohit Iyyer, Andrew McCallum, Donald Metzler, et al. 2023. PaRaDe: Passage Ranking using Demonstrations with LLMs. In Findings of EMNLP. 14242–14252.
Faggioli et al. (2023a) Guglielmo Faggioli, Laura Dietz, Charles LA Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, et al. 2023a. Perspectives on Large Language Models for Relevance Judgment. In ICTIR. 39–50.
Faggioli et al. (2021a) Guglielmo Faggioli, Marco Ferrante, Nicola Ferro, Raffaele Perego, and Nicola Tonellotto. 2021a. Hierarchical Dependence-aware Evaluation Measures for Conversational Search. In SIGIR. 1935–1939.
Faggioli et al. (2023b) Guglielmo Faggioli, Nicola Ferro, Josiane Mothe, Fiana Raiber, and Maik Fröbe. 2023b. Report on the 1st Workshop on Query Performance Prediction and Its Evaluation in New Tasks (QPP++ 2023) at ECIR 2023. In ACM SIGIR Forum, Vol. 57. 1–7.
Faggioli et al. (2023c) Guglielmo Faggioli, Nicola Ferro, Cristina Muntean, Raffaele Perego, and Nicola Tonellotto. 2023c. A Spatial Approach to Predict Performance of Conversational Search Systems. In IIR. 41–46.
Faggioli et al. (2023d) Guglielmo Faggioli, Nicola Ferro, Cristina Ioana Muntean, Raffaele Perego, and Nicola Tonellotto. 2023d. A Geometric Framework for Query Performance Prediction in Conversational Search. In SIGIR. 1355–1365.
Faggioli et al. (2023e) Guglielmo Faggioli, Thibault Formal, Simon Lupart, Stefano Marchesin, Stephane Clinchant, Nicola Ferro, and Benjamin Piwowarski. 2023e. Towards Query Performance Prediction for Neural Information Retrieval: Challenges and Opportunities. In ICTIR. 51–63.
Faggioli et al. (2023f) Guglielmo Faggioli, Thibault Formal, Stefano Marchesin, Stéphane Clinchant, Nicola Ferro, and Benjamin Piwowarski. 2023f. Query Performance Prediction for Neural IR: Are We There Yet?. In ECIR. Springer, 232–248.
Faggioli et al. (2021b) Guglielmo Faggioli, Oleg Zendel, J. Shane Culpepper, Nicola Ferro, and Falk Scholer. 2021b. An Enhanced Evaluation Framework for Query Performance Prediction. In ECIR. Springer, 115–129.
Faggioli et al. (2022) Guglielmo Faggioli, Oleg Zendel, J. Shane Culpepper, Nicola Ferro, and Falk Scholer. 2022. sMARE: A New Paradigm to Evaluate and Understand Query Performance Prediction Methods. Information Retrieval Journal 25, 2 (2022), 94–122.
Fröbe et al. (2023) Maik Fröbe, Lukas Gienapp, Martin Potthast, and Matthias Hagen. 2023. Bootstrapped nDCG Estimation in the Presence of Unjudged Documents. In ECIR. Springer, 313–329.
Ganguly et al. (2022) Debasis Ganguly, Suchana Datta, Mandar Mitra, and Derek Greene. 2022. An Analysis of Variations in the Effectiveness of Query Performance Prediction. In ECIR. Springer, 215–229.
Ganguly and Yilmaz (2023) Debasis Ganguly and Emine Yilmaz. 2023. Query-specific Variable Depth Pooling via Query Performance Prediction. In SIGIR. 2303–2307.
Gema et al. (2023) Aryo Gema, Luke Daines, Pasquale Minervini, and Beatrice Alex. 2023. Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain. arXiv preprint arXiv:2307.03042 (2023).
Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks. arXiv preprint arXiv:2303.15056 (2023).
Gu et al. (2023) Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. Knowledge Distillation of Large Language Models. arXiv preprint arXiv:2306.08543 (2023).
Gupta et al. (2019) Soumyajit Gupta, Mucahid Kutlu, Vivek Khetan, and Matthew Lease. 2019. Correlation, Prediction and Ranking of Evaluation Metrics in Information Retrieval. In ECIR. Springer, 636–651.
Hashemi et al. (2019) Helia Hashemi, Hamed Zamani, and W. Bruce Croft. 2019. Performance Prediction for Non-factoid Question Answering. In ICTIR. 55–58.
Hauff et al. (2008) Claudia Hauff, Djoerd Hiemstra, and Franciska de Jong. 2008. A Survey of Pre-retrieval Query Performance Predictors. In CIKM. 1419–1420.
Hou et al. (2023) Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023. Large Language Models are Zero-Shot Rankers for Recommender Systems. arXiv preprint arXiv:2305.08845 (2023).
Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR.
Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. TOIS 20, 4 (2002), 422–446.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
Jones et al. (2015) Timothy Jones, Paul Thomas, Falk Scholer, and Mark Sanderson. 2015. Features of Disagreement Between Retrieval Effectiveness Measures. In SIGIR. 847–850.
Khodabakhsh and Bagheri (2023) Maryam Khodabakhsh and Ebrahim Bagheri. 2023. Learning to Rank and Predict: Multi-task Learning for Ad Hoc Retrieval and Query Performance Prediction. Information Sciences 639 (2023), 119015.
Khramtsova et al. (2024) Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, and Guido Zuccon. 2024. Leveraging LLMs for Unsupervised Dense Retriever Ranking. arXiv preprint arXiv:2402.04853 (2024).
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Lafferty and Zhai (2001) John Lafferty and Chengxiang Zhai. 2001. Document Language Models, Query Models, and Risk Minimization for Information Retrieval. In SIGIR. 111–119.
Lassance and Clinchant (2023) Carlos Lassance and Stéphane Clinchant. 2023. The Tale of Two MSMARCO - and Their Unfair Comparisons. In SIGIR. 2431–2435.
Lavrenko and Croft (2001) Victor Lavrenko and W. Bruce Croft. 2001. Relevance-Based Language Models. In SIGIR. 120–127.
Lin et al. (2021) Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In SIGIR. 2356–2362.
Liu et al. (2022) Haokun Liu, Derek Tam, Muqeeth Mohammed, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. In NeurIPS.
Liu and Low (2023) Tiedong Liu and Bryan Kian Hsiang Low. 2023. Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks. arXiv preprint arXiv:2305.14201 (2023).
Lu et al. (2016) Xiaolu Lu, Alistair Moffat, and J. Shane Culpepper. 2016. The Effect of Pooling and Evaluation Depth on IR Metrics. Information Retrieval Journal 19, 4 (2016), 416–445.
Lu et al. (2023) Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, and Yelong Shen. 2023. An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models. arXiv preprint arXiv:2309.09958 (2023).
Lupu et al. (2013) Mihai Lupu, Allan Hanbury, et al. 2013. Patent Retrieval. Foundations and Trends® in Information Retrieval 7, 1 (2013), 1–97.
Ma et al. (2024) Shengjie Ma, Chong Chen, Qi Chu, and Jiaxin Mao. 2024. Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval. arXiv preprint arXiv:2403.18405 (2024).
Ma et al. (2023a) Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023a. Fine-Tuning LLaMA for Multi-Stage Text Retrieval. arXiv preprint arXiv:2310.08319 (2023).
Ma et al. (2023b) Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023b. Zero-Shot Listwise Document Reranking with a Large Language Model. arXiv preprint arXiv:2305.02156 (2023).
Ma et al. (2021) Yixiao Ma, Yunqiu Shao, Yueyue Wu, Yiqun Liu, Ruizhe Zhang, Min Zhang, and Shaoping Ma. 2021. LeCaRD: A Legal Case Retrieval Dataset for Chinese Law System. In SIGIR. 2342–2348.
MacAvaney and Soldaini (2023) Sean MacAvaney and Luca Soldaini. 2023. One-Shot Labeling for Automatic Relevance Estimation. In SIGIR. 2230–2235.
Makary et al. (2017) Mireille Makary, Michael Oakes, Ruslan Mitkov, and Fadi Yamout. 2017. Using Supervised Machine Learning to Automatically Build Relevance Judgments for a Test Collection. In DEXA. IEEE, 108–112.
Makary et al. (2016) Mireille Makary, Michael Oakes, and Fadi Yamout. 2016. Towards Automatic Generation of Relevance Judgments for a Test Collection. In ICDIM. IEEE, 121–126.
Meng (2024) Chuan Meng. 2024. Query Performance Prediction for Conversational Search and Beyond. In SIGIR.
Meng et al. (2023a) Chuan Meng, Mohammad Aliannejadi, and Maarten de Rijke. 2023a. Performance Prediction for Conversational Search Using Perplexities of Query Rewrites. In QPP++2023. 25–28.
Meng et al. (2023b) Chuan Meng, Mohammad Aliannejadi, and Maarten de Rijke. 2023b. System Initiative Prediction for Multi-turn Conversational Information Seeking. In CIKM. 1807–1817.
Meng et al. (2023c) Chuan Meng, Negar Arabzadeh, Mohammad Aliannejadi, and Maarten de Rijke. 2023c. Query Performance Prediction: From Ad-hoc to Conversational Search. In SIGIR. 2583–2593.
Meng et al. (2024) Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, and Maarten de Rijke. 2024. Ranked List Truncation for Large Language Model-based Re-Ranking. In SIGIR.
Mizzaro et al. (2018) Stefano Mizzaro, Josiane Mothe, Kevin Roitero, and Md Zia Ullah. 2018. Query Performance Prediction and Effectiveness Evaluation Without Relevance Judgments: Two Sides of the Same Coin. In SIGIR. 1233–1236.
Moffat (2017) Alistair Moffat. 2017. Computing Maximized Effectiveness Distance for Recall-based Metrics. TKDE 30, 1 (2017), 198–203.
Nuray and Can (2003) Rabia Nuray and Fazli Can. 2003. Automatic Ranking of Retrieval Systems in Imperfect Environments. In SIGIR. 379–380.
Nuray and Can (2006) Rabia Nuray and Fazli Can. 2006. Automatic Ranking of Information Retrieval Systems using Data Fusion. IPM 42, 3 (2006), 595–614.
Pérez-Iglesias and Araujo (2010) Joaquín Pérez-Iglesias and Lourdes Araujo. 2010. Standard Deviation as a Query Hardness Estimator. In SPIRE. Springer, 207–212.
Poesina et al. (2023) Eduard Poesina, Radu Tudor Ionescu, and Josiane Mothe. 2023. IQPP: A Benchmark for Image Query Performance Prediction. In SIGIR. 2953–2963.
Pradeep et al. (2021) Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arXiv preprint arXiv:2101.05667 (2021).
Pradeep et al. (2023a) Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023a. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models. arXiv preprint arXiv:2309.15088 (2023).
Pradeep et al. (2023b) Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023b. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! arXiv preprint arXiv:2312.02724 (2023).
Qin et al. (2023) Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, et al. 2023. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. arXiv preprint arXiv:2306.17563 (2023).
Rahmani et al. (2024) Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, and Daniel Campos. 2024. Synthetic Test Collections for Retrieval Evaluation. arXiv preprint arXiv:2405.07767 (2024).
Ravana et al. (2015) Sri Devi Ravana, Prabha Rajagopal, and Vimala Balakrishnan. 2015. Ranking Retrieval Systems using Pseudo Relevance Judgments. Aslib Journal of Information Management 67, 6 (2015), 700–714.
Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
Roitman (2017) Haggai Roitman. 2017. An Enhanced Approach to Query Performance Prediction Using Reference Lists. In SIGIR. 869–872.
Sachan et al. (2022) Devendra Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2022. Improving Passage Retrieval with Zero-Shot Question Generation. In EMNLP. 3781–3797.
Salemi and Zamani (2024) Alireza Salemi and Hamed Zamani. 2024. Evaluating Retrieval Quality in Retrieval-Augmented Generation. arXiv preprint arXiv:2404.13781 (2024).
Samadi and Rafiei (2023) Mohammadreza Samadi and Davood Rafiei. 2023. Performance Prediction for Multi-hop Questions. arXiv preprint arXiv:2308.06431 (2023).
Santilli and Rodolà (2023) Andrea Santilli and Emanuele Rodolà. 2023. Camoscio: An Italian Instruction-tuned Llama. arXiv preprint arXiv:2307.16456 (2023).
Scells et al. (2018) Harrisen Scells, Leif Azzopardi, Guido Zuccon, and Bevan Koopman. 2018. Query Variation Performance Prediction for Systematic Reviews. In SIGIR. 1089–1092.
Shtok et al. (2010) Anna Shtok, Oren Kurland, and David Carmel. 2010. Using Statistical Decision Theory and Relevance Models for Query-performance Prediction. In SIGIR. 259–266.
Shtok et al. (2012) Anna Shtok, Oren Kurland, David Carmel, Fiana Raiber, and Gad Markovits. 2012. Predicting Query Performance by Query-Drift Estimation. TOIS 30, 2 (2012), 1–35.
Singh et al. (2023) Ashutosh Singh, Debasis Ganguly, Suchana Datta, and Craig McDonald. 2023. Unsupervised Query Performance Prediction for Neural Models utilising Pairwise Rank Preferences. In SIGIR. 2486–2490.
Soboroff et al. (2001) Ian Soboroff, Charles Nicholas, and Patrick Cahan. 2001. Ranking Retrieval Systems without Relevance Judgments. In SIGIR. 66–73.
Sun et al. (2023a) Jiuding Sun, Chantal Shaib, and Byron C. Wallace. 2023a. Evaluating the Zero-shot Robustness of Instruction-tuned Language Models. arXiv preprint arXiv:2306.11270 (2023).
Sun et al. (2021) Weiwei Sun, Chuan Meng, Qi Meng, Zhaochun Ren, Pengjie Ren, Zhumin Chen, and Maarten de Rijke. 2021. Conversations Powered by Cross-Lingual Knowledge. In SIGIR. 1442–1451.
Sun et al. (2023b) Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023b. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. In EMNLP. 14918–14937.
Tang et al. (2023) Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, and Ferhan Ture. 2023. Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models. arXiv preprint arXiv:2310.07712 (2023).
Tao and Wu (2014) Yongquan Tao and Shengli Wu. 2014. Query Performance Prediction by Considering Score Magnitude and Variance Together. In CIKM. 1891–1894.
Tay et al. (2022) Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. 2022. UL2: Unifying Language Learning Paradigms. In ICLR.
Thomas et al. (2017) Paul Thomas, Falk Scholer, Peter Bailey, and Alistair Moffat. 2017. Tasks, Queries, and Rankers in Pre-Retrieval Performance Prediction. In Proceedings of the 22nd Australasian Document Computing Symposium. 1–4.
Thomas et al. (2023) Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2023. Large Language Models Can Accurately Predict Searcher Preferences. arXiv preprint arXiv:2309.10621 (2023).
Tomlinson et al. (2007) Stephen Tomlinson, Douglas W Oard, Jason R Baron, and Paul Thompson. 2007. Overview of the TREC 2007 Legal Track. In TREC.
Tonellotto et al. (2013) Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. 2013. Efficient and Effective Retrieval using Selective Pruning. In WSDM. 63–72.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288 (2023).
Upadhyay et al. (2024a) Shivani Upadhyay, Ehsan Kamalloo, and Jimmy Lin. 2024a. LLMs Can Patch Up Missing Relevance Judgments in Evaluation. arXiv preprint arXiv:2405.04727 (2024).
Upadhyay et al. (2024b) Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024b. UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv preprint arXiv:2406.06519 (2024).
Vlachou and Macdonald (2023) Maria Vlachou and Craig Macdonald. 2023. On Coherence-based Predictors for Dense Query Performance Prediction. arXiv preprint arXiv:2310.11405 (2023).
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 35 (2022), 24824–24837.
Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In ICLR.
Xu et al. (2023) Mingxue Xu, Yao Lei Xu, and Danilo P Mandic. 2023. TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition. arXiv preprint arXiv:2307.00526 (2023).
Yan et al. (2024) Le Yan, Zhen Qin, Honglei Zhuang, Rolf Jagerman, Xuanhui Wang, Michael Bendersky, and Harrie Oosterhuis. 2024. Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing. arXiv preprint arXiv:2404.11791 (2024).
Zamani et al. (2018) Hamed Zamani, W. Bruce Croft, and J. Shane Culpepper. 2018. Neural Query Performance Prediction Using Weak Supervision from Multiple Signals. In SIGIR. 105–114.
Zendel et al. (2023) Oleg Zendel, Binsheng Liu, J. Shane Culpepper, and Falk Scholer. 2023. Entropy-Based Query Performance Prediction for Neural Information Retrieval Systems. In QPP++ 2023. 37–44.
Zhang et al. (2024) Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2024. Are Large Language Models Good at Utility Judgments? arXiv preprint arXiv:2403.19216 (2024).
Zhang et al. (2023b) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. 2023b. Instruction Tuning for Large Language Models: A Survey. arXiv preprint arXiv:2308.10792 (2023).
Zhang et al. (2023c) Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, and Jimmy Lin. 2023c. Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models. arXiv preprint arXiv:2312.02969 (2023).
Zhang et al. (2023a) Yue Zhang, Leyang Cui, Deng Cai, Xinting Huang, Tao Fang, and Wei Bi. 2023a. Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance. arXiv preprint arXiv:2305.13225 (2023).
Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 36 (2024).
Zhou and Croft (2006) Yun Zhou and W. Bruce Croft. 2006. Ranking Robustness: A Novel Framework to Predict Query Performance. In CIKM. 567–574.
Zhou and Croft (2007) Yun Zhou and W. Bruce Croft. 2007. Query Performance Prediction in Web Search Environments. In SIGIR. 543–550.
Zhu et al. (2023) Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large Language Models for Information Retrieval: A Survey. arXiv preprint arXiv:2308.07107 (2023).
Zhuang et al. (2023b) Honglei Zhuang, Zhen Qin, Kai Hui, Junru Wu, Le Yan, Xuanhui Wang, and Michael Bendersky. 2023b. Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels. arXiv preprint arXiv:2310.14122 (2023).
Zhuang et al. (2023a) Shengyao Zhuang, Bing Liu, Bevan Koopman, and Guido Zuccon. 2023a. Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking. arXiv preprint arXiv:2310.13243 (2023).
Zhuang et al. (2023c) Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. 2023c. A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models. arXiv preprint arXiv:2310.09497 (2023).