ConvSDG: Session Data Generation for Conversational Search

Fengran Mo (0000-0002-0838-6994), RALI, Université de Montréal, Montréal, Québec, Canada; Bole Yi (0009-0005-0898-1119), RALI, Université de Montréal, Montréal, Québec, Canada; Kelong Mao (0000-0002-5648-568X), Renmin University of China, Beijing, China; Chen Qu (0000-0002-3273-7109), University of Massachusetts Amherst, Amherst, MA, USA; Kaiyu Huang (0000-0001-6779-1810), Beijing Jiaotong University, Beijing, China; and Jian-Yun Nie (0000-0003-1556-3335), RALI, Université de Montréal, Montréal, Québec, Canada

(2024)

Abstract.

Conversational search provides a more convenient interface for users to search by allowing multi-turn interaction with the search engine. However, the effectiveness of the conversational dense retrieval methods is limited by the scarcity of training data required for their fine-tuning. Thus, generating more training conversational sessions with relevant labels could potentially improve search performance. Based on the promising capabilities of large language models (LLMs) on text generation, we propose ConvSDG, a simple yet effective framework to explore the feasibility of boosting conversational search by using LLM for session data generation. Within this framework, we design dialogue/session-level and query-level data generation with unsupervised and semi-supervised learning, according to the availability of relevance judgments. The generated data are used to fine-tune the conversational dense retriever. Extensive experiments on four widely used datasets demonstrate the effectiveness and broad applicability of our ConvSDG framework compared with several strong baselines.

Conversational Search, Session Data Generation, Large Language Model

Published in: Companion Proceedings of the ACM Web Conference 2024 (WWW ’24 Companion), May 13–17, 2024, Singapore, Singapore. Copyright: acmlicensed. DOI: 10.1145/3589335.3651940. ISBN: 979-8-4007-0172-6/24/05. CCS concepts: Computing methodologies → Discourse, dialogue and pragmatics; Information systems → Information retrieval.

1. Introduction

Conversational search is an emerging area within information retrieval that is poised to become the future of search engines (Gao et al., 2022b). The systems empower users to engage in interactive, multi-turn conversations when searching for information, simplifying the process of addressing intricate information needs. The central hurdle lies in accurately understanding users’ genuine search intent, given that their queries are context-dependent and prone to linguistic issues like omission, coreference, and ambiguity (Dalton et al., 2021).

When tackling conversational search, an intuitive approach is to use conversational query rewriting (CQR). This method involves breaking down the task by employing rewrite models to convert the query of the current turn into a de-contextualized one and then conducting an ad-hoc search using the rewritten query. This approach allows for the use of existing retrievers for the search process. However, it is challenging to directly optimize the rewriting towards search (Lin et al., 2021a; Wu et al., 2022; Mo et al., 2023a; Mao et al., 2023b). Another approach, known as conversational dense retrieval (CDR), focuses on training a conversational dense retriever to grasp the search intent by implicitly learning the latent representations of encoded queries and passages. Unlike CQR, the conversational dense retriever has the ability to naturally learn from the relevance signals between queries and passages.

Nonetheless, the current CQR and CDR techniques, which are trained on limited data, struggle to deliver satisfactory performance due to the prevalence of the long-tail phenomenon (Mao et al., 2022a; Mo et al., 2023b) within conversational sessions and the scarcity of available samples in existing datasets (Dalton et al., 2020, 2021, 2022). Creating conversational search datasets manually is a costly and difficult endeavor, so an intuitive approach is to automatically enrich the session data for model training. While some prior studies (Mao et al., 2022b; Dai et al., 2022) have demonstrated its feasibility, the additional session data generated in this manner often lack the necessary power for continuous model improvement, and the generation process remains complex. Furthermore, these approaches still rely on the assumption that there is a substantial amount of relevant in-domain data available to build data augmentation models. The recent success of large language models (LLMs) (Brown et al., 2020; Wei et al., 2022; Thoppilan et al., 2022; Zhang et al., 2023), which excel in generating texts, has brought notable advancements to the field of information retrieval (Mao et al., 2020). These LLMs are now being applied to support various techniques within the field, such as query expansion (Wang et al., 2023), query generation (Bonifacio et al., 2022; Dai et al., 2023), and document prediction (Gao et al., 2022a; Mackie et al., 2023). Motivated by these developments, our research explores the potential of harnessing LLMs for the automatic generation of session data, adapted appropriately to enhance conversational search performance. In essence, we aim to address the following research questions related to the utilization of LLMs:

RQ1: Is it feasible to exploit automatic session data generation for conversational search models to achieve comparable or better performance than relying on manually constructed datasets?

RQ2: Can we improve conversational search performance by augmenting the diversity of session queries via query rewriting and existing annotations?

In order to address these inquiries, we introduce ConvSDG, a data augmentation framework aimed at employing Large Language Models (LLMs) to accomplish Session Data Generation for Conversational search. Our framework leverages the robust text generation capabilities of LLMs to automatically generate session data that can adapt effectively to the demands of conversational search scenarios. With well-defined supervision signals, this approach mitigates the problem of limited data availability and enhances the performance of conversational dense retrieval. Specifically, we designed two different prompts in our framework, each tailored to specific scenarios, enabling LLMs to generate dialogue-level session data and query-level augmented session data. Subsequently, we create or assign appropriate supervision signals for each query turn, catering to both unsupervised and semi-supervised learning settings. Finally, the generated session data, along with the annotations, are used to fine-tune the conversational dense retriever. We carry out comprehensive experiments using four widely used conversational search datasets and compare ConvSDG against several strong baselines, demonstrating its superior performance.

Our contributions are summarized as follows:

(1) We propose a simple yet effective framework to automatically generate session data for conversational search, showing the feasibility of using automatic data to train the models.

(2) We develop two approaches for instructing the LLM to produce session data, at both dialogue and query levels. Additionally, we generate the necessary supervision signals to facilitate conversational dense retrieval with different learning manners.

(3) We demonstrate the effectiveness of ConvSDG by achieving better results compared to models trained on manually curated data across four datasets and under two distinct settings. The analysis offers additional insights into the potential of automatic data generation to enhance conversational search.

2. Related Work

2.1. Conversational Search

Conversational query rewriting (CQR) and conversational dense retrieval (CDR) represent the two primary approaches to conversational search. CQR focuses on transforming context-dependent queries within a search session into stand-alone queries. Common methods involve selecting relevant tokens from the search session (Voskarides et al., 2020; Lin et al., 2021b; Qian and Dou, 2022) and training a generative rewriter model using human-rewritten queries paired with their respective sessions (Lin et al., 2020; Yu et al., 2020; Vakulenko et al., 2021; Mao et al., 2023a). Some research efforts incorporate reinforcement learning (Wu et al., 2022; Chen et al., 2022) or ranking signals (Mo et al., 2023a; Mao et al., 2023b) to align the generation process with the downstream search task. In contrast, CDR utilizes conversational search session data to perform end-to-end dense retrieval. To enhance conversational search performance, advanced techniques like context denoising (Qu et al., 2020; Yu et al., 2021; Krasakis et al., 2022; Mao et al., 2022a, 2023c; Mo et al., 2023b), data augmentation (Lin et al., 2021a; Dai et al., 2022; Mao et al., 2022b), and the mining of challenging negative examples (Kim and Kim, 2022; Mo et al., 2024) have been explored.

2.2. Large Language Models for Data Generation

The quantity and quality of data hold significant value across various research domains within natural language processing (NLP) and information retrieval (IR). The emergence of pre-trained language models (Devlin et al., 2019; Radford et al., 2019), and more recently, large language models (Brown et al., 2020; Wei et al., 2022; Thoppilan et al., 2022), has opened up new opportunities for automatic text generation. For example, some studies utilize these language models to generate data for a wide array of NLP tasks, including text classification (Wang et al., 2021), acquiring commonsense knowledge (West et al., 2022), natural language inference (Liu et al., 2022), open-domain dialogue generation (Zheng et al., 2023), and sequence labeling (Ding et al., 2023). The enhanced performance observed in downstream tasks through these approaches validates the effectiveness of employing language models for data generation.

2.3. LLM-based Data Generation for IR

Several other approaches use large language models (LLMs) to generate data for ad-hoc IR, for example generating queries from a document (Bonifacio et al., 2022; Dai et al., 2023) or generating a document from a query (Gao et al., 2022a; Mackie et al., 2023; Wang et al., 2023). Different from them, our work investigates how LLMs can be harnessed to create conversational search data, an underexplored area in the existing literature. A recent work, CONVERSER (Huang et al., 2023), focuses on few-shot query generation with in-context learning. It relies on two-stage generation and requires a large number of generated samples, whereas we require only one-stage generation with far fewer generated samples, which is more efficient. Besides, CONVERSER is narrower in scope than our work, and its effectiveness, evaluated on the CAsT-19 dataset only, is much lower than that of our method and other state-of-the-art baselines.

3. Methodology

3.1. Task Definition

Conversational search tries to identify relevant passages $p^{*}$ from a vast collection $\mathcal{C}$ in response to the current ($n$-th) query $q_{n}$. This process is based on the context provided by the ongoing conversation session $\mathcal{S}=\{q_{i}\}_{i=1}^{n-1}$. Each query turn within a session depends on the preceding context, so the conversational retriever must be able to comprehend the user’s search intent. Thus, having access to high-quality and adequate conversational search session data is important. The goal of this paper is to generate new session data $\mathcal{S}^{\prime}=\{q_{i}^{\prime}\}_{i=1}^{n}$ based on LLMs and produce a series of query-passage pairs $\{(q_{i}^{\prime},p_{i}^{\prime})\}_{i=1}^{n}$ for fine-tuning conversational dense retrieval.

Figure 1. Overview of ConvSDG. The framework has three parts: (1) two prompts for session data generation at different levels; (2) supervision signals for the generated data, using PRF for dialogue-level session generation and existing annotations for query-level augmentation; (3) conversational dense retrieval fine-tuning with the generated data.

3.2. Overview

The construction of existing conversational search datasets (Dalton et al., 2020, 2021, 2022) heavily relies on human effort, resulting in insufficient data samples to adequately support fine-tuning of end-to-end conversational dense retrievers. Drawing inspiration from recent successes in harnessing Large Language Models (LLMs) for data generation in various downstream tasks, we introduce the ConvSDG framework. This framework explores the generative capabilities of LLMs to create high-quality conversational session data for conversational search, regardless of whether relevance judgments are available. When relevance judgments are unavailable, we employ LLMs to efficiently generate entire conversational sessions at the dialogue level, using only the topic description. We then apply pseudo-relevance feedback as supervision signals. In contrast, when relevance judgments exist in the dataset, we perform query-level augmentation by rephrasing the query formulation for each turn, recognizing that queries with the same search intent can be expressed in multiple ways.

The overview of our ConvSDG is depicted in Fig. 1. It consists of three main steps: (1) Guiding the LLM to generate session data at two different levels, (2) Generating supervision signals for each generated query turn, and (3) Conducting fine-tuning of conversational dense retriever based on the generated data.

3.3. Dialogue-level Session Generation

A conversational session typically centers around a particular topic (Adlakha et al., 2022), like “animals”, and each turn explores different aspects of that topic, such as “habits” or “various breeds”. To mimic the real-world scenario, it is essential to consider specific conversational phenomena, such as ensuring coherent transitions between turns, handling co-references, and addressing instances of omission, when constructing these conversational sessions (Zamani et al., 2023). Based on our initial experiments and insights gleaned from existing research (Kim et al., 2022; Zheng et al., 2023), we have found that generating a conversation session one turn at a time using generative models does not yield high-quality results. Additionally, we have noticed that maintaining turn-to-turn coherence for Large Language Models (LLMs) solely through prompt instructions is challenging. As a solution, we opt for dialogue-level session data generation, which involves creating the entire conversation session in one go by providing a specific topic description. This approach helps avoid the generation of inconsistent query turns.

In detail, we begin by sampling the topic description for a session from existing datasets (topic descriptions can be sampled in any suitable way), which we then use to create a prompt instruction. Our prompt instruction template is structured as [Instruction, Topic Information], enabling the LLM to generate an entire session in one go. A comprehensive illustration is shown in Fig. 2.
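The [Instruction, Topic Information] layout can be sketched as follows. This is only an illustration: the instruction wording, the function name `build_session_prompt`, and the `num_turns` parameter are assumptions introduced here, and the actual prompt is the one shown in Fig. 2.

```python
# Hypothetical instruction text; the real wording is the one given in Fig. 2.
INSTRUCTION = (
    "Generate an information-seeking conversation of {num_turns} turns about the topic below. "
    "Later questions may rely on earlier turns through coreference or omission. "
    "For each turn, output a question and a short answer."
)


def build_session_prompt(topic: str, num_turns: int = 8) -> str:
    """Assemble a prompt in the [Instruction, Topic Information] layout."""
    instruction = INSTRUCTION.format(num_turns=num_turns)
    topic_information = f"Topic: {topic.strip()}"
    # The whole session is requested in a single call rather than turn by turn.
    return f"{instruction}\n\n{topic_information}"


prompt = build_session_prompt("The habits and breeds of domestic cats", num_turns=6)
print(prompt)
```

A single call of this kind would return the whole session, which is then split into query turns (and hypothetical answers) by whatever output parser fits the chosen instruction.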

Once we have the generated session data, we require relevance judgments that establish the connection between query turns and passages for the fine-tuning of the conversational dense retriever. In practice, obtaining such annotations is significantly more costly compared to acquiring conversation session data (Dalton et al., 2020). To enhance the flexibility of our framework, we attempt to train models in an unsupervised manner, i.e. we do not rely on relevance judgments provided by human experts. Instead, we adopt the idea of pseudo-relevance feedback (Tao and Zhai, 2007) to create pseudo supervision signals for each query turn. It is worth noting that there is not a single fixed method for this purpose, and we leave further exploration of this area for future research.

Concretely, we perform off-the-shelf retrieval on each query turn, selecting three passages from top-5 at random as pseudo-relevant documents for that specific turn. However, it is important to consider the format of the input query used for this off-the-shelf retrieval because the original queries in the conversational session are not stand-alone and always rely on the context of the ongoing conversation. To prevent topic shifts and find an optimal solution, we explore four query forms that take contextual information into account: (1) q+a: which concatenates the current query and the corresponding hypothetical answer generated by LLM, (2) q+a+topic: which combines the current query, its answer, and the session topic information, (3) convq+topic: which concatenates the current query, all previous queries, and the session topic information, (4) convq+conva+topic: which combines the current query, all previous queries and their corresponding answers, along with the session topic information. Ultimately, we select the q+a+topic format for reformulating the input query due to its demonstrated effectiveness, as discussed later in Sec. 5.2.
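A minimal sketch of the pseudo-labeling step, assuming that a query form is built by plain string concatenation and that the retriever returns a ranked list of passage ids. The names `build_prf_query`, `sample_pseudo_positives`, and `toy_retrieve` are hypothetical, the concatenation order is an assumption, and the placeholder retriever stands in for the off-the-shelf one.

```python
import random
from typing import List, Sequence


def build_prf_query(form: str, queries: Sequence[str], answers: Sequence[str], topic: str) -> str:
    """Build the retrieval input for the last turn in `queries` under one of the four forms."""
    q, a = queries[-1], answers[-1]
    previous_q = list(queries[:-1])
    if form == "q+a":
        parts = [q, a]
    elif form == "q+a+topic":
        parts = [q, a, topic]
    elif form == "convq+topic":
        parts = [q, *previous_q, topic]
    elif form == "convq+conva+topic":
        # Assumption: previous queries are interleaved with their own answers,
        # and the current answer is not included.
        history = [text for pair in zip(previous_q, answers[:-1]) for text in pair]
        parts = [q, *history, topic]
    else:
        raise ValueError(f"unknown query form: {form}")
    return " ".join(parts)


def sample_pseudo_positives(ranked_ids: Sequence[str], rng: random.Random,
                            top_k: int = 5, num_positives: int = 3) -> List[str]:
    """Pick `num_positives` passages at random from the top-`top_k` retrieved ones."""
    pool = list(ranked_ids[:top_k])
    return rng.sample(pool, min(num_positives, len(pool)))


def toy_retrieve(query: str) -> List[str]:
    """Placeholder for the off-the-shelf retriever; returns a fixed ranking."""
    return [f"p{i}" for i in range(10)]


queries = ["What do cats eat?", "How often should they be fed?"]
answers = ["Cats are carnivores and eat meat.", "Adult cats are usually fed twice a day."]
prf_query = build_prf_query("q+a+topic", queries, answers, topic="cat care")
positives = sample_pseudo_positives(toy_retrieve(prf_query), random.Random(0))
print(prf_query)
print(positives)
```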

3.4. Query-level Augmented Generation

While utilizing pseudo-relevance feedback to create supervision signals is efficient and often yields comparable results to fully supervised methods, there is still a potential downside of introducing false positive signals that could misguide model training. To mitigate this risk, we can leverage the limited relevance judgments provided by human experts in existing conversational search datasets to carry out query-level augmented generation. Our objective is to generate additional conversational search session data based on the original annotations, specifically for each query turn. The underlying assumption here is that the sequence of conversation should not be unique. In other words, the same user search intent can be expressed in different natural language forms, leading to various conversational sessions on the same topic. This variability is common in real-world scenarios.

Our method generates new data by making adjustments to existing data points, guided by prior knowledge about the problem’s underlying structure. The augmented data points, derived from labeled data within our framework, can be directly applied in semi-supervised learning through consistency regularization. Here, “semi-supervised” means that we train on the original data points with manual labels together with the generated data points, which have no manual labels of their own. In practice, we prompt the LLM to rewrite each query, providing an alternative natural language expression with the same meaning. This instruction template adheres to the format [Instruction, Input Query], as illustrated in Fig. 2. By repeating this process $t$ times, we generate $t$ augmented queries for each original query turn, so the generated data is $t$ times the size of the initial dataset. Subsequently, the original relevance judgments for each query turn are assigned to each corresponding augmented query, serving as supervision signals.
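The label-inheritance step can be sketched as below. The function `augment_turns`, the turn-id scheme, and the `toy_rewrite` stand-in for the LLM call are all hypothetical; with a real LLM, each of the $t$ calls would be expected to return a different paraphrase.

```python
from typing import Callable, Dict, List, Tuple


def augment_turns(
    turns: Dict[str, str],
    qrels: Dict[str, List[str]],
    rewrite: Callable[[str], str],
    t: int = 2,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Create t rewritten copies of every query turn and copy its relevance labels."""
    aug_turns: Dict[str, str] = {}
    aug_qrels: Dict[str, List[str]] = {}
    for turn_id, query in turns.items():
        for k in range(1, t + 1):
            new_id = f"{turn_id}-aug{k}"
            aug_turns[new_id] = rewrite(query)
            # The augmented query inherits the judgments of the original turn.
            aug_qrels[new_id] = list(qrels[turn_id])
    return aug_turns, aug_qrels


def toy_rewrite(query: str) -> str:
    """Deterministic placeholder for the LLM rewriting call."""
    return "Could you tell me: " + query


turns = {"31_1": "What is throat cancer?", "31_2": "Is it treatable?"}
qrels = {"31_1": ["MARCO_1"], "31_2": ["MARCO_2", "CAR_9"]}

aug_turns, aug_qrels = augment_turns(turns, qrels, toy_rewrite, t=2)
# Semi-supervised training set: original labeled turns plus the augmented ones.
train_turns = {**turns, **aug_turns}
print(len(turns), len(aug_turns), len(train_turns))
```

With $t=2$, the combined training set is three times the size of the original one, which matches the 745 and 2,235 turn counts reported in Table 2.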

3.5. Conversational Dense Retrieval

We conduct fine-tuning for conversational dense retrieval using the session data we have generated and the associated supervision signals. For this task, we employ a widely used ad-hoc search retriever, ANCE (Xiong et al., 2020), which serves as both the query and passage encoder, denoted as $\mathcal{F}_{Q}$ and $\mathcal{F}_{P}$ , respectively. In this process, we consider all preceding queries within the same session to reformulate the current query turn $q_{n}^{ref}$ , expressed as

$$q_{n}^{ref}=q_{1}\circ\cdots\circ q_{i}\circ\cdots\circ q_{n-1}\circ q_{n},\quad i\in[1,n-1] \qquad (1)$$

where $\circ$ denotes concatenation. Then, a similarity function $\mathcal{S}$ based on dot product is applied to score a candidate passage $p$ as:

$$\mathcal{S}(q_{n}^{ref},p)=\mathcal{F}_{Q}(q_{n}^{ref})^{T}\cdot\mathcal{F}_{P}(p) \qquad (2)$$

During the training phase, we update only the query encoder, while the passage encoder is frozen. The training objective follows the widely used contrastive learning loss:

$$\mathcal{L}=-\log\frac{e^{\mathcal{S}(q_{n}^{ref},p^{+})}}{e^{\mathcal{S}(q_{n}^{ref},p^{+})}+\sum_{p^{-}\in C^{-}}e^{\mathcal{S}(q_{n}^{ref},p^{-})}} \qquad (3)$$

where $p^{+}$ and $p^{-}$ denote the positive and negative passages for each query turn, and $C^{-}$ is the set of negatives. During the inference phase, we perform Approximate Nearest Neighbor (ANN) search over the dense index using Faiss (Johnson et al., 2019).
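The three steps of this section (query reformulation by concatenation, dot-product scoring, and the contrastive objective) can be sketched with toy vectors as follows. This is a sketch under assumptions: the embeddings are hand-written stand-ins for encoder outputs, the space separator in `reformulate` is illustrative, and the loss is written in its negative log-likelihood form, which is the quantity usually minimized in contrastive training.

```python
import math
from typing import List, Sequence


def reformulate(history: Sequence[str], current: str, sep: str = " ") -> str:
    """Concatenate all previous queries of the session with the current query."""
    return sep.join([*history, current])


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between a query embedding and a passage embedding."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(q_vec: Sequence[float], pos_vec: Sequence[float],
                     neg_vecs: List[Sequence[float]]) -> float:
    """Negative log of the softmax probability assigned to the positive passage."""
    pos_score = dot(q_vec, pos_vec)
    scores = [pos_score] + [dot(q_vec, n) for n in neg_vecs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


session_query = reformulate(["What is throat cancer?", "Is it treatable?"], "What are its symptoms?")

# Toy 2-d embeddings standing in for the query and passage encoder outputs.
q_vec = [1.0, 0.0]
pos_vec = [2.0, 0.0]
neg_vecs = [[0.0, 1.0], [1.0, 0.0]]
loss = contrastive_loss(q_vec, pos_vec, neg_vecs)
print(session_query)
print(round(loss, 4))
```

Since only the query encoder is updated, the passage vectors in such a setup can be precomputed once and reused across training steps.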

4. Experimental Settings

4.1. Datasets and Evaluation Metrics

We carry out extensive experiments on four widely used conversational search datasets: CAsT-19 (Dalton et al., 2020), CAsT-20 (Dalton et al., 2021), CAsT-21 (Dalton et al., 2022), and TopiOCQA (Adlakha et al., 2022). The three CAsT datasets are curated by the experts of the TREC Conversational Assistance Track (CAsT) and each dataset comprises information-seeking conversations encompassing hundreds of turns in total. The TopiOCQA dataset addresses the novel challenge of topic switching, a common phenomenon in real-world scenarios. Table 1 provides an overview of the fundamental statistics of the datasets (these statistics consider only the turns that possess relevance labels). Following previous studies (Dalton et al., 2021, 2022; Krasakis et al., 2022; Mao et al., 2023a), we employ Mean Reciprocal Rank (MRR), NDCG, and Recall as our evaluation metrics, computed with the pytrec_eval tool (Van Gysel and de Rijke, 2018).
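For readers unfamiliar with these metrics, the following sketch computes them for a single query turn. It is illustrative only and is not the pytrec_eval implementation: it assumes that any grade above zero counts as relevant and uses linear gain with a $\log_2$ rank discount for NDCG. The reported results are averages of such per-turn values.

```python
import math
from typing import Dict, List


def reciprocal_rank(ranking: List[str], rels: Dict[str, int]) -> float:
    """1 / rank of the first relevant passage, or 0 if none is retrieved."""
    for rank, pid in enumerate(ranking, start=1):
        if rels.get(pid, 0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], rels: Dict[str, int], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    relevant = {pid for pid, grade in rels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


def ndcg_at_k(ranking: List[str], rels: Dict[str, int], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ordering (linear gain)."""
    dcg = sum(rels.get(pid, 0) / math.log2(i + 1)
              for i, pid in enumerate(ranking[:k], start=1))
    ideal_grades = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal_grades, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: a ranked list and graded judgments for one query turn.
ranking = ["p3", "p1", "p7", "p2"]
rels = {"p1": 2, "p2": 1, "p9": 1}
print(reciprocal_rank(ranking, rels))
print(recall_at_k(ranking, rels, 3))
print(ndcg_at_k(ranking, rels, 3))
```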

4.2. Baseline Models

We compare ConvSDG with the following models:

(1) BM25 (Robertson et al., 2009): A widely used unsupervised sparse retrieval model.

(2) ConvDR (Yu et al., 2021): A conversational dense retrieval model based on the ANCE retriever, available in both zero-shot and few-shot variants, which is supervised by both conversational search data and manually rewritten queries.

(3) ZeCo (Krasakis et al., 2022): A zero-shot conversational search method that matches only the contextualized terms of the current query with passages based on ColBERT (Khattab and Zaharia, 2020).

(4) LLMQR: A conversational query rewriting method based on the large language model ChatGPT (gpt-turbo-3.5-4k), which directly rewrites the current turn given the conversation session, without supervision signals.

(5) CONVERSER (Huang et al., 2023): A few-shot two-stage conversational query generation method based on a large language model for training conversational dense retrievers.

(6) T5QR (Lin et al., 2020): A conversational query rewriting method based on the T5 model using human-rewritten queries in the QReCC dataset (Anantha et al., 2021).

(7) ConvGQR (Mo et al., 2023a): A conversational query reformulation method that combines query rewriting and expansion, leveraging human-rewritten queries and gold answers in the QReCC dataset as generation objectives.

(8) w/o Aug. (Mao et al., 2023c): A conversational dense retriever that fine-tunes ANCE with the original (non-augmented) conversational search data. We use it as the baseline of our methods to see the impact of data augmentation. The fine-tuning process is only based on the CAsT-19 training set.

Table 1. Statistics of the three CAsT and TopiOCQA datasets.

Dataset (split)	CAsT-19 (Train)	CAsT-19 (Test)	CAsT-20 (Test)	CAsT-21 (Test)	TopiOCQA (Train)	TopiOCQA (Test)
# Conversations	30	20	25	18	3,509	205
# Turns (Queries)	108	173	208	157	45,450	2,514
# Passages/Docs	38M	38M	38M	40M	25M	25M

4.3. Implementation Details

We utilize OpenAI’s ChatGPT (gpt-turbo-3.5-4k) API for both dialogue-level session generation and query-level augmented generation with the default hyper-parameters. For the pseudo-relevance supervision signals, we randomly select three passages from the top-5 retrieved passages using PRF. The ANCE system serves as the backbone model for fine-tuning conversational dense retrieval, with maximum input lengths truncated at 64 for queries, 384 for passages, and 512 for concatenated sessions. Model training employs a batch size of 16 with 5 epochs. Further details can be found in our released code repository (https://github.com/fengranMark/ConvSDG).

4.4. Experimental Scenarios

We evaluate our method on the following two training scenarios, with and without the relevance judgment, and compare with the suitable baseline models:

Dialogue-level augmentation w/o relevance judgment: We utilize solely the topic information from the CAsT-19 and TopiOCQA training sets for ConvSDG to perform dialogue-level session generation. Subsequently, we assess its performance on the respective test sets of all three CAsT datasets and TopiOCQA. Consequently, in the absence of existing relevance judgments, conversational dense retrieval fine-tuning is carried out following an unsupervised learning approach. The methods we compare include unsupervised and zero-shot methods, as well as the direct use of LLM.

Query-level augmentation w/ relevance judgment: We employ ConvSDG for query-level augmented generation and use the limited relevance judgments available in the CAsT-19 training set. The evaluation is then carried out on three CAsT datasets. As a result, conversational dense retrieval fine-tuning takes place in a semi-supervised learning manner, with the compared methods being supervised and trained using the same data samples.

5. Experimental Results

Table 2. Performance of two different settings on CAsT datasets. $\dagger$ denotes significant improvements with t-test at $p<0.05$ over all compared methods and bold indicates the best results in corresponding settings. The turns of CONVERSER are quoted from the original paper. The turns of our ConvSDG with relevance judgment are expanded two times and combined with the original 745 turns in the original training set.

Method	Turn	CAsT-19 MRR	CAsT-19 NDCG@3	CAsT-19 Recall@100	CAsT-20 MRR	CAsT-20 NDCG@3	CAsT-20 Recall@100	CAsT-21 MRR	CAsT-21 NDCG@3	CAsT-21 Recall@100
Dialogue-level augmentation w/o relevance judgement
BM25	-	39.7	18.0	20.1	13.9	7.2	11.5	30.3	16.6	24.9
ZeCo	-	-	23.8	21.6	-	17.6	20.0	-	23.4	26.7
ConvDR	-	42.0	24.7	18.3	23.4	15.0	15.0	-	-	-
LLMQR	-	57.8	35.0	27.7	36.8	24.5	28.2	42.1	28.2	31.3
CONVERSER	230k	35.8	21.4	-	-	-	-	-	-	-
ConvSDG (Ours)	1,704	59.5 ${}^{\dagger}$	32.1	33.5 ${}^{\dagger}$	37.9 ${}^{\dagger}$	25.3 ${}^{\dagger}$	34.9 ${}^{\dagger}$	50.2 ${}^{\dagger}$	33.2 ${}^{\dagger}$	37.5 ${}^{\dagger}$
Query-level augmentation w/ relevance judgement
T5QR	2,235	52.8	31.3	25.3	29.7	19.0	24.2	30.3	20.5	20.6
ConvGQR	2,235	61.0	34.6	30.3	35.1	24.3	30.3	37.6	24.6	28.4
ConvDR	2,235	62.1	35.0	29.7	34.6	23.6	28.8	37.6	25.2	31.4
w/o Aug.	745	56.8	31.5	29.2	34.2	22.6	32.6	45.6	29.8	35.2
ConvSDG (Ours)	2,235	60.6	35.3	32.1 ${}^{\dagger}$	36.5 ${}^{\dagger}$	24.2	34.2 ${}^{\dagger}$	47.2 ${}^{\dagger}$	30.8 ${}^{\dagger}$	36.8 ${}^{\dagger}$

Table 3. Performance on TopiOCQA dataset. $\dagger$ denotes significant improvements with t-test at $p<0.05$ over all compared methods and bold indicates the best results (except for the results for reference).

Method	Turn	MRR	NDCG@3	Recall@10	Recall@100
Dialogue-level augmentation w/o relevance judgement
BM25	-	10.7	9.7	11.2	26.7
ConvDR	-	10.3	9.1	15.7	35.4
ConvSDG (Ours)	5,231	21.4 ${}^{\dagger}$	19.9 ${}^{\dagger}$	37.8 ${}^{\dagger}$	58.0 ${}^{\dagger}$
Query-level augmentation w/ relevance judgement
T5QR	5,231	18.4	17.6	30.8	45.3
ConvGQR	5,231	8.0	7.3	14.3	25.5
For Reference
T5QR	63,501	23.0	22.2	37.6	54.4
ConvGQR	63,501	25.6	24.3	41.8	58.8

5.1. Main Results

The overall performance comparisons on CAsT and TopiOCQA datasets with different settings are presented in Table 2 and Table 3.

In the setting without relevance judgments, which uses dialogue-level augmentation (Tables 2 and 3), ConvSDG with unsupervised training outperforms the compared methods across most evaluation metrics on both the CAsT and TopiOCQA datasets. In particular, it achieves significant relative improvements of 19.2% in MRR and 17.7% in NDCG@3 over the second-best results on the challenging CAsT-21 dataset. Besides, with fewer training samples, it even outperforms models trained with manual relevance judgments on CAsT-20 and CAsT-21. These findings confirm the quality and usefulness of our automatically generated data. Moreover, on the TopiOCQA dataset (Table 3), our approach outperforms all compared methods that use the same augmented samples and is on par with supervised methods trained on the full original training set. These results emphasize the importance of conversational dense retrieval fine-tuning, especially when compared to zero-shot methods like ZeCo and ConvDR. Overall, our approach addresses the data scarcity challenge effectively through the automatic generation of conversational search data, validating our motivation.
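The relative improvements quoted for CAsT-21 can be checked against the dialogue-level rows of Table 2, taking LLMQR (MRR 42.1, NDCG@3 28.2) as the second-best method and ConvSDG (MRR 50.2, NDCG@3 33.2) as the best:

$$\frac{50.2-42.1}{42.1}\approx 19.2\%\ \text{(MRR)},\qquad \frac{33.2-28.2}{28.2}\approx 17.7\%\ \text{(NDCG@3)}$$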

In the scenario with query-level augmentation, where relevance judgments are available in the training data (Table 2), ConvSDG continues to outperform the compared methods across most evaluation metrics when trained on the same number of samples. Specifically, it achieves relative improvements of 25.5% in MRR, 22.0% in NDCG@3, and 17.2% in Recall@100 over the best of the compared baselines (T5QR, ConvGQR, and ConvDR) on the more challenging CAsT-21 dataset. These gains, enabled by the augmented query turns, show the effectiveness of our approach in rewriting the queries in the original datasets while retaining the same search intent. However, compared to dialogue-level ConvSDG without relevance judgment, ConvSDG with query-level augmentation is not consistently more effective, even though it leverages human relevance judgments. This result suggests that there is still room to improve system performance by fully harnessing the existing relevance annotations and enhancing the diversity of the conversational sessions. It is also worth noting that the compared methods do not perform well on the more challenging CAsT-21 dataset. This discrepancy could be attributed to the fact that these methods depend on human-rewritten queries, and our generated augmented session queries may not align perfectly with these original annotations. This observation implies that CDR methods might be more suitable than CQR methods when trained on the augmented queries.

5.2. Effectiveness of Supervision Signals

Fig. 3 presents the retrieval performance achieved with the four query input forms discussed in Sec. 3.3 for generating pseudo-relevance feedback (PRF) based on the ANCE dense retriever. In this setup, the top results obtained from PRF are employed as the pseudo supervision signals for the corresponding generated session query turns, which are then used to fine-tune the conversational dense retriever. Our findings indicate that using only the information from the current turn, i.e., the current turn’s query and the corresponding hypothetical answer generated by the LLM, tends to yield better results than incorporating the entire conversation context, such as concatenating queries or answers from previous turns. This observation can be attributed to the fact that ad-hoc search retrievers lack the capability to represent an entire conversation session effectively, even though the underlying search intent of the current turn query is context-dependent. Moreover, we observe that including the topic information of each conversation session is beneficial. The topic information helps the augmentation process produce more relevant, on-topic data than is produced without it. Although topic information may not always be available during the inference stage, we can still leverage it to construct training datasets.

5.3. Effectiveness of Generated Data Size

We present an analysis of the effectiveness of employing varying sizes of generated data for conversational dense retrieval fine-tuning in two different scenarios across three CAsT datasets, as depicted in Fig. 4. The percentages are computed relative to the full set of generated query turns in both the unsupervised and semi-supervised settings. For unsupervised learning, we observe a notable improvement in system performance as the volume of utilized data increases. This observation demonstrates the pivotal role of augmented data in mitigating the data scarcity issue, and it suggests that there is further potential for enhancing model performance by expanding the dataset even more. On the other hand, for semi-supervised learning, the fine-tuned models do not consistently outperform models trained solely on the original training set until a sufficient amount of generated data is added. This indicates that the generated data points might alter the data distribution, and it implies the need for appropriate filtering mechanisms. Nonetheless, the performance gain achieved when the full generated data is added underlines the effectiveness of our generated data for model training.

6. Conclusion

In this paper, we introduce ConvSDG, a novel framework for generating session data with LLMs for conversational search. By harnessing the text generation capabilities of LLMs, we fine-tune a conversational dense retriever on session data generated in either an unsupervised or a semi-supervised setting. Our experiments on four public datasets demonstrate the effectiveness of our approach: it outperforms existing comparable methods and even surpasses fully supervised models. Additionally, we analyze several factors that affect the automatic construction of conversational search session data, offering insights for future exploration in this domain. Our study shows that an LLM can be used to generate additional training data for fine-tuning a dense retriever, adding to the already extensive body of work that exploits LLMs for search. More research is still required to find the best way to leverage LLMs for enhancing conversational search.

References

Adlakha et al. (2022) Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, and Siva Reddy. 2022. TopiOCQA: Open-domain Conversational Question Answering with Topic Switching. Transactions of the Association for Computational Linguistics 10 (2022), 468–483.
Anantha et al. (2021) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-Domain Question Answering Goes Conversational via Question Rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 520–534.
Bonifacio et al. (2022) Luiz Henrique Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Frassetto Nogueira. 2022. InPars: Data Augmentation for Information Retrieval using Large Language Models. CoRR abs/2202.05144 (2022). arXiv:2202.05144 https://arxiv.longhoe.net/abs/2202.05144
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Chen et al. (2022) Zhiyu Chen, Jie Zhao, Anjie Fang, Besnik Fetahu, Rokhlenko Oleg, and Shervin Malmasi. 2022. Reinforced Question Rewriting for Conversational Question Answering. (2022).
Dai et al. (2022) Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Y Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, and Kelvin Guu. 2022. Dialog Inpainting: Turning Documents into Dialogs. In International Conference on Machine Learning. PMLR, 4558–4586.
Dai et al. (2023) Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. 2023. Promptagator: Few-shot Dense Retrieval From 8 Examples. In 11th International Conference on Learning Representations, ICLR 2023.
Dalton et al. (2020) Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2020. TREC CAsT 2019: The conversational assistance track overview. In Proceedings of TREC.
Dalton et al. (2021) Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2021. CAsT 2020: The Conversational Assistance Track Overview. In Proceedings of TREC.
Dalton et al. (2022) Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2022. TREC CAsT 2021: The conversational assistance track overview. In Proceedings of TREC.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
Ding et al. (2023) Bosheng Ding, Chengwei Qin, Linlin Liu, Lidong Bing, Shafiq Joty, and Boyang Li. 2023. Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 11173–11195.
Gao et al. (2022b) Jianfeng Gao, Chenyan Xiong, Paul Bennett, and Nick Craswell. 2022b. Neural approaches to conversational information retrieval. arXiv preprint arXiv:2201.05176 (2022).
Gao et al. (2022a) Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2022a. Precise Zero-Shot Dense Retrieval without Relevance Labels. CoRR abs/2212.10496 (2022).
Huang et al. (2023) Chao-Wei Huang, Chen-Yu Hsu, Tsu-Yuan Hsu, Chen-An Li, and Yun-Nung Chen. 2023. CONVERSER: Few-Shot Conversational Dense Retrieval with Synthetic Data Generation. arXiv preprint arXiv:2309.06748 (2023).
Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval (SIGIR). ACM, 39–48.
Kim et al. (2022) Minju Kim, Chaehyeong Kim, Yong Ho Song, Seung-won Hwang, and Jinyoung Yeo. 2022. BotsTalk: Machine-sourced Framework for Automatic Curation of Large-scale Multi-skill Dialogue Datasets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 5149–5170.
Kim and Kim (2022) Sungdong Kim and Gangwoo Kim. 2022. Saving dense retriever from shortcut dependency in conversational search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 10278–10287.
Krasakis et al. (2022) Antonios Minas Krasakis, Andrew Yates, and Evangelos Kanoulas. 2022. Zero-shot Query Contextualization for Conversational Search. In Proceedings of the 45th International ACM SIGIR conference on research and development in Information Retrieval (SIGIR).
Lin et al. (2021a) Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021a. Contextualized Query Embeddings for Conversational Search. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 1004–1015.
Lin et al. (2020) Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, and Jimmy Lin. 2020. Conversational question reformulation via sequence-to-sequence architectures and pretrained language models. arXiv preprint arXiv:2004.01909 (2020).
Lin et al. (2021b) Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, and Jimmy Lin. 2021b. Multi-stage conversational passage retrieval: An approach to fusing term importance estimation and neural query rewriting. ACM Transactions on Information Systems (TOIS) 39, 4 (2021), 1–29.
Liu et al. (2022) Alisa Liu, Swabha Swayamdipta, Noah A Smith, and Yejin Choi. 2022. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. In Findings of the Association for Computational Linguistics: EMNLP 2022. 6826–6847.
Mackie et al. (2023) Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. 2023. Generative Relevance Feedback with Large Language Models. CoRR abs/2304.13157 (2023). https://doi.org/10.48550/arXiv.2304.13157 arXiv:2304.13157
Mao et al. (2023a) Kelong Mao, Zhicheng Dou, Haonan Chen, Fengran Mo, and Hongjin Qian. 2023a. Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search. arXiv preprint arXiv:2303.06573 (2023).
Mao et al. (2023b) Kelong Mao, Zhicheng Dou, Bang Liu, Hongjin Qian, Fengran Mo, Xiangli Wu, Xiaohua Cheng, and Zhao Cao. 2023b. Search-Oriented Conversational Query Editing. In Findings of the Association for Computational Linguistics: ACL 2023. 4160–4172.
Mao et al. (2022a) Kelong Mao, Zhicheng Dou, and Hongjin Qian. 2022a. Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. In Proceedings of the 45th International ACM SIGIR conference on research and development in Information Retrieval (SIGIR).
Mao et al. (2022b) Kelong Mao, Zhicheng Dou, Hongjin Qian, Fengran Mo, Xiaohua Cheng, and Zhao Cao. 2022b. ConvTrans: Transforming Web Search Sessions for Conversational Dense Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2935–2946.
Mao et al. (2023c) Kelong Mao, Hongjin Qian, Fengran Mo, Zhicheng Dou, Bang Liu, Xiaohua Cheng, and Zhao Cao. 2023c. Learning Denoised and Interpretable Session Representation for Conversational Search. In Proceedings of the ACM Web Conference 2023. 3193–3202.
Mao et al. (2020) Kelong Mao, Xi Xiao, Jieming Zhu, Biao Lu, Ruiming Tang, and Xiuqiang He. 2020. Item tagging for information retrieval: A tripartite graph neural network based approach. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2327–2336.
Mo et al. (2023a) Fengran Mo, Kelong Mao, Yutao Zhu, Yihong Wu, Kaiyu Huang, and Jian-Yun Nie. 2023a. ConvGQR: Generative Query Reformulation for Conversational Search. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 4998–5012.
Mo et al. (2023b) Fengran Mo, Jian-Yun Nie, Kaiyu Huang, Kelong Mao, Yutao Zhu, Peng Li, and Yang Liu. 2023b. Learning to Relate to Previous Turns in Conversational Search. In 29th ACM SIGKDD Conference On Knowledge Discovery and Data Mining (SIGKDD).
Mo et al. (2024) Fengran Mo, Chen Qu, Kelong Mao, Tianyu Zhu, Zhan Su, Kaiyu Huang, and Jian-Yun Nie. 2024. History-Aware Conversational Dense Retrieval. arXiv preprint arXiv:2401.16659 (2024).
Qian and Dou (2022) Hongjin Qian and Zhicheng Dou. 2022. Explicit Query Rewriting for Conversational Dense Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 4725–4737.
Qu et al. (2020) Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 539–548.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
Tao and Zhai (2007) Tao Tao and ChengXiang Zhai. 2007. An exploration of proximity measures in information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 295–302.
Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Agüera y Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. 2022. LaMDA: Language Models for Dialog Applications. CoRR abs/2201.08239 (2022). arXiv:2201.08239 https://arxiv.longhoe.net/abs/2201.08239
Vakulenko et al. (2021) Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2021. Question rewriting for conversational question answering. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 355–363.
Van Gysel and de Rijke (2018) Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR. ACM.
Voskarides et al. (2020) Nikos Voskarides, Dan Li, Pengjie Ren, Evangelos Kanoulas, and Maarten de Rijke. 2020. Query resolution for conversational search with limited supervision. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 921–930.
Wang et al. (2023) Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query Expansion with Large Language Models. CoRR abs/2303.07678 (2023). https://doi.org/10.48550/arXiv.2303.07678 arXiv:2303.07678
Wang et al. (2021) Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. 2021. Towards zero-label language learning. arXiv preprint arXiv:2109.09193 (2021).
Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models are Zero-Shot Learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=gEZrGCozdqR
West et al. (2022) Peter West, Chandra Bhagavatula, Jack Hessel, Jena Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. 2022. Symbolic Knowledge Distillation: from General Language Models to Commonsense Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4602–4625.
Wu et al. (2022) Zeqiu Wu, Yi Luan, Hannah Rashkin, David Reitter, and Gaurav Singh Tomar. 2022. CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning. (2022).
Xiong et al. (2020) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In International Conference on Learning Representations.
Yu et al. (2020) Shi Yu, Jiahua Liu, Jingqin Yang, Chenyan Xiong, Paul Bennett, Jianfeng Gao, and Zhiyuan Liu. 2020. Few-shot generative conversational query rewriting. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 1933–1936.
Yu et al. (2021) Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021. Few-shot conversational dense retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 829–838.
Zamani et al. (2023) Hamed Zamani, Johanne R Trippas, Jeff Dalton, Filip Radlinski, et al. 2023. Conversational information seeking. Foundations and Trends® in Information Retrieval 17, 3-4 (2023), 244–456.
Zhang et al. (2023) Le Zhang, Yihong Wu, Fengran Mo, Jian-Yun Nie, and Aishwarya Agrawal. 2023. MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2023. 1195–1210.
Zheng et al. (2023) Chujie Zheng, Sahand Sabour, Jiaxin Wen, Zheng Zhang, and Minlie Huang. 2023. AugESC: Dialogue augmentation with large language models for emotional support conversation. In Findings of the Association for Computational Linguistics: ACL 2023. 1552–1568.