Pistis-RAG: A Scalable Cascading Framework Towards Content-Centric Retrieval-Augmented Generation

Yu Bai (Zhongguancun Laboratory), Yukai Miao (Zhongguancun Laboratory), Li Chen (Zhongguancun Laboratory), Dan Li (Zhongguancun Laboratory, Tsinghua University), Yanyu Ren (Zhongguancun Laboratory, Tsinghua University), Hongtao Xie (China Mobile Communications Group Co., Ltd.), Ce Yang (China Mobile Communications Group Co., Ltd.), and Xuhui Cai (China Mobile Communications Group Co., Ltd.)

Abstract.

In Greek mythology, Pistis symbolized good faith, trust, and reliability. Drawing inspiration from these principles, Pistis-RAG is a scalable multi-stage framework designed to address the challenges of large-scale retrieval-augmented generation (RAG) systems. This framework consists of distinct stages: matching, pre-ranking, ranking, reasoning, and aggregating. Each stage contributes to narrowing the search space, prioritizing semantically relevant documents, aligning with the large language model’s (LLM) preferences, supporting complex chain-of-thought (CoT) methods, and combining information from multiple sources.

Our ranking stage introduces a significant innovation by recognizing that semantic relevance alone may not lead to improved generation quality, due to the sensitivity of the few-shot prompt order, as noted in previous research (lu2021fantastically, ). This critical aspect is often overlooked in current RAG frameworks.

We argue that the alignment issue between LLMs and external knowledge ranking methods is tied to the model-centric paradigm dominant in RAG systems. We propose a content-centric approach, emphasizing seamless integration between LLMs and external information sources to optimize content transformation for specific tasks.

Our novel ranking stage is designed specifically for RAG systems, incorporating principles of information retrieval while considering the unique business scenarios reflected in LLM preferences and user feedback. We simulated feedback signals on the MMLU benchmark, resulting in a 9.3% performance improvement. Our model and code will be open-sourced on GitHub. Additionally, experiments on real-world, large-scale data validate the scalability of our framework.

1. Introduction

Figure 1. Overview of the RAG framework: from initial bi-encoder encoding, through mixed retrieval, to cross-encoder re-ranking.

In recent years, LLMs have demonstrated remarkable capabilities in natural language understanding and generation tasks. These models, trained on vast datasets that represent a wide range of human knowledge, excel at generating coherent and contextually relevant text across domains (achiam2023gpt, ). Despite these advances, LLMs are inherently limited by the scope and quality of their training data. They often cannot incorporate real-time or domain-specific knowledge absent from their training corpus, which leads to challenges in accuracy and relevance, particularly in dynamic environments.

To mitigate these limitations, RAG (lewis2020retrieval, ) has emerged as a promising paradigm. RAG enhances LLMs by augmenting their input with additional information retrieved from external sources at inference time. This approach not only enriches the context of LLM-generated responses but also enables them to leverage up-to-date and domain-specific knowledge without the need for extensive retraining. Through in-context learning (ICL) over the retrieved knowledge, LLMs can adapt and perform better on tasks that require specialized knowledge or awareness of current events.

Current RAG frameworks face significant challenges that hinder their widespread adoption. These challenges primarily lie in two areas: accurately retrieving relevant knowledge and striking a balance between generation quality and computational efficiency. Additionally, integrating retrieved information with LLM-generated outputs remains a nontrivial task, requiring careful consideration of factors like semantic alignment, prompt engineering, and user preferences (as discussed in (yu2023augmentation, )). Traditional RAG approaches may also overlook the nuances of how LLMs process and prioritize information, leading to suboptimal performance in real-world applications.

This paper tackles key challenges in retrieval-augmented generation by introducing Pistis-RAG, a novel framework that boosts both effectiveness and efficiency. Pistis-RAG leverages a multi-stage retrieval pipeline, consisting of matching, pre-ranking, ranking, and re-ranking stages. Each stage plays a critical role in optimizing how relevant knowledge is retrieved and integrated into LLM-generated responses. Notably, our framework prioritizes a content-centric approach, ensuring retrieved information seamlessly aligns with user needs and LLM capabilities.

Contributions

(1)

Content-Centric Integration: We introduce a content-centric perspective to the end-to-end process in RAG systems, emphasizing the alignment of content delivery with both business objectives and user preferences. This approach surpasses traditional ranking methods by focusing on relevance as well as the contextual and strategic value of the retrieved and presented content.
(2)

Enhanced User Feedback Mechanisms: Our framework effectively leverages user feedback. By continuously incorporating feedback into the ranking process, the system adapts to evolving user preferences and business goals. This dynamic adjustment enhances the system’s ability to deliver personalized and contextually appropriate content. We introduce a specialized ranking mechanism to address the prompt order sensitivity of LLMs, optimizing the sequence of prompts to prioritize the most relevant and informative ones according to LLM preferences.
(3)

Novel Framework: We propose a comprehensive framework for optimizing the end-to-end process in RAG systems. This framework integrates advanced ranking techniques across the matching, pre-ranking, ranking, re-ranking, and multi-path reasoning stages. It ensures a systematic approach to content evaluation and selection, enhancing retrieval accuracy and relevance for end-users.
(4)

Experimental Validation: We conduct extensive experiments demonstrating the effectiveness of our approach, showcasing significant improvements in overall system performance. Our results show that our framework outperforms existing methods by 9.3%, leading to more relevant and personalized results.

In summary, Pistis-RAG represents a significant advancement in the field of retrieval-augmented generation, addressing critical shortcomings in existing frameworks and paving the way for more effective and adaptive AI-generated content (AIGC) systems.

2. Background

Fundamentally, our approach serves large-scale online content generation systems (such as ChatGPT (achiam2023gpt, )), which address the core issue of Open-Domain Question Answering (ODQA) (guu2020retrieval, ).

2.1. Large-Scale Generative AI Systems

Large-scale generative AI systems are designed to handle a high volume of real-time user requests with high accuracy and efficiency. These systems are becoming increasingly important in a wide range of applications, including customer service, chatbots, and social media platforms.

One of the key challenges in developing large-scale generative AI systems is ensuring that the generated content is both relevant and accurate. This is particularly important in applications where users rely on the system for information or assistance. RAG is a promising approach to addressing this challenge. RAG systems use a combination of retrieval and generation techniques to produce more accurate and relevant responses.

Another challenge in developing large-scale generative AI systems is ensuring that they can scale to handle the high volume of requests that they are likely to receive. This requires careful design of the system architecture and algorithms. Additionally, large-scale generative AI systems often need to be trained on massive amounts of data, which can be computationally expensive.

Advanced content generation platforms such as ChatGPT, Gemini (team2023gemini, ), WenXinYiYan (sun2021ernie, ), and TongYiQianWen (bai2023qwen, ), among others, are engineered to manage substantial real-time user queries efficiently. These platforms play a pivotal role in applications requiring rapid response times, precise outputs, and extensive generation capacities.

RAG plays a vital role in these systems by functioning as a powerful external memory source. This significantly reduces hallucinations and enhances responses’ relevance and factual accuracy.

Moreover, these large-scale generative AI systems benefit from real-time user feedback. This feedback loop allows for continuous improvement in the quality and accuracy of content generation over time. Recognizing the need for a robust, scalable solution that addresses these specific challenges and optimizes system performance, we propose the Pistis-RAG framework.

2.2. Open Domain Question Answering

The RAG framework leverages a two-stage Retriever-Generator architecture. This approach is particularly effective for ODQA, a challenging information retrieval task. ODQA questions are often complex, open-ended, and require reasoning over vast amounts of information.

By employing a cascading system derived from traditional IR disciplines, the RAG framework allows for the integration of advanced techniques such as prompt engineering and multi-step reasoning. This cascading system approach is especially beneficial in environments that demand high-performance content generation, such as search engines, recommendation systems, and computational advertising. Utilizing cascade modeling within the Pistis-RAG framework, we aim to achieve substantial enhancements in system performance and output quality, specifically tailored for the demands of ODQA tasks.

In the following sections, we will delve into the details of the Pistis-RAG framework, highlighting its unique architecture and capabilities. We will also present experimental results demonstrating the framework’s effectiveness in various large-scale content generation and ODQA tasks.

3. The Pistis-RAG Framework

This section explores online generative AI tasks through a cascading system, emphasizing a content-centric perspective.

3.1. Content-Centric vs. Model-Centric

This perspective views the process as transforming content from one form to another, rather than a traditional model-centric approach focused on manipulating data within the model. Specifically, the system operates in a series of actions:

•

Content Acquisition: The system first retrieves information from an external knowledge base. This retrieval process is driven by the user’s intent, ensuring the retrieved content aligns with the user’s desired task.
•

Content Transformation and Integration: The system then leverages the retrieved content alongside the knowledge it has learned during pre-training. This combined knowledge base informs the generation of new content, tailored to the user’s needs. This new content could be instructions, summaries, creative text formats, or any other output relevant to the user’s intent.
•

Content Delivery: Finally, the system returns the generated content to the user, completing the online task.

The content-centric perspective highlights the critical role of external information in this process: content from long-term memory becomes the system’s primary input, driving the entire content transformation pipeline.

We argue that the lack of strong alignment between LLMs and the external knowledge ranking methods used in RAG tasks is related to the reliance on the model-centric paradigm in LangChain-like frameworks. A content-centric approach would prioritize seamless integration between the LLMs and external information sources, optimizing the content transformation process for each specific task. This topic will be further explored in later sections of the paper.

Table 1. Overview of Pistis-RAG Framework Stages and Industry Enhancements. The Pistis-RAG framework encompasses stages including Matching, Pre-Ranking, Ranking, Re-Ranking, Reasoning, and Aggregating, each playing a crucial role in the response generation process. This table provides detailed insights into the function of each stage, the techniques employed, the challenges faced, and the industry-specific considerations. Additionally, it includes a column showcasing the number of ranking items at each stage, demonstrating the reduction as the process progresses.

Stage	Item Size	Function	Techniques	Challenges	Industry Setting
Matching	Tens of Thousands	Low-latency retrieval of relevant documents	TF-IDF, BM25, Bi-Encoder, etc.	Ensuring relevance by mixed retrieval methods	Integration with external search engines, etc.
Pre-ranking	Hundreds	Pre-ranking for workload reduction in ranking stage	Cross-Encoder, etc.	Balancing accuracy and relevance	Auto scaling and Failover, etc.
Ranking	Tens	Detailed ranking based on LLM preferences	Listwise LTR methods w/ listwide label optimization	Learning from listwide labels	Auto scaling and Failover, etc.
Re-ranking	Even Fewer	Ranking based on additional criteria such as credibility	Pointwise ranking algorithms for precise control	Depending on user requirements	Ensuring credibility of generated content, etc.
Reasoning	/	Generating reasoned outputs through LLM inference	Multi-Path reasoning, etc.	Maintaining response coherence	Speculative decoding for online performance, etc.
Aggregating	/	Synthesizing coherent user responses from reasoning results	Consistency checks, etc.	Ensuring credibility and readability	Citations, Markdown formatting, Content safety integration, etc.

3.2. Rethinking RAG through a Content-Centric Lens

Adopting a content-centric perspective on RAG tasks, particularly when viewed within the framework of cascading systems, offers valuable insights. This perspective allows us to draw meaningful comparisons between RAG and traditional IR systems. Through this comparison, we can identify potential mismatches between the previously defined stages of both systems and uncover opportunities for introducing new stages that could enhance the overall process.

A content-centric investigation, as exemplified in Figure 2, highlights a key issue: the current RAG "re-ranker" can easily lead to misunderstandings. Its function aligns more closely with the pre-ranking stage.

Moreover, the ranking stage is absent. The ranking stage is designed to align with feedback that reflects the user’s final needs and preferences. Although similar signals can be obtained in online RAG systems, they are not widely utilized in existing popular RAG frameworks, which are predominantly model-centric.

The cascading architecture of Pistis-RAG is structured into six stages, each contributing uniquely to the processing pipeline:

3.2.1. Stage 1: Matching (Filtering)

In the initial stage, the system employs robust retrieval algorithms to navigate through a vast corpus of documents, selecting those most relevant to the user’s query. This stage utilizes a Mixed Retrieval strategy to ensure that the filtering process is both efficient and effective.

3.2.2. Stage 2: Pre-Ranking (Semantic Refinement)

Following the matching phase, this stage refines the scoring by considering the complete document from which each retrieved fragment was drawn. It scores documents based on their semantic correlation with the user’s query, using Cross-Encoder methods to achieve higher accuracy.

3.2.3. Stage 3: Ranking (LLM Alignment)

At this juncture, the process refines the document ranking by aligning with the preferences of the LLM. This adjustment ensures that the most relevant information is positioned advantageously within the prompt templates, optimizing the LLM’s performance.

3.2.4. Stage 4: Re-Ranking (Domain Specific Requirements)

The Re-Ranking phase, although discretionary, plays a pivotal role in domain-specific requirements like assessing the credibility of information sources in official document composition or critical decision-making scenarios.

3.2.5. Stage 5: Reasoning (Multi-Path CoT Strategy)

If documents exhibit low distinctiveness in their semantic content, this stage intervenes by generating multiple response sequences. The LLM concurrently generates answers based on varied retrieval outputs, enhancing content diversity and decision-making in the aggregation stage.

3.2.6. Stage 6: Aggregating (Consistency Checking)

The final stage employs sophisticated voting methods to synthesize the outputs from the Reasoning stage. This method ensures consistency and stability in the final user responses, mitigating potential variability due to the LLM’s sensitivity to input sequences.

3.3. Architecture

The Pistis-RAG system architecture comprises four core components: Matching, Ranking, Reasoning, and Aggregating Services, each meticulously designed to handle specific segments of the query-response cycle efficiently.

Matching Service

The Matching Service is the cornerstone of the user interaction process, responsible for understanding user intent and swiftly retrieving the most relevant information. To achieve this, the Matching Service employs a sophisticated blend of IR techniques, focusing on optimizing latency for online large-scale retrieval.

The Matching Service utilizes various data structures to optimize information retrieval based on the specific technique employed:

•

Vector Storage: This structure is crucial for embedding-based retrieval approaches such as approximate nearest neighbor (ANN) search. It efficiently stores document representations as vectors, allowing for fast similarity comparisons with the user’s query vector. Example: The Matching Service can represent documents as vectors in a high-dimensional space, enabling it to quickly identify documents that are semantically similar to the user’s query.
•

Inverted Index: A fundamental data structure for keyword-based retrieval. It allows for rapid identification of documents containing specific keywords present in the user’s query. Example: When a user searches for a specific term, the inverted index efficiently directs the Matching Service to the documents that contain that term.
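The two data structures above can be illustrated with a minimal Python sketch. It is not the Pistis-RAG implementation: the function names, the whitespace tokenizer, and the two-dimensional toy vectors are assumptions made for illustration, and the vector search is an exact brute-force scan standing in for a real ANN index.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def keyword_lookup(index, query):
    """Return ids of documents containing at least one query token."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return hits

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(vectors, query_vec, k):
    """Exact top-k by cosine similarity; a stand-in for an ANN index."""
    ranked = sorted(vectors, key=lambda d: cosine(vectors[d], query_vec), reverse=True)
    return ranked[:k]

# Toy corpus and hypothetical 2-d embeddings.
docs = {
    "d1": "retrieval augmented generation",
    "d2": "speculative decoding for LLM inference",
    "d3": "dense retrieval with bi-encoders",
}
vectors = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.8, 0.3]}
index = build_inverted_index(docs)
keyword_hits = keyword_lookup(index, "retrieval")
vector_hits = vector_search(vectors, [1.0, 0.0], k=2)
```

In a production setting the brute-force scan would be replaced by an ANN index, trading exactness for latency.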

An in-memory key-value (K-V) cache also plays a critical role in stateful services. It can be used for:

•

Maintaining User Conversation Session: Storing the context of a user’s ongoing conversation allows for a more coherent and personalized user experience. Example: The Matching Service can use the K-V cache to remember the user’s previous queries and interactions, enabling it to provide more tailored responses.
•

User Prompt-Answer Pair Cache: Storing previously accessed responses to commonly asked questions or user prompts can substantially enhance response times. Moreover, it serves as an effective few-shot example for RAG. For instance, when a user poses a recurring question, the Matching Service can save the corresponding answer in the key-value cache, thereby obviating the necessity to conduct a fresh search on each occasion.
•

User-Relevant Session History: By storing high-quality content relevant to the user’s past interactions, the Matching Service can prioritize its retrieval for future queries, enhancing the user experience. Example: If a user has previously shown interest in a particular topic, the Matching Service can prioritize content related to that topic when responding to future queries.
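As a rough illustration of the prompt-answer pair cache described above, the following sketch uses a least-recently-used eviction policy. The class name `PromptAnswerCache`, the prompt normalization, and the LRU policy are assumptions; the paper does not specify how its cache is keyed or evicted.

```python
from collections import OrderedDict

class PromptAnswerCache:
    """Least-recently-used cache keyed by a normalized prompt string."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def _key(prompt):
        # Assumption: prompts that differ only in case or spacing are equivalent.
        return " ".join(prompt.lower().split())

    def get(self, prompt):
        """Return the cached answer, or None on a miss."""
        key = self._key(prompt)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, prompt, answer):
        """Insert or refresh an entry, evicting the oldest one when full."""
        key = self._key(prompt)
        self._store[key] = answer
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)

cache = PromptAnswerCache(capacity=2)
cache.put("What is RAG?", "Retrieval-augmented generation.")
cache.put("What is BM25?", "A lexical ranking function.")
cache.get("what is  rag?")  # hit; refreshes this entry
cache.put("What is ANN?", "Approximate nearest neighbor search.")
```

With a capacity of two, the final `put` evicts the BM25 entry because the RAG entry was touched more recently.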

In large-scale industrial settings, the Matching Service might also integrate with external search engines to access a vast corpus of information. However, this approach typically comes with increased latency due to network communication overhead.

The choice of retrieval technique and data structure depends on the specific requirements of the application. For instance, if higher accuracy is desired, ANN search can be combined with inverted-index methods such as BM25 and TF-IDF. It is important to carefully consider the trade-offs between speed and accuracy when selecting the appropriate techniques.
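One common way to combine a lexical ranking with an embedding-based ranking is reciprocal rank fusion (RRF). The sketch below uses RRF purely as an illustrative fusion rule; the paper does not state which fusion method its mixed retrieval uses, and the constant `k=60` is a conventional default rather than a value from the source.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["d1", "d2", "d3"]  # hypothetical lexical result
ann_ranking = ["d3", "d1", "d4"]   # hypothetical embedding result
fused = reciprocal_rank_fusion([bm25_ranking, ann_ranking])
```

Documents that appear high in both lists (here `d1` and `d3`) rise above documents found by only one retriever.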

Additionally, it is important to acknowledge that the Matching Service has its limitations. For example, it may struggle with ambiguous or complex queries.

Ranking Service

The ranking service optimizes the information retrieval process by prioritizing items relevant to the user’s intent. It takes a set of relevant items, denoted by $\mathcal{I}$, identified by the matching service, and outputs a prioritized list, denoted by $\mathcal{O}$. This prioritization considers both the retrieved items and the user’s intent representation.

The ranking service employs several techniques to achieve this. Below are the key stages and processes formally defined:

(1)

Stage 1: Pre-Ranking

The Pre-Ranking stage serves as an initial filtering mechanism aimed at streamlining the subsequent ranking process. It employs a coarse ranking mechanism, typically denoted by the function $f_{\text{pre}}$, which assesses preliminary relevance scores for each item in the set $\mathcal{I}$. These scores are generated based on the user’s intent representation $u$.

The preliminary relevance scores are computed as follows:

P=\{(I_{j},P_{j})\mid I_{j}\in\mathcal{I},P_{j}=f_{\text{pre}}(I_{j},u)\}

Here, $P$ represents a set of pairs where $I_{j}$ is an item from $\mathcal{I}$ and $P_{j}$ denotes its corresponding preliminary relevance score.

(2)

Stage 2: Ranking

The Ranking stage focuses on refining the relevance assessment process by scoring each item in the pre-ranked subset $S_{\text{pre}}$. This is achieved through the application of a ranking function, often denoted as $f_{\text{rank}}$, which takes into account both the features of the items and the user’s intent representation $u$.

The relevance scores for the pre-ranked subset are computed as follows:

R=\{(S_{\text{pre}_{j}},R_{j})\mid S_{\text{pre}_{j}}\in S_{\text{pre}},R_{j}=f_{\text{rank}}(S_{\text{pre}_{j}},u)\}

Here, $R$ represents a set of pairs where $S_{\text{pre}_{j}}$ is an item from the pre-ranked subset $S_{\text{pre}}$ and $R_{j}$ denotes its corresponding relevance score.

(3)

Stage 3: Re-Ranking

The Re-Ranking phase, although discretionary, plays a pivotal role in certain contexts, such as official document composition or critical decision-making scenarios. This step involves isolating a subset $S_{\text{re}}$ from the previously ranked set $S_{\text{rank}}$ and applying a re-ranking function, commonly denoted as $f_{\text{re}}$.

The re-ranking process adjusts the relevance scores of items in $S_{\text{re}}$ based on credibility scores $s_{k}$ associated with each item. The final output of the Re-Ranking stage, denoted as $O$, is obtained by sorting the re-ranked subset in descending order of relevance scores.

O=\{(S_{\text{re}_{k}},O_{k})\mid S_{\text{re}_{k}}\in S_{\text{re}},O_{k}=f_{\text{re}}(S_{\text{re}_{k}},s_{k})\}

Here, $O$ represents a set of pairs where $S_{\text{re}_{k}}$ is an item from the re-ranked subset $S_{\text{re}}$ and $O_{k}$ denotes its corresponding relevance score after re-ranking.
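The three stages defined above compose into a single cascade, sketched below in Python. Several details are assumptions rather than statements from the paper: each subset ($S_{\text{pre}}$, $S_{\text{re}}$) is taken to be the top-k of the previous stage, the scoring functions are toy token-overlap stand-ins for a cross-encoder and a learned ranker, and `f_re` is given the item's relevance score together with its credibility $s_k$ so that it can adjust that score.

```python
def top_k(scored, k):
    """Sort (item, score) pairs by descending score and keep the first k."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def cascade(items, u, f_pre, f_rank, f_re, credibility, k_pre, k_rank):
    # Stage 1: P = {(I_j, f_pre(I_j, u))}; S_pre assumed to be the top k_pre items.
    P = top_k([(item, f_pre(item, u)) for item in items], k_pre)
    S_pre = [item for item, _ in P]
    # Stage 2: R = {(S_pre_j, f_rank(S_pre_j, u))}; S_re assumed to be the top k_rank.
    R = top_k([(item, f_rank(item, u)) for item in S_pre], k_rank)
    # Stage 3: adjust each relevance score by the item's credibility s_k, then sort.
    O = [(item, f_re(score, credibility.get(item, 0.0))) for item, score in R]
    return top_k(O, len(O))

def overlap(item, u):
    """Toy coarse scorer: number of tokens shared with the intent."""
    return len(set(item.split()) & set(u.split()))

def density(item, u):
    """Toy fine scorer: shared tokens divided by item length."""
    return overlap(item, u) / len(item.split())

def scale_by_credibility(relevance, s_k):
    """Hypothetical f_re: multiply relevance by the credibility score."""
    return relevance * s_k

items = ["rag ranking survey", "rag ranking", "cooking pasta",
         "rag blog post on ranking tricks"]
credibility = {"rag ranking survey": 0.9, "rag ranking": 0.5}
O = cascade(items, "rag ranking", overlap, density, scale_by_credibility,
            credibility, k_pre=3, k_rank=2)
```

In this toy run the shorter item wins the Ranking stage, but the more credible survey overtakes it after Re-Ranking, which is the kind of adjustment the credibility scores are meant to allow.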

At the core of the Ranking Service lies a commitment to delivering information with utmost accuracy and relevance. Advanced relevance assessment algorithms, such as bge-reranker-base/large, are utilized to identify items that are semantically aligned with the user’s query. Additionally, factors like factual accuracy, comprehensiveness, and novelty are carefully considered during the ranking process to ensure that the most informative and reliable items are surfaced.

By prioritizing accuracy and relevance, the Ranking Service enhances the clarity, comprehensiveness, and effectiveness of the information retrieval process, thereby providing users with a more valuable and trustworthy experience.

Reasoning Service

The Reasoning Service enhances reasoning with parallel inference and expert routing. It leverages an advanced Chain of Thought (wei2022chain, ) strategy to execute parallel inference across LLMs, generating diverse results for subsequent aggregation. This service takes prompts and retrieved information as input, producing a series of reasoned outputs. By harnessing LLMs, it extracts insights, draws conclusions, and answers inquiries based on the provided data. The outcome is a set of reasoned responses that comprehensively address the user’s intent.

Concurrent multi-step inference feeds into the aggregation stage, enabling consistency checks like self-consistency and others. Additionally, industry-specific decoding strategies, such as speculative decoding, can be applied for further refinement. This tailors the inference process to the specific domain, enhancing the relevance and accuracy of generated responses.

Furthermore, the service routes issues to the most suitable expert model based on the problem nature. By simultaneously invoking multiple LLMs with varying specializations, it leverages their diversity to improve question-answering effectiveness.
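The concurrent multi-path inference described in this subsection can be sketched with a thread pool. This is an illustration under stated assumptions: `llm` is any callable mapping a prompt string to an answer string, `fake_llm` is a stub replacing a real model call, and the prompt layout in `format_prompt` is hypothetical. Expert routing is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

def format_prompt(context, question):
    """Hypothetical prompt layout: context lines first, then the question."""
    return "\n".join(context) + "\n\nQuestion: " + question

def multi_path_reasoning(question, context_variants, llm, max_workers=4):
    """Run one LLM call per context ordering in parallel; results keep input order."""
    prompts = [format_prompt(ctx, question) for ctx in context_variants]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(llm, prompts))

def fake_llm(prompt):
    """Stub model: echoes the first context line so the effect of ordering is visible."""
    return "answer derived from: " + prompt.splitlines()[0]

# Two orderings of the same retrieved documents yield two reasoning paths.
variants = [["doc A", "doc B"], ["doc B", "doc A"]]
paths = multi_path_reasoning("What is RAG?", variants, fake_llm)
```

Threads suit this sketch because real LLM calls are typically I/O-bound network requests; the resulting list of paths is what an aggregation step would consume.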

Aggregating Service

The Aggregating Service plays a critical role in transforming reasoning outputs into clear, user-centric responses. It takes a set of reasoning outcomes and crafts structured answers tailored to the user’s initial query. This involves organizing information logically, ensuring clarity and conciseness, and maintaining user engagement.

Merging Concurrent Inference Results: This service seamlessly combines outcomes from concurrent inference processes, guaranteeing coherence and consistency. Self-consistency (wang2022self, ) checks and similar techniques are employed to validate and harmonize the aggregated results.
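A simple form of the self-consistency check is a majority vote over the final answers of the concurrent reasoning paths. The sketch below assumes answers are short strings (for example, multiple-choice letters) that can be compared after normalization; the function name and the agreement ratio it returns are illustrative choices, not details from the paper.

```python
from collections import Counter

def aggregate_by_vote(answers):
    """Return (majority answer, agreement ratio) over non-empty normalized answers."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None, 0.0
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)

# Four hypothetical reasoning paths answering a multiple-choice question.
winner, agreement = aggregate_by_vote(["B", "b ", "A", "B"])
```

A low agreement ratio could serve as a signal that the paths disagree and the response deserves extra scrutiny, although the paper does not describe such a threshold.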

Industry-Specific Optimizations: To enhance the aggregation process for industry settings, the Aggregating Service incorporates several key elements:

•

Citation and Transparency: Credibility is bolstered by incorporating citations (li2024citation, ) to trustworthy sources within industry contexts. This may involve referencing established sources and providing transparency regarding data origin. Additionally, the reasoning process, such as the Chain of Thought and decision-making steps, can be showcased to provide deeper insights.
•

Tailored Formatting: Readability and visual appeal are improved through the application of industry-standard formatting techniques, like Markdown. Adhering to formatting conventions ensures consistency with established norms, facilitating user comprehension.
•

Content Safety Integration: In safety-critical settings, content safety checks are incorporated to filter out potentially harmful or inappropriate content. Algorithms and protocols are implemented to screen aggregated information, guaranteeing compliance with industry safety standards and regulations.

By integrating these components, the Aggregating Service goes beyond consolidating reasoning results. It elevates the quality, trustworthiness, and safety of the final user response, ultimately delivering a smooth and enriching user experience.

4. Ranking Details

In this section, we outline the ranking process, differentiating it from the commonly discussed concept of rerankers in RAG discourse. The ranking stage in information retrieval systems plays a crucial role in aligning content delivery with business objectives and user preferences. It goes beyond the semantic relevance scoring performed by rerankers, encompassing a broader range of tasks essential for optimizing the retrieval and presentation of information.

4.1. RAG "Re-Rankers" in the Pre-Ranking Stage

While terms like bge-reranker are commonly used for components within RAG systems, they can be misleading. Information Retrieval systems like search engines employ re-ranking models in a later stage to refine results based on specific business needs.

In RAG, however, these models are used upfront during the pre-ranking stage to narrow down candidate documents. Therefore, a more accurate term for this function would be "pre-ranker."

4.2. Ranking for LLM Prompt Order Sensitivity

LLMs are known for their impressive capabilities, but their performance can be heavily influenced by the order in which prompts are presented (prompt order sensitivity). This is particularly relevant in the RAG system, where ranking plays a crucial role. If the order of prompt examples is not carefully considered, the LLM might generate responses that lack coherence or deviate from the user’s intent. The order of prompts can shape the LLM’s understanding of the task and the user’s expectations. If prompts are not presented logically, the LLM might struggle to generate an output that aligns with the user’s desired outcome.

Existing rule-based studies, such as those that focus on cue word placement, suggest that placing examples at the beginning or end may be more effective. However, we assume that this preference cannot be determined by rules alone; rather, it requires a statistical learning model to capture this preference distribution accurately.

The RAG system addresses prompt order sensitivity by employing a ranking mechanism. This mechanism ensures that the most relevant and informative prompt examples are presented to the LLM first. This alignment between the ranking and the ideal order of prompts leads to the following benefits:

•

Improved Coherence: By prioritizing relevant information, the LLM is more likely to produce coherent and consistent outputs that align with the user’s intent.
•

Enhanced User Experience: When the LLM generates responses that meet user expectations, the overall user experience is significantly improved.

We can ensure that LLMs leverage their potential to generate informative, coherent, and user-centric outputs by addressing prompt order sensitivity.
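To make the link between ranking and prompt order concrete, the sketch below assembles a prompt in which few-shot examples appear in descending order of their ranking score. The template text, the function name `build_prompt`, and the cut-off `max_examples` are assumptions for illustration; a learned ranker could place items differently from a plain descending sort.

```python
def build_prompt(question, scored_examples, max_examples=3):
    """Place the highest-scored examples first, then append the question."""
    ordered = sorted(scored_examples, key=lambda pair: pair[1], reverse=True)
    blocks = [
        f"Example {i}:\n{text}"
        for i, (text, _) in enumerate(ordered[:max_examples], start=1)
    ]
    return "\n\n".join(blocks + [f"Question: {question}"])

# Hypothetical (example text, ranking score) pairs from the ranking stage.
scored = [("low relevance snippet", 0.2), ("high relevance snippet", 0.9),
          ("medium relevance snippet", 0.5)]
prompt = build_prompt("What is prompt order sensitivity?", scored, max_examples=2)
```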

4.3. Ranking Problem Definition

The ranking problem in online systems involves improving the order of items presented to users based on various factors beyond just relevance. This can be achieved by using user feedback to understand their preferences for the content.

User feedback can be obtained through actions such as copying the content (indicating a strong preference), regenerating (suggesting a mild preference for an alternative), or disliking the content (clear disapproval). By analyzing these feedback labels, we aim to learn a ranking model that optimizes the order of items within a list to better match user preferences.

Formally, let us denote our dataset as $\mathcal{D}$, containing a set of examples $\{(p_{i},x_{i},y_{i},f_{i})\}$, where:

•

$p_{i}$ represents the user’s intent.
•

$x_{i}$ represents the ordered few-shot example list.
•

$y_{i}$ represents the end-to-end generated output.
•

$f_{i}$ represents the specific feedback label (copying, regenerating, disliking) associated with the $i$-th example.
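One way to hold a record $(p_{i},x_{i},y_{i},f_{i})$ in code is sketched below. The field names follow the symbols above; the numeric values in `LISTWIDE_LABEL` are a hypothetical mapping chosen only to preserve the stated ordering of the feedback signals (copying above regenerating above disliking) and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Feedback(Enum):
    COPY = "copy"              # strong preference
    REGENERATE = "regenerate"  # mild preference for an alternative
    DISLIKE = "dislike"        # clear disapproval

# Hypothetical list-level targets; only their ordering reflects the text.
LISTWIDE_LABEL = {Feedback.COPY: 1.0, Feedback.REGENERATE: 0.25, Feedback.DISLIKE: 0.0}

@dataclass
class Example:
    p: str        # user intent
    x: List[str]  # ordered few-shot example list
    y: str        # end-to-end generated output
    f: Feedback   # feedback label for the whole interaction

def listwide_label(example):
    """Map an example's feedback to a single label for the whole list x."""
    return LISTWIDE_LABEL[example.f]

record = Example(
    p="explain BM25",
    x=["doc on TF-IDF", "doc on BM25"],
    y="BM25 is a lexical ranking function.",
    f=Feedback.COPY,
)
```

The label attaches to the ordered list as a whole, not to any individual item in it.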

The objective of the RAG ranking stages is to develop a ranking model that utilizes these feedback signals to optimize the ranking sequences produced by the large language model, ultimately enhancing the user experience and satisfaction with the system.

4.4. Listwide Labels

This part delves into the complexities of learning from implicit user feedback in listwise Learning to Rank (LTR) methods. Specifically, we focus on scenarios where explicit relevance labels are unavailable for the individual items in a list.

Our primary focus is on using indirect signals derived from user feedback, which reflect how the ranking of content impacts the results produced by LLMs. These indirect signals are termed Listwide Labels.

Training ranking models using Listwide Labels poses significant challenges. A key issue is the potential oversight of critical insights into the overall ranking quality. Therefore, developing effective strategies to utilize Listwide Labels is essential for enhancing ranking model performance.

4.5. Ranking Model Alignment

To address the ranking problem in online systems, we employ the listwise Transformer to combine the listwise LTR objective with a listwide objective, where the overall quality of the list is explicitly modeled.
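The paper does not give the exact form of the combined objective. As one plausible reading, and only as an assumption, it can be written as a weighted sum in which the listwide term is a binary cross-entropy between a list-level label $\ell_{i}\in[0,1]$ derived from the feedback $f_{i}$ and the model's predicted list quality $\hat{q}_{\theta}(p_{i},x_{i})$. The symbols $\lambda$, $\ell_{i}$, and $\hat{q}_{\theta}$ are introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{listwise}}(\theta)
  + \lambda \, \mathcal{L}_{\text{listwide}}(\theta),
\qquad
\mathcal{L}_{\text{listwide}}(\theta)
  = -\sum_{i}\Big[\ell_{i}\log \hat{q}_{\theta}(p_{i},x_{i})
  + (1-\ell_{i})\log\big(1-\hat{q}_{\theta}(p_{i},x_{i})\big)\Big]
```

Here $\lambda \geq 0$ balances the two terms, and $\mathcal{L}_{\text{listwise}}$ stands for whichever listwise LTR loss is used over the items within each list.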

For effective utilization of listwide labels, we initially collect user feedback data, encompassing actions such as copying, regenerating, disliking, etc., to serve as the ground truth for evaluating the efficacy of the content generation by the large language model.

The core ranking process employs advanced algorithms to meticulously assess each item’s quality and relevance. These algorithms take a set $I$ of items to be ranked as input and produce an ordered set $O$ based on the ranking function $f$ :

$O=f(I)$

In the industry setting, we formulate the ranking task as a learning-to-rank problem based on listwide signals. This involves training a ranking model using supervised learning techniques to learn a ranking function that sorts the content generated by the large language model into the desired order. Our model incorporates a diverse range of features, including:

•

Semantic similarity between generated content and user intent $\text{sim}(c,\text{intent})$
•

Relevance of generated content to the prompt $\text{rel}(c,\text{prompt})$
•

Novelty of generated content $\text{nov}(c)$
•

User engagement metrics $\text{eng}(c)$

In addition, we integrate techniques for managing user feedback dynamics and evolving user preferences over time, ensuring that our ranking model remains adaptive and responsive to changes in user behavior.

Our methodology aims to optimize the ranking of content produced by the LLMs in alignment with user preferences and business objectives, thereby enhancing the overall user experience and utility.
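To make the mapping $O=f(I)$ concrete, the sketch below ranks items by a weighted sum of the four features listed above. The linear scorer is a deliberately simple stand-in for the listwise Transformer described in this section, and the weights, the `Item` class, and the toy feature values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    name: str
    sim: float  # sim(c, intent): semantic similarity to the user intent
    rel: float  # rel(c, prompt): relevance to the prompt
    nov: float  # nov(c): novelty
    eng: float  # eng(c): user engagement


# Hypothetical weights; a trained model would learn the scoring function.
WEIGHTS: Dict[str, float] = {"sim": 0.4, "rel": 0.3, "nov": 0.1, "eng": 0.2}


def score(item: Item, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the item's features."""
    return sum(w * getattr(item, feature) for feature, w in weights.items())


def f(items: List[Item]) -> List[Item]:
    """Ranking function: returns the ordered set O for the input set I."""
    return sorted(items, key=score, reverse=True)


items = [
    Item("a", sim=0.9, rel=0.8, nov=0.1, eng=0.5),
    Item("b", sim=0.2, rel=0.3, nov=0.9, eng=0.1),
    Item("c", sim=0.6, rel=0.6, nov=0.5, eng=0.9),
]
ordered = [item.name for item in f(items)]
```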

Table 2. Ablation Study Results on MMLU. Evaluated configurations: (1) Baseline without enhancements; (2) Full Pistis-RAG with all components; (3) Without Ranking Stage Feedback Label Integration; (4) Without Multi-Path Reasoning and Aggregating.

Paradigm	Component	F1-Score
Content-Centric	Full Pistis-RAG	54.65%
Content-Centric	Without Ranking Stage (LTR Transformer w/ Feedback Label Integration)	52.3%
Content-Centric	Without Reasoning and Aggregating Stage (Multi-Path Reasoning and Consistency Check)	52.8%
Model-Centric	Baseline (Matching: BAAI/bge-m3 + Pre-Ranking: BAAI/bge-reranker-large)	50.0%

5. Experiments on Public Datasets

Our experimental analysis uses the public MMLU benchmark (hendrycks2020measuring, ), which comprises a substantial volume of multiple-choice questions, each with an explicit answer assigned by human annotators.

However, our primary focus lies in learning from feedback labels, which are more abundant and reflective of real-world scenarios. Hence, we simulate feedback labels to mirror the realistic user behavior observed in various industries. This simulation approach is detailed in the subsequent sections.

5.1. Simulating Feedback

Online platforms, such as ChatGPT, continuously enhance their performance through user feedback mechanisms such as copying, regenerating, and disliking responses. Our simulation process aims to align our system with this prevalent scenario.

5.1.1. Collecting Few-shot Examples

Within our system architecture, we add highly rated user feedback to the in-context learning (ICL) example database used by RAG. For the simulation, we use question-answer pairs sourced from public datasets as supplementary ICL examples $D$ . To keep the evaluation impartial, we filter out retrieved examples whose question is identical to the query (in the live online system, such cases are instead cached for future retrieval). This setup simulates learning from an external knowledge base and lets the system benefit from the augmented retrieval data.

5.1.2. Simulating User Behavior

We refine user behavior data from open source datasets to simulate user feedback through the following steps:

(1)

Leveraging Retrieval-Augmented Generation: Our system employs RAG to augment response generation by retrieving relevant information from a vast dataset $D$ before generating an answer. This process, denoted as $RAG(D)$ , aims to establish the correlation between RAG ranking methods and generated responses.
(2)

Extracting Information with Regular Expressions: Following the generation stage, we employ regular expressions to extract specific information $X$ from the generated texts $Y$ . This operation is represented as $P(Y)=X$ , where $P$ represents the regex parsing function.
(3)

Assigning Labels Based on Correctness: The final step involves assigning labels to the generated outputs based on their accuracy relative to the expected answers. We define the labeling function $L(y)$ for the generated output $y$ in comparison to the expected answer.

This feedback loop is fundamental to continuously enhancing the model’s accuracy and aligning it with user preferences. Here’s the breakdown of the labels:

•

Correct ( $L(y)=Positive$ ): Indicates that the generated output $y$ matches the set of correct answers $Y_{\text{correct}}$ .
•

Incorrect ( $L(y)=Even$ ): Signals that the generated output $y$ falls within the set of incorrect answers $Y_{\text{incorrect}}$ .
•

No Answer ( $L(y)=Negative$ ): Represents outputs lacking an answer and belonging to the set of nonresponses $Y_{\text{no\_answer}}$ .

This formalization illustrates a continuous learning system in which feedback from real user interactions refines and enhances the model, aligning its output more closely with user expectations and real-world applications. Specifically, the correct answers ( $Y_{\text{correct}}$ ) resemble text copying, the incorrect answers ( $Y_{\text{incorrect}}$ ) resemble regeneration, and the lack of answers ( $Y_{\text{no\_answer}}$ ) correspond to negative user feedback.
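The parsing function $P$ and the labeling function $L$ can be sketched as follows. The regular expression, the restriction to answer letters A to D, and the function names `parse_answer` and `label` are assumptions for illustration; the text does not specify the patterns actually used.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "answer is (B)", "Answer: c", and similar phrasings.
ANSWER_PATTERN = re.compile(r"answer\s*(?:is|:)?\s*\(?([ABCD])\b", re.IGNORECASE)


def parse_answer(generated: str) -> Optional[str]:
    """P(Y) = X: extract the chosen option letter, or None if no answer is found."""
    match = ANSWER_PATTERN.search(generated)
    return match.group(1).upper() if match else None


def label(generated: str, expected: str) -> str:
    """L(y): map a generated output to a simulated feedback label."""
    answer = parse_answer(generated)
    if answer is None:
        return "Negative"  # no answer, resembles disliking
    if answer == expected.upper():
        return "Positive"  # correct, resembles copying
    return "Even"          # incorrect, resembles regenerating
```

For example, `label("The answer is (B).", "B")` yields `"Positive"`, while an output with no parsable option letter yields `"Negative"`.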

5.2. Datasets

In assessing the effectiveness of the Pistis-RAG framework, it is essential to leverage diverse datasets that offer a comprehensive evaluation of its capabilities. MMLU is an invaluable resource, providing insights into the framework’s adaptability and performance across various domains.

MMLU (hendrycks2020measuring, ) is a multiple-choice benchmark covering 57 subjects across STEM, the humanities, the social sciences, and other areas. It is widely used to measure the knowledge and problem-solving ability of language models.

With a total of 15,908 questions, MMLU provides a robust benchmark for evaluating language models’ performance across various tasks. Its subjects include medical topics, which makes it suitable for assessing the Pistis-RAG framework’s ability to handle complex medical queries as well as complex queries in other domains.

5.3. Experimental Setup

Our experimental setup is meticulously designed to optimize the performance of our models in both retrieval and response generation. Below are the key components of our setup:

To evaluate the effectiveness of our proposed method, we conducted experiments on the MMLU dataset. We first built an index of the MMLU training set in Milvus, then used BGE-M3 embeddings to retrieve the top 10 candidate few-shot examples. Next, we used bge-reranker-large to pre-rank the candidates and keep the top five. Finally, we used these five examples to generate 5-shot results as the baseline.

Matching: We used BGE-M3 (BAAI/bge-m3) to retrieve the top 10 candidate few-shot examples from Milvus. BGE-M3 is an embedding model that maps text to dense vectors; Milvus is a vector database that stores these vectors and performs the similarity search.

Pre-Ranking: We used bge-reranker-large (BAAI/bge-reranker-large) to pre-rank the candidate few-shot examples. It is a cross-encoder reranker that scores each query-candidate pair to improve the ordering of the candidates.

Generation: We used the top five candidate few-shots to generate 5-shot results. We used Llama-2-13B-chat as a generation model to generate text from the few-shot prompts. Llama-2 (touvron2023llama, ) leverages a sophisticated transformer architecture, enabling advanced capabilities in natural language understanding and generation. It excels in its depth of contextual comprehension, allowing it to generate responses that are coherent and contextually aligned across various NLP tasks. Extensive benchmarking has demonstrated Llama-2’s ability to handle complex queries with nuanced understanding and high precision.
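The baseline pipeline described above (match the top 10, pre-rank to the top five, build a 5-shot prompt) can be sketched end to end. Everything below is a toy stand-in written in plain Python: `embed` uses bag-of-words counts instead of the embedding model, `pre_rank` uses token overlap instead of the reranker, and the in-memory list replaces the vector database. The function names and the prompt template are assumptions, and the final call to the generation model is left out.

```python
import math
from collections import Counter
from typing import List, Tuple

QA = Tuple[str, str]  # (question, answer)


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match(query: str, corpus: List[QA], k: int = 10) -> List[QA]:
    """Matching: top-k by similarity, skipping examples identical to the query."""
    q = embed(query)
    candidates = [qa for qa in corpus if qa[0] != query]
    return sorted(candidates, key=lambda qa: cosine(q, embed(qa[0])), reverse=True)[:k]


def pre_rank(query: str, candidates: List[QA], k: int = 5) -> List[QA]:
    """Pre-ranking: toy stand-in for a cross-encoder, using Jaccard token overlap."""
    q = set(query.lower().split())

    def overlap(qa: QA) -> float:
        d = set(qa[0].lower().split())
        return len(q & d) / len(q | d) if q | d else 0.0

    return sorted(candidates, key=overlap, reverse=True)[:k]


def build_prompt(query: str, shots: List[QA]) -> str:
    """Assemble a few-shot prompt (hypothetical template)."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


query = "what is 3 plus 4"
corpus = [(f"what is {i} plus {i}", str(2 * i)) for i in range(12)] + [(query, "7")]
top10 = match(query, corpus)
top5 = pre_rank(query, top10)
prompt = build_prompt(query, top5)
```

The order of `top5` is exactly what the ranking stage later reorders, since the generation model is sensitive to the order of the few-shot examples in the prompt.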

5.3.1. Evaluation Procedure

The performance of our models is evaluated using the MMLU test set. We compute several metrics to assess the models’ effectiveness in generating accurate responses:

•

Precision: The ratio of true positive predictions to the total number of positive predictions made by the model:

$P=\frac{TP}{TP+FP}$
•

Recall: The ratio of true positive predictions to the total number of actual positive instances in the dataset:

$R=\frac{TP}{TP+FN}$
•

F1-score: The harmonic mean of precision and recall, providing a balanced measure of the model’s performance:

$F1=2\cdot\frac{P\cdot R}{P+R}$

These metrics offer a comprehensive evaluation of the model’s ability to generate accurate and effective end-to-end results.
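The three formulas above translate directly into code. The sketch below assumes the counts $TP$, $FP$, and $FN$ have already been obtained; how they are derived from multiple-choice predictions is not specified in the text, and the function name and the zero-division convention are assumptions.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute P, R, and F1 from raw counts, returning 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts, invented for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```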

Through this meticulously designed experimental setup, we aim to provide a thorough evaluation of the effectiveness and robustness of our Pistis-RAG framework in handling diverse language understanding and generation tasks.

5.3.2. Experimental Results

We evaluated the effectiveness of different components within the Pistis-RAG framework through an ablation study. This approach systematically removes or modifies key components to assess their impact on performance metrics, focusing on the MMLU dataset.

5.3.3. Analysis and Discussion

The ablation study results summarized in Table 2 offer valuable insights:

•

Importance of Feedback Label Integration: Removing the ranking stage with feedback label integration lowered the F1-score from 54.65% to 52.3% (a drop of 2.35 points), underlining the role of user feedback in model improvement.
•

Multi-Path Reasoning Impact: Removing the reasoning and aggregating stage lowered the F1-score from 54.65% to 52.8% (a drop of 1.85 points), showing that multi-path reasoning also contributes to end-to-end quality.

In conclusion, the ablation study confirms that these components play a critical role in the Pistis-RAG framework’s performance.
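The gains implied by Table 2 can be checked with a few lines of arithmetic. The F1 values below are copied from the table; the dictionary keys and helper names are illustrative.

```python
# F1-scores (in percent) from Table 2.
F1 = {
    "full": 54.65,
    "no_ranking": 52.3,
    "no_reasoning": 52.8,
    "baseline": 50.0,
}


def absolute_gain(new: float, old: float) -> float:
    """Difference in percentage points."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Relative improvement over the old value, in percent."""
    return (new - old) / old * 100.0


ranking_contribution = absolute_gain(F1["full"], F1["no_ranking"])
reasoning_contribution = absolute_gain(F1["full"], F1["no_reasoning"])
overall_relative = relative_gain(F1["full"], F1["baseline"])

print(f"ranking stage: {ranking_contribution:.2f} points")
print(f"reasoning and aggregating stage: {reasoning_contribution:.2f} points")
print(f"full system vs. baseline: {overall_relative:.1f}% relative")
```

The last figure, 4.65 points over a 50.0% baseline, corresponds to the 9.3% improvement reported in the abstract.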

6. Conclusions

This study highlights cascade modeling and optimization as critical areas for robust large-scale online AI-generated content (AIGC) systems.

We revisited the Retrieval-Augmented Generation problem through a content-centric lens, uncovering potential shortcomings in large-scale deployments. Firstly, a mismatch exists between the intended and actual function of the "re-ranker" component, suggesting it acts more like a pre-ranker. We addressed this by proposing distinct ranking and re-ranking stages within RAG and a listwide ranking approach. Furthermore, we introduced reasoning and aggregation stages specific to AIGC services, paving the way for a more comprehensive and high-performing system.

Our work emphasizes the importance of effective ranking and aggregation mechanisms for enhancing AIGC system accuracy and reliability. The engineering techniques discussed here are crucial for ensuring robustness and resilience, enabling the system to handle various real-world challenges while maintaining high performance. However, a significant gap exists in publicly available methods for building practical, high-performance, and robust online AIGC systems. This highlights a compelling opportunity for further research and development in this field.

In conclusion, our findings advocate for continuous innovation in cascade modeling, ranking, aggregation, and engineering techniques to propel the capabilities of online AIGC systems forward.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[3] Harrison Chase. LangChain, October 2022.
[4] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.
[5] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[6] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[7] Weitao Li, Junkai Li, Weizhi Ma, and Yang Liu. Citation-enhanced generation for llm-based chatbot. arXiv preprint arXiv:2402.16063, 2024.
[8] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.
[9] Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, et al. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021.
[10] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[11] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[12] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[13] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[14] Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331, 2023.