National University of Kyiv-Mohyla Academy
Email: {mykhailo.poliakov, n.shvay}@ukma.edu.ua

Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata

Mykhailo Poliakov (ORCID: 0009-0006-5263-762X), Nadiya Shvai (ORCID: 0000-0001-8194-6196)

Abstract

Retrieval-augmented generation (RAG) enables retrieval of relevant information from an external knowledge source and allows large language models (LLMs) to answer queries over previously unseen document collections. However, traditional RAG applications have been shown to perform poorly on multi-hop questions, which require retrieving and reasoning over multiple pieces of supporting evidence. We introduce a new method called Multi-Meta-RAG, which uses database filtering with LLM-extracted metadata to improve RAG's selection of documents, drawn from various sources, that are relevant to the question. While database filtering is specific to a set of questions from a particular domain and format, we found that Multi-Meta-RAG greatly improves the results on the MultiHop-RAG benchmark. The code is available on GitHub.

Keywords: large language models, retrieval-augmented generation, multi-hop question answering

1 Introduction

Large Language Models (LLMs) have shown remarkable language understanding and generation abilities [8, 11]. However, they face two main challenges: static knowledge [6] and generative hallucination [3]. Retrieval-augmented generation (RAG) [4] is an established process for answering user questions over entire datasets. RAG also helps mitigate generative hallucination and provides the LLM with a source of new information on which it was not trained [9]. Real-world RAG pipelines often need to retrieve evidence from multiple documents simultaneously, a procedure known as multi-hop querying. Nevertheless, existing RAG applications struggle to answer multi-hop queries, which require retrieval and reasoning over numerous pieces of evidence [10]. In this paper, we present Multi-Meta-RAG: an improved RAG that uses a database filtering approach with LLM-extracted metadata and significantly improves the results on the MultiHop-RAG benchmark.

2 Related works

MultiHop-RAG [10] is a novel benchmarking dataset focused on multi-hop queries; it includes a knowledge base, questions, ground-truth responses, and supporting evidence. The news articles were selected from the period of September 26, 2023, to December 26, 2023, which extends beyond the knowledge cutoff of ChatGPT (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-0613). A trained language model extracted factual or opinion sentences from each news article. These factual sentences act as evidence for multi-hop queries. The selection method involves keeping articles with evidence that overlaps keywords with other articles, enabling the creation of multi-hop queries with answers drawn from numerous sources. Given the original evidence and its context, GPT-4 was used to rephrase the evidence; the rephrased statements are referred to as claims. Afterward, the bridge entity or topic is used to generate multi-hop queries.

For example, "Did Engadget report a discount on the 13.6-inch MacBook Air before The Verge reported a discount on Samsung Galaxy Buds 2?" is a typical query from the MultiHop-RAG dataset. Answering it requires evidence from both Engadget and The Verge. It also requires the LLM to figure out the temporal ordering of events. In addition to temporal queries like the one above, MultiHop-RAG has inference, comparison, and null (without a correct answer) queries.

Figure 1: A naive RAG implementation for MultiHop-RAG queries. RAG selects chunks from articles not asked about in the example query, which leads the LLM to give a wrong response.

In a typical RAG application, we use an external corpus that comprises multiple documents and serves as the knowledge base. Each document within this corpus is segmented into chunks. These chunks are then converted into vector representations using an embedding model and stored in a vector database. Given a user query, RAG typically retrieves the top-K chunks that best match the query. The retrieved chunks, combined with the query, are submitted to an LLM to generate a final response.
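The retrieval step described above can be illustrated with a minimal sketch. It assumes toy two-dimensional vectors in place of real embeddings and an in-memory list in place of a vector database; the names `cosine_similarity` and `retrieve_top_k` are hypothetical and are not taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, chunks, k):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (assumption): a real system would call an embedding model.
chunks = [
    {"text": "Engadget: MacBook Air discount", "embedding": [0.9, 0.1]},
    {"text": "The Verge: Galaxy Buds 2 discount", "embedding": [0.7, 0.3]},
    {"text": "Sky Sports: match report", "embedding": [0.0, 1.0]},
]
top = retrieve_top_k([1.0, 0.0], chunks, k=2)
```

Note that nothing in this naive ranking looks at where a chunk came from; only embedding similarity decides what is returned.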

For the MultiHop-RAG benchmark, scraped articles act as the knowledge base for the RAG application under test. The problem is that a naive RAG application fails to recognize that the query asks for information from specific sources. The top-K chunks that RAG retrieves often contain information from sources other than those mentioned in the query. The retrieved chunks might even miss relevant sources altogether, leading to a wrong response, as depicted in Figure 1.

3 Multi-Meta-RAG

3.1 Extraction of Relevant Query Metadata with the LLM

Table 1: Examples of extracted metadata filters using a few-shot prompt with corresponding queries. Correct usage of the $nin operator for the last query can be noted.

Query	Extracted Filter
Does the TechCrunch article report on new hiring at Starz, while the Engadget article discusses layoffs within the entire video game industry?	"source": { "$in": ["TechCrunch", "Engadget"] }
Did The Guardian’s report on December 12, 2023, contradict the Sporting News report regarding the performance and future outlook of Manchester United?	"published_at": { "$in": ["December 12, 2023"] }, "source": { "$in": ["The Guardian", "Sporting News"] }
Who is the individual facing a criminal trial on seven counts of fraud and conspiracy, previously likened to a financial icon but not by TechCrunch, and is accused by the prosecution of committing fraud for wealth, power, and influence?	"source": { "$nin": ["TechCrunch"] }

Each question in the MultiHop-RAG [10] benchmark follows a typical structure. Every query requests information from one or more news sources. In addition, some temporal queries require news articles from a particular date. We can extract the query filter via a helper LLM by constructing a few-shot prompt [1] with examples of extracted article sources and publishing dates as a filter. The prompt template is provided in Appendix A. We only run metadata extraction with ChatGPT (gpt-3.5-turbo-1106) because this additional RAG pipeline step must be quick and cheap. We found that this step takes 0.7 seconds on average for one query.
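The handling around the helper LLM call can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `build_prompt` and `parse_filter` are hypothetical helper names, the LLM call itself is omitted, and falling back to an empty filter on an unparsable answer is our assumption. Because the few-shot examples in the template are written as Python-style dictionaries with single quotes, the sketch parses the answer with `ast.literal_eval` rather than a JSON parser.

```python
import ast

def build_prompt(template: str, query: str) -> str:
    """Insert the user query into the <query> placeholder of the template."""
    return template.replace("<query>", query)

def parse_filter(answer: str) -> dict:
    """Parse the LLM answer into a filter dictionary.

    Returns an empty dict (assumption: no filtering) when the answer is
    not a well-formed dictionary literal.
    """
    try:
        parsed = ast.literal_eval(answer.strip())
    except (ValueError, SyntaxError, TypeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}

# A shortened stand-in for the template; the full text is in Appendix A.
prompt = build_prompt("Question: <query> Answer:", "Did Engadget report a discount?")
extracted = parse_filter("{'source': {'$in': ['The Verge', 'TechCrunch']}}")
```

`ast.literal_eval` only evaluates literals, so a malformed or unexpected answer cannot execute code.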

Two query metadata filter fields are extracted: article source and publication date. The complete filter is a dictionary that combines the two fields. Samples of extracted metadata filters can be found in Table 1. The primary filtering operator is $in, the only operator that appears in the examples of the few-shot prompt template. The LLM also correctly chooses the $nin operator for a tiny fraction of queries, even though the prompt contains no example of it. While the LLM only used $in and $nin for article sources, it sometimes chose other operators, such as $lt or $gt, for the publication date in a fraction of temporal queries. Because the number of such queries is small, we decided to use date filters only with the $in and $nin operators and with the most frequent date format (strftime format code %B %-d, %Y) for easier matching in the database. A source filter was extracted for every query, whereas a publishing date filter was extracted for 15.57% of queries, although 22.81% of the queries in the MultiHop-RAG dataset are temporal.
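A minimal sketch of the two normalization decisions above: keeping only the $in and $nin operators, and rendering dates in the "%B %-d, %Y" form. The helper names are hypothetical, and dropping unsupported operators outright is our reading of the text rather than a confirmed detail. Since the `%-d` code is not supported by `strftime` on every platform, the sketch builds the day number directly.

```python
from datetime import date

SUPPORTED_OPERATORS = {"$in", "$nin"}

def format_date(d: date) -> str:
    """Render a date like 'December 12, 2023' (month name, unpadded day, year).

    Equivalent to strftime('%B %-d, %Y') where '%-d' is available; the month
    name follows the current locale (English in the default C locale).
    """
    return f"{d.strftime('%B')} {d.day}, {d.year}"

def keep_supported_operators(raw_filter: dict) -> dict:
    """Drop every operator other than $in and $nin; drop fields left empty."""
    cleaned = {}
    for field, condition in raw_filter.items():
        if not isinstance(condition, dict):
            continue  # ignore malformed conditions
        kept = {op: v for op, v in condition.items() if op in SUPPORTED_OPERATORS}
        if kept:
            cleaned[field] = kept
    return cleaned

raw = {
    "source": {"$in": ["TechCrunch"]},
    "published_at": {"$lt": "October 7, 2023"},  # unsupported operator
}
cleaned = keep_supported_operators(raw)
```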

3.2 Improved Chunk Selection using Metadata Filtering

The extracted metadata can be used to enhance a RAG application (Figure 2). We split the articles in the MultiHop-RAG [10] knowledge base into chunks of 256 tokens each, using the LlamaIndex [5] sentence splitter as in the original MultiHop-RAG implementation. We picked a chunk overlap of 32, having found that a smaller chunk overlap leads to a better variety of unique chunks in the top-K selection than in the original implementation, which used the LlamaIndex default of 200. We selected the LangChain [2] Neo4j [7] vector store as the vector database because its index implementation recently (April 2024) started to support metadata filtering. We then convert the chunks using an embedding model and save the embeddings into the vector database, with the article metadata saved as node properties.
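The effect of chunk size and overlap can be illustrated with a simplified sliding-window splitter. This is a sketch under stated assumptions: it operates on an already tokenized list and ignores sentence boundaries, whereas the pipeline described above uses the LlamaIndex sentence splitter; `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(tokens, chunk_size=256, chunk_overlap=32):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Integers stand in for tokens so that overlaps are easy to inspect.
tokens = list(range(600))
chunks = split_into_chunks(tokens, chunk_size=256, chunk_overlap=32)
```

With a chunk size of 256, an overlap of 32 advances the window by 224 tokens, while an overlap of 200 advances it by only 56, so neighbouring chunks share most of their content.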

Likewise, in the retrieval stage, we transform a query using the same embedding model and retrieve the top-K most relevant chunks with the highest cosine similarity with the query embedding. We also filter the chunks with LLM-extracted metadata in the same stage. Similarly to MultiHop-RAG, we use a Reranker module (bge-reranker-large [13]) to examine the retrieval performance. After retrieving 20 corresponding chunks using the embedding model and metadata filter, we select the top-K chunks using the Reranker.
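The retrieval stage can be sketched in memory as follows. This is an illustration, not the Neo4j-backed implementation: the similarity of each chunk to the query is assumed to be precomputed, `rerank_score` is a hypothetical callable standing in for the reranker model, and `matches_filter` mimics only the $in and $nin operators.

```python
def matches_filter(metadata: dict, metadata_filter: dict) -> bool:
    """Check chunk metadata against a filter that uses $in / $nin operators."""
    for field, condition in metadata_filter.items():
        value = metadata.get(field)
        if "$in" in condition and value not in condition["$in"]:
            return False
        if "$nin" in condition and value in condition["$nin"]:
            return False
    return True

def retrieve(candidates, metadata_filter, rerank_score, k, n_candidates=20):
    """Filter by metadata, keep the most similar chunks, then rerank to top-k."""
    filtered = [c for c in candidates if matches_filter(c["metadata"], metadata_filter)]
    most_similar = sorted(filtered, key=lambda c: c["similarity"], reverse=True)
    shortlist = most_similar[:n_candidates]
    return sorted(shortlist, key=rerank_score, reverse=True)[:k]

# Toy data (assumption): similarity and reranker scores are made up.
candidates = [
    {"text": "A", "metadata": {"source": "Engadget"}, "similarity": 0.80},
    {"text": "B", "metadata": {"source": "Sky Sports"}, "similarity": 0.95},
    {"text": "C", "metadata": {"source": "The Verge"}, "similarity": 0.70},
]
toy_rerank_scores = {"A": 0.2, "B": 0.5, "C": 0.9}
source_filter = {"source": {"$in": ["Engadget", "The Verge"]}}
result = retrieve(
    candidates,
    source_filter,
    rerank_score=lambda c: toy_rerank_scores[c["text"]],
    k=2,
)
```

In this toy run, chunk B has the highest similarity but is excluded because its source is not named in the filter.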

4 Results

4.1 Chunk Retrieval Experiment

We selected the two best-performing embedding models from the original MultiHop-RAG experiment, bge-large-en-v1.5 [13] and voyage-02 [12], to test chunk retrieval performance with metadata filtering. The retrieved list of chunks is compared with the ground-truth evidence associated with each query, excluding the null queries, as they lack corresponding evidence. For evaluation, we assume the top-K chunks are retrieved and use the following metrics: Mean Average Precision at K (MAP@K), Mean Reciprocal Rank at K (MRR@K), and Hit Rate at K (Hits@K). MAP@K measures the average precision of the top-K retrieval across all queries. MRR@K calculates the average of the reciprocal ranks of the first relevant chunk within the top-K retrieved set for each query. Hits@K measures the proportion of evidence that appears in the top-K retrieved set. The experiment with Multi-Meta-RAG (Table 2) showed considerable improvement for both embedding models on all core metrics: MRR@10, MAP@10, Hits@10, and Hits@4. Most notably, Hits@4 for voyage-02 improved by 18%. This is important for practical RAG systems, where K should be as low as possible to account for context window limits and cost.
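The per-query quantities behind these metrics can be sketched as follows. The definitions below are common conventions and may differ in detail from the benchmark's evaluation script: chunks and evidence are matched here by hypothetical identifiers, retrieved identifiers are assumed to be unique, and average precision is normalized by min(number of relevant items, K). The reported metrics are the means of these values over all non-null queries.

```python
def reciprocal_rank_at_k(retrieved, relevant, k):
    """1 / rank of the first relevant item within the top-k, or 0 if none."""
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(retrieved, relevant, k):
    """Mean of precision@rank over the ranks at which relevant items occur."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0

def hit_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear within the top-k."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return sum(1 for item in relevant if item in top) / len(relevant)

# Toy query: two evidence chunks, found at ranks 2 and 4.
retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
rr = reciprocal_rank_at_k(retrieved, relevant, k=4)
ap = average_precision_at_k(retrieved, relevant, k=4)
hit = hit_at_k(retrieved, relevant, k=4)
```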

Table 2: Chunk retrieval experiment results. The top-10 chunks are selected with bge-reranker-large after the top-20 chunks are found via similarity search and database metadata filtering. A chunk size of 256 and a chunk overlap of 32 are used.

Baseline RAG [10]
Embedding	MRR@10	MAP@10	Hits@10	Hits@4
bge-large-en-v1.5 (reported)	0.563	0.4759	0.7183	0.6364
voyage-02 (reported)	0.586	0.4795	0.7467	0.6625
voyage-02 (repository sample)	0.6152	0.2718 ^a	0.7315	0.6683

Multi-Meta-RAG (ours)
Embedding	MRR@10	MAP@10	Hits@10	Hits@4
bge-large-en-v1.5	0.6574	0.3293	0.8909	0.7672
voyage-02	0.6748	0.3388	0.9042	0.792

Note a

We found a difference between reported MAP@10 and evaluated MAP@10 on the baseline voyage-02 retrieval sample file provided in the MultiHop-RAG repository. We take the evaluated value of 0.2718 for comparisons between RAGs.

4.2 LLM Response Generation Experiment

Table 3: Generation accuracy of LLMs with Metadata Filtering RAG (Top-6 chunks with voyage-02)

LLM	Accuracy: Ground-truth [10]	Accuracy: Baseline RAG [10]	Accuracy: Multi-Meta-RAG (ours)
GPT4 (gpt-4-0613)	0.89	0.56	0.63 ^b ^c
PaLM (text-bison@001)	0.74	0.47	0.61 ^b

As with the embeddings, we picked the two LLMs that performed best on ground-truth chunks in the initial MultiHop-RAG experiments: GPT-4 and Google PaLM. We achieved a substantial improvement in accuracy (Table 3) for both models compared to the baseline RAG implementation. Google PaLM accuracy improved by 26% from 0.47 to 0.61. Preliminary GPT-4 results also show a 12% increase from 0.56 to 0.63.

Note b

The accuracy measuring script was not provided in the MultiHop-RAG repository. We recreated the script ourselves; it produces results similar to the reported ones for ground-truth chunks with both models and in experiments with the sample voyage-02 retrieval list provided in the repository. We accept an answer as correct when the model answer appears in the gold answer or the gold answer appears in the model answer. Both answers are transformed to lowercase, stripped of leading and trailing whitespace, and have their punctuation removed before comparison. This script is used for benchmarking Multi-Meta-RAG.
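The matching rule described in this note can be sketched as follows. The function names are hypothetical, punctuation is taken to mean the ASCII characters in `string.punctuation`, and the guard against empty answers is our addition (an empty string is trivially contained in any other string), not something stated above.

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, strip surrounding whitespace, and remove ASCII punctuation."""
    lowered = answer.lower().strip()
    return lowered.translate(str.maketrans("", "", string.punctuation))

def is_correct(model_answer: str, gold_answer: str) -> bool:
    """Accept when either normalized answer contains the other."""
    model, gold = normalize(model_answer), normalize(gold_answer)
    if not model or not gold:
        return False  # assumption: an empty answer never counts as correct
    return model in gold or gold in model

def accuracy(pairs) -> float:
    """Share of (model_answer, gold_answer) pairs judged correct."""
    pairs = list(pairs)
    return sum(is_correct(m, g) for m, g in pairs) / len(pairs) if pairs else 0.0

score = accuracy([("Yes.", "yes"), ("No", "Yes")])
```

Containment matching is lenient: a long model answer that merely mentions the gold answer is also accepted.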

Note c

GPT-4 accuracy is a preliminary result based on 50 queries (2% of all queries). Results on the complete dataset are pending.

5 Conclusion

In this paper, we introduce Multi-Meta-RAG, a method of improving RAG for multi-hop queries using database filtering with LLM-extracted metadata. Multi-Meta-RAG considerably improves results in both the chunk retrieval and the LLM generation experiments, while also being straightforward and explainable. The proposed solution still has some limitations. Firstly, extracting metadata requires a set of queries from a particular domain and question format, as well as additional inference time. Secondly, it requires the manual creation of a prompt template that extracts the metadata from the query. Thirdly, while the improved results are encouraging, they still fall considerably below the results achieved by feeding the LLM precise ground-truth facts. Future work may include testing more generic prompt templates for metadata extraction on a variety of multi-hop datasets from different domains.

References

[1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)
[2] Chase, H.: LangChain (Oct 2022), https://github.com/langchain-ai/langchain
[3] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions (2023)
[4] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9459–9474. Curran Associates, Inc. (2020)
[5] Liu, J.: LlamaIndex (Nov 2022). https://doi.org/10.5281/zenodo.1234, https://github.com/jerryjliu/llama_index
[6] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., Scialom, T.: Augmented language models: a survey (2023)
[7] Neo4j, Inc.: Neo4j graph database, https://neo4j.com/product/neo4j-graph-database
[8] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 27730–27744. Curran Associates, Inc. (2022)
[9] Shuster, K., Poff, S., Chen, M., Kiela, D., Weston, J.: Retrieval augmentation reduces hallucination in conversation. In: Moens, M., Huang, X., Specia, L., Yih, S.W. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. pp. 3784–3803. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-emnlp.320
[10] Tang, Y., Yang, Y.: Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries (2024)
[11] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: Llama: Open and efficient foundation language models (2023)
[12] Voyage AI: Voyage ai cutting-edge embedding and rerankers, https://www.voyageai.com
[13] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., Nie, J.Y.: C-pack: Packaged resources to advance general chinese embedding (2024)

Appendix A Metadata Extraction Prompt Template

Given the question, extract the metadata to filter the database about article sources. Avoid stopwords.
The sources can only be from the list: ['Yardbarker', 'The Guardian', 'Revyuh Media', 'The Independent - Sports', 'Wired', 'Sport Grill', 'Hacker News', 'Iot Business News', 'Insidesport', 'Sporting News', 'Seeking Alpha', 'The Age', 'CBSSports.com', 'The Sydney Morning Herald', 'FOX News - Health', 'Science News For Students', 'Polygon', 'The Independent - Life and Style', 'FOX News - Entertainment', 'The Verge', 'Business Line', 'The New York Times', 'The Roar | Sports Writers Blog', 'Sportskeeda', 'BBC News - Entertainment & Arts', 'Business World', 'BBC News - Technology', 'Essentially Sports', 'Mashable', 'Advanced Science News', 'TechCrunch', 'Financial Times', 'Music Business Worldwide', 'The Independent - Travel', 'FOX News - Lifestyle', 'TalkSport', 'Yahoo News', 'Scitechdaily | Science Space And Technology News 2017', 'Globes English | Israel Business Arena', 'Wide World Of Sports', 'Rivals', 'Fortune', 'Zee Business', 'Business Today | Latest Stock Market And Economy News India', 'Sky Sports', 'Cnbc | World Business News Leader', 'Eos: Earth And Space Science News', 'Live Science: The Most Interesting Articles', 'Engadget']
Examples to follow:
Question: Who is the individual associated with the cryptocurrency industry facing a criminal trial on fraud and conspiracy charges, as reported by both The Verge and TechCrunch, and is accused by prosecutors of committing fraud for personal gain?
Answer: {'source': {'$in': ['The Verge', 'TechCrunch']}}
Question: After the TechCrunch report on October 7, 2023, concerning Dave Clark's comments on Flexport, and the subsequent TechCrunch article on October 30, 2023, regarding Ryan Petersen's actions at Flexport, was there a change in the nature of the events reported?
Answer: {'source': {'$in': ['TechCrunch']}, 'published_at': {'$in': ['October 7, 2023', 'October 30, 2023']}}
Question: Which company, known for its dominance in the e-reader space and for offering exclusive invite-only deals during sales events, faced a stock decline due to an antitrust lawsuit reported by 'The Sydney Morning Herald' and discussed by sellers in a 'Cnbc | World Business News Leader' article?
Answer: {'source': {'$in': ['The Sydney Morning Herald', 'Cnbc | World Business News Leader']}}
If you detect multiple queries, return the answer for the first.
Now it is your turn:
Question: <query>
Answer: