Search | arXiv e-print repository

DeeperImpact: Optimizing Sparse Learned Index Structures

Authors: Soyuj Basnet, Jerry Gou, Antonio Mallia, Torsten Suel

Abstract: A lot of recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while kee** the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond those of traditional inverted index-based rankers, there is still a gap in effectiveness to the best dense retrievers, or even to sp… ▽ More A lot of recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while kee** the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond those of traditional inverted index-based rankers, there is still a gap in effectiveness to the best dense retrievers, or even to sparse methods that leverage more expensive optimizations such as query expansion and query term weighting. We focus on narrowing this gap by revisiting and optimizing DeepImpact, a sparse retrieval approach that uses DocT5Query for document expansion followed by a BERT language model to learn impact scores for document terms. We first reinvestigate the expansion process and find that the recently proposed Doc2Query -- query filtration does not enhance retrieval quality when used with DeepImpact. Instead, substituting T5 with a fine-tuned Llama 2 model for query prediction results in a considerable improvement. Subsequently, we study training strategies that have proven effective for other models, in particular the use of hard negatives, distillation, and pre-trained CoCondenser model initialization. Our results substantially narrow the effectiveness gap with the most effective versions of SPLADE. △ Less

Submitted 6 July, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.01117 [pdf, other]

Faster Learned Sparse Retrieval with Block-Max Pruning

Authors: Antonio Mallia, Torten Suel, Nicola Tonellotto

Abstract: Learned sparse retrieval systems aim to combine the effectiveness of contextualized language models with the scalability of conventional data structures such as inverted indexes. Nevertheless, the indexes generated by these systems exhibit significant deviations from the ones that use traditional retrieval models, leading to a discrepancy in the performance of existing query optimizations that wer… ▽ More Learned sparse retrieval systems aim to combine the effectiveness of contextualized language models with the scalability of conventional data structures such as inverted indexes. Nevertheless, the indexes generated by these systems exhibit significant deviations from the ones that use traditional retrieval models, leading to a discrepancy in the performance of existing query optimizations that were specifically developed for traditional structures. These disparities arise from structural variations in query and document statistics, including sub-word tokenization, leading to longer queries, smaller vocabularies, and different score distributions within posting lists. This paper introduces Block-Max Pruning (BMP), an innovative dynamic pruning strategy tailored for indexes arising in learned sparse retrieval environments. BMP employs a block filtering mechanism to divide the document space into small, consecutive document ranges, which are then aggregated and sorted on the fly, and fully processed only as necessary, guided by a defined safe early termination criterion or based on approximate retrieval requirements. Through rigorous experimentation, we show that BMP substantially outperforms existing dynamic pruning strategies, offering unparalleled efficiency in safe retrieval contexts and improved tradeoffs between precision and efficiency in approximate retrieval tasks. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: SIGIR 2024 (short paper track)

arXiv:2401.06703 [pdf, other]

Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies

Authors: Puxuan Yu, Antonio Mallia, Matthias Petri

Abstract: We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to… ▽ More We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems. △ Less

Submitted 12 January, 2024; originally announced January 2024.

Comments: ECIR 2024 Full Paper

arXiv:2204.11314 [pdf, other]

Faster Learned Sparse Retrieval with Guided Traversal

Authors: Antonio Mallia, Joel Mackenzie, Torsten Suel, Nicola Tonellotto

Abstract: Neural information retrieval architectures based on transformers such as BERT are able to significantly improve system effectiveness over traditional sparse models such as BM25. Though highly effective, these neural approaches are very expensive to run, making them difficult to deploy under strict latency constraints. To address this limitation, recent studies have proposed new families of learned… ▽ More Neural information retrieval architectures based on transformers such as BERT are able to significantly improve system effectiveness over traditional sparse models such as BM25. Though highly effective, these neural approaches are very expensive to run, making them difficult to deploy under strict latency constraints. To address this limitation, recent studies have proposed new families of learned sparse models that try to match the effectiveness of learned dense models, while leveraging the traditional inverted index data structure for efficiency. Current learned sparse models learn the weights of terms in documents and, sometimes, queries; however, they exploit different vocabulary structures, document expansion techniques, and query expansion strategies, which can make them slower than traditional sparse models such as BM25. In this work, we propose a novel indexing and query processing technique that exploits a traditional sparse model's "guidance" to efficiently traverse the index, allowing the more effective learned model to execute fewer scoring operations. Our experiments show that our guided processing heuristic is able to boost the efficiency of the underlying learned sparse model by a factor of four without any measurable loss of effectiveness. △ Less

Submitted 24 April, 2022; originally announced April 2022.

Comments: Accepted at SIGIR 2022

arXiv:2104.12016 [pdf, other]

Learning Passage Impacts for Inverted Indexes

Authors: Antonio Mallia, Omar Khattab, Nicola Tonellotto, Torsten Suel

Abstract: Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to exis… ▽ More Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and, using a contextualized language model, directly estimates the semantic importance of tokens in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches by up to 17% on effectiveness metrics w.r.t. DocT5Query, and, when deployed in a re-ranking scenario, can reach the same effectiveness of state-of-the-art approaches with up to 5.1x speedup in efficiency. △ Less

Submitted 24 April, 2021; originally announced April 2021.

arXiv:2003.08276 [pdf, other]

Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

Authors: Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries

Abstract: There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers tha… ▽ More There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward. △ Less

Submitted 18 March, 2020; originally announced March 2020.

arXiv:1808.02831 [pdf, other]

Debunking Fake News One Feature at a Time

Authors: Melanie Tosik, Antonio Mallia, Kedar Gangopadhyay

Abstract: Identifying the stance of a news article body with respect to a certain headline is the first step to automated fake news detection. In this paper, we introduce a 2-stage ensemble model to solve the stance detection task. By using only hand-crafted features as input to a gradient boosting classifier, we are able to achieve a score of 9161.5 out of 11651.25 (78.63%) on the official Fake News Challe… ▽ More Identifying the stance of a news article body with respect to a certain headline is the first step to automated fake news detection. In this paper, we introduce a 2-stage ensemble model to solve the stance detection task. By using only hand-crafted features as input to a gradient boosting classifier, we are able to achieve a score of 9161.5 out of 11651.25 (78.63%) on the official Fake News Challenge (Stage 1) dataset. We identify the most useful features for detecting fake news and discuss how sampling techniques can be used to improve recall accuracy on a highly imbalanced dataset. △ Less

Submitted 8 August, 2018; originally announced August 2018.

Showing 1–7 of 7 results for author: Mallia, A