Search | arXiv e-print repository

KazQAD: Kazakh Open-Domain Question Answering Dataset

Authors: Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

Abstract: We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, an… ▽ More We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389 MRR = 0.382), reading comprehension (EM = 38.5 F1 = 54.2), and full ODQA (EM = 17.8 F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there should still be ample room for improvement. We also show that the current OpenAI's ChatGPTv3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD. △ Less

Submitted 5 April, 2024; originally announced April 2024.

Comments: To appear in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

arXiv:2402.17018 [pdf, other]

A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection

Authors: Leonid Boytsov, Ameya Joshi, Filipe Condessa

Abstract: We tested front-end enhanced neural models where a frozen classifier was prepended by a differentiable and fully convolutional model with a skip connection. By training them using a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks including APGD and FAB-T attacks from the AutoAttack… ▽ More We tested front-end enhanced neural models where a frozen classifier was prepended by a differentiable and fully convolutional model with a skip connection. By training them using a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks including APGD and FAB-T attacks from the AutoAttack package, which we attributed to gradient masking. The gradient masking phenomenon is not new, but the degree of masking was quite remarkable for fully differentiable models that did not have gradient-shattering components such as JPEG compression or components that are expected to cause diminishing gradients. Though black box attacks can be partially effective against gradient masking, they are easily defeated by combining models into randomized ensembles. We estimate that such ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet despite having virtually zero accuracy under adaptive attacks. Adversarial training of the backbone classifier can further increase resistance of the front-end enhanced model to gradient attacks. On CIFAR10, the respective randomized ensemble achieved 90.8$\pm 2.5$% (99% CI) accuracy under AutoAttack while having only 18.2$\pm 3.6$% accuracy under the adaptive attack. We do not establish SOTA in adversarial robustness. Instead, we make methodological contributions and further supports the thesis that adaptive attacks designed with the complete knowledge of model architecture are crucial in demonstrating model robustness and that even the so-called white-box gradient attacks can have limited applicability. Although gradient attacks can be complemented with black-box attack such as the SQUARE attack or the zero-order PGD, black-box attacks can be weak against randomized ensembles, e.g., when ensemble models mask gradients. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2301.02998 [pdf, other]

InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers

Authors: Leonid Boytsov, Preksha Patel, Vivek Sourabh, Riddhi Nisar, Sayani Kundu, Ramya Ramanathan, Eric Nyberg

Abstract: We carried out a reproducibility study of InPars, which is a method for unsupervised training of neural rankers (Bonifacio et al., 2022). As a by-product, we developed InPars-light, which is a simple-yet-effective modification of InPars. Unlike InPars, InPars-light uses 7x-100x smaller ranking models and only a freely available language model BLOOM, which -- as we found out -- produced more accura… ▽ More We carried out a reproducibility study of InPars, which is a method for unsupervised training of neural rankers (Bonifacio et al., 2022). As a by-product, we developed InPars-light, which is a simple-yet-effective modification of InPars. Unlike InPars, InPars-light uses 7x-100x smaller ranking models and only a freely available language model BLOOM, which -- as we found out -- produced more accurate rankers compared to a proprietary GPT-3 model. On all five English retrieval collections (used in the original InPars study) we obtained substantial (7%-30%) and statistically significant improvements over BM25 (in nDCG and MRR) using only a 30M parameter six-layer MiniLM-30M ranker and a single three-shot prompt. In contrast, in the InPars study only a 100x larger monoT5-3B model consistently outperformed BM25, whereas their smaller monoT5-220M model (which is still 7x larger than our MiniLM ranker) outperformed BM25 only on MS MARCO and TREC DL 2020. In the same three-shot prompting scenario, our 435M parameter DeBERTA v3 ranker was at par with the 7x larger monoT5-3B (average gain over BM25 of 1.3 vs 1.32): In fact, on three out of five datasets, DeBERTA slightly outperformed monoT5-3B. Finally, these good results were achieved by re-ranking only 100 candidate documents compared to 1000 used by Bonifacio et al. (2022). We believe that InPars-light is the first truly cost-effective prompt-based unsupervised recipe to train and deploy neural ranking models that outperform BM25. Our code and data is publicly available. https://github.com/searchivarius/inpars_light/ △ Less

Submitted 20 February, 2024; v1 submitted 8 January, 2023; originally announced January 2023.

arXiv:2207.01262 [pdf, other]

Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding

Authors: Leonid Boytsov, David Akinpelu, Tianyi Lin, Fangwei Gao, Yutian Zhao, Jeffrey Huang, Nipun Katyal, Eric Nyberg

Abstract: We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with a simple FirstP baseline, which applies the same model to the truncated input (at most 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated both the zero-shot transferred and fine-tuned models. On MS MARCO, TREC DLs, an… ▽ More We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with a simple FirstP baseline, which applies the same model to the truncated input (at most 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated both the zero-shot transferred and fine-tuned models. On MS MARCO, TREC DLs, and Robust04 no long-document model outperformed FirstP by more than 5% in NDCG and MRR (when averaged over all test sets). We conjectured this was not due to models' inability to process long context, but due to a positional bias of relevant passages, whose distribution was skewed towards the beginning of documents. We found direct evidence of this bias in some test sets, which motivated us to create MS MARCO FarRelevant (based on MS MARCO Passages) where the relevant passages were not present among the first 512 tokens. Unlike standard collections where we saw both little benefit from incorporating longer contexts and limited variability in model performance (within a few %), experiments on MS MARCO FarRelevant uncovered dramatic differences among models. The FirstP models performed roughly at the random-baseline level in both zero-shot and fine-tuning scenarios. Simple aggregation models including MaxP and PARADE Attention had good zero-shot accuracy, but benefited little from fine-tuning. Most other models had poor zero-shot performance (sometimes at a random baseline level), but outstripped MaxP by as much as 13-28% after fine-tuning. Thus, the positional bias not only diminishes benefits of processing longer document contexts, but also leads to model overfitting to positional bias and performing poorly in a zero-shot setting when the distribution of relevant passages changes substantially. We make our software and data available. △ Less

Submitted 16 June, 2024; v1 submitted 4 July, 2022; originally announced July 2022.

arXiv:2205.06154 [pdf, other]

Smooth-Reduce: Leveraging Patches for Improved Certified Robustness

Authors: Ameya Joshi, Minh Pham, Minsu Cho, Leonid Boytsov, Filipe Condessa, J. Zico Kolter, Chinmay Hegde

Abstract: Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved… ▽ More Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved classifier certificates. Our algorithm classifies overlap** patches extracted from an input image, and aggregates the predicted logits to certify a larger radius around the input. We study two aggregation schemes -- max and mean -- and show that both approaches provide better certificates in terms of certified accuracy, average certified radii and abstention rates as compared to concurrent approaches. We also provide theoretical guarantees for such certificates, and empirically show significant improvements over other randomized smoothing methods that require expensive retraining. Further, we extend our approach to videos and provide meaningful certificates for video classifiers. A project page can be found at https://nyu-dice-lab.github.io/SmoothReduce/ △ Less

Submitted 12 May, 2022; originally announced May 2022.

arXiv:2204.06457 [pdf, other]

doi 10.1007/978-3-031-28241-6_4

The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer

Authors: Pavel Efimov, Leonid Boytsov, Elena Arslanova, Pavel Braslavski

Abstract: Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European langu… ▽ More Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European languages. In contrast we experiment with a typologically diverse set of languages (Spanish, Russian, Vietnamese, and Hindi) and extend their original implementations to new tasks (XSR, NER, and QA) and an additional training regime (continual learning). Our study reproduced gains in NLI for four languages, showed improved NER, XSR, and cross-lingual QA results in three languages (though some cross-lingual QA gains were not statistically significant), while mono-lingual QA performance never improved and sometimes degraded. Analysis of distances between contextualized embeddings of related and unrelated words (across languages) showed that fine-tuning leads to "forgetting" some of the cross-lingual alignment information. Based on this observation, we further improved NLI performance using continual learning. △ Less

Submitted 31 October, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: Presented at ECIR 2023

arXiv:2103.03335 [pdf, other]

doi 10.1145/3404835.3463093

A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models

Authors: Iurii Mokrii, Leonid Boytsov, Pavel Braslavski

Abstract: Due to high annotation costs making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In co… ▽ More Due to high annotation costs making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves reliability of our results. Furthermore, since source datasets licences often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels -- possibly with subsequent fine-tuning using a modest number of annotated queries -- can produce a competitive or better model compared to transfer learning. Yet, it is necessary to improve the stability and/or effectiveness of the few-shot training, which, sometimes, can degrade performance of a pretrained model. △ Less

Submitted 21 November, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

Journal ref: SIGIR 2021 (44th International ACM SIGIR Conference on Research and Development in Information Retrieval)

arXiv:2102.06815 [pdf, ps, other]

Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits

Authors: Leonid Boytsov, Zico Kolter

Abstract: We study the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. We use the neural Model1 as an aggregator layer applied to context-free or contextualized query/document embeddings. This new approach to design a neural ranking system has benefits for effectiveness, efficiency, and interpretability. Speci… ▽ More We study the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. We use the neural Model1 as an aggregator layer applied to context-free or contextualized query/document embeddings. This new approach to design a neural ranking system has benefits for effectiveness, efficiency, and interpretability. Specifically, we show that adding an interpretable neural Model 1 layer on top of BERT-based contextualized embeddings (1) does not decrease accuracy and/or efficiency; and (2) may overcome the limitation on the maximum sequence length of existing BERT models. The context-free neural Model 1 is less effective than a BERT-based ranking model, but it can run efficiently on a CPU (without expensive index-time precomputation or query-time operations on large tensors). Using Model 1 we produced best neural and non-neural runs on the MS MARCO document ranking leaderboard in late 2020. △ Less

Submitted 17 March, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

Journal ref: ECIR 2021 (The 43rd European Conference on Information Retrieval)

arXiv:2012.08020 [pdf, ps, other]

Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard

Authors: Leonid Boytsov

Abstract: This short document describes a traditional IR system that achieved MRR@100 equal to 0.298 on the MS MARCO Document Ranking leaderboard (on 2020-12-06). Although inferior to most BERT-based models, it outperformed several neural runs (as well as all non-neural ones), including two submissions that used a large pretrained Transformer model for re-ranking. We provide software and data to reproduce o… ▽ More This short document describes a traditional IR system that achieved MRR@100 equal to 0.298 on the MS MARCO Document Ranking leaderboard (on 2020-12-06). Although inferior to most BERT-based models, it outperformed several neural runs (as well as all non-neural ones), including two submissions that used a large pretrained Transformer model for re-ranking. We provide software and data to reproduce our results. △ Less

Submitted 17 March, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

arXiv:2010.14848 [pdf, other]

Flexible retrieval with NMSLIB and FlexNeuART

Authors: Leonid Boytsov, Eric Nyberg

Abstract: Our objective is to introduce to the NLP community an existing k-NN search library NMSLIB, a new retrieval toolkit FlexNeuART, as well as their integration capabilities. NMSLIB, while being one the fastest k-NN search libraries, is quite generic and supports a variety of distance/similarity functions. Because the library relies on the distance-based structure-agnostic algorithms, it can be further… ▽ More Our objective is to introduce to the NLP community an existing k-NN search library NMSLIB, a new retrieval toolkit FlexNeuART, as well as their integration capabilities. NMSLIB, while being one the fastest k-NN search libraries, is quite generic and supports a variety of distance/similarity functions. Because the library relies on the distance-based structure-agnostic algorithms, it can be further extended by adding new distances. FlexNeuART is a modular, extendible and flexible toolkit for candidate generation in IR and QA applications, which supports mixing of classic and neural ranking signals. FlexNeuART can efficiently retrieve mixed dense and sparse representations (with weights learned from training data), which is achieved by extending NMSLIB. In that, other retrieval systems work with purely sparse representations (e.g., Lucene), purely dense representations (e.g., FAISS and Annoy), or only perform mixing at the re-ranking stage. △ Less

Submitted 17 November, 2020; v1 submitted 28 October, 2020; originally announced October 2020.

Journal ref: 2nd EMNLP Workshop for Natural Language Processing Open Source Software (NLP-OSS), 2020

arXiv:1912.09723 [pdf, other]

doi 10.1007/978-3-030-58219-7_1

SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis

Authors: Pavel Efimov, Andrey Chertok, Leonid Boytsov, Pavel Braslavski

Abstract: SberQuAD -- a large scale analog of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results. SberQuAD -- a large scale analog of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results. △ Less

Submitted 2 May, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

arXiv:1910.03539 [pdf, other]

doi 10.1007/978-3-030-32047-8_7

Pruning Algorithms for Low-Dimensional Non-metric k-NN Search: A Case Study

Authors: Leonid Boytsov, Eric Nyberg

Abstract: We focus on low-dimensional non-metric search, where tree-based approaches permit efficient and accurate retrieval while having short indexing time. These methods rely on space partitioning and require a pruning rule to avoid visiting unpromising parts. We consider two known data-driven approaches to extend these rules to non-metric spaces: TriGen and a piece-wise linear approximation of the pruni… ▽ More We focus on low-dimensional non-metric search, where tree-based approaches permit efficient and accurate retrieval while having short indexing time. These methods rely on space partitioning and require a pruning rule to avoid visiting unpromising parts. We consider two known data-driven approaches to extend these rules to non-metric spaces: TriGen and a piece-wise linear approximation of the pruning rule. We propose and evaluate two adaptations of TriGen to non-symmetric similarities (TriGen does not support non-symmetric distances). We also evaluate a hybrid of TriGen and the piece-wise linear approximation pruning. We find that this hybrid approach is often more effective than either of the pruning rules. We make our software publicly available. △ Less

Submitted 8 October, 2019; originally announced October 2019.

arXiv:1910.03534 [pdf, other]

doi 10.1007/978-3-030-32047-8_12

Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs

Authors: Leonid Boytsov, Eric Nyberg

Abstract: We demonstrate that a graph-based search algorithm-relying on the construction of an approximate neighborhood graph-can directly work with challenging non-metric and/or non-symmetric distances without resorting to metric-space map** and/or distance symmetrization, which, in turn, lead to substantial performance degradation. Although the straightforward metrization and symmetrization is usually i… ▽ More We demonstrate that a graph-based search algorithm-relying on the construction of an approximate neighborhood graph-can directly work with challenging non-metric and/or non-symmetric distances without resorting to metric-space map** and/or distance symmetrization, which, in turn, lead to substantial performance degradation. Although the straightforward metrization and symmetrization is usually ineffective, we find that constructing an index using a modified, e.g., symmetrized, distance can improve performance. This observation paves a way to a new line of research of designing index-specific graph-construction distance functions. △ Less

Submitted 8 October, 2019; originally announced October 2019.

arXiv:1711.03066 [pdf, other]

A Simple Derivation of the Heap's Law from the Generalized Zipf's Law

Authors: Leonid Boytsov

Abstract: I reproduce a rather simple formal derivation of the Heaps' law from the generalized Zipf's law, which I previously published in Russian. I reproduce a rather simple formal derivation of the Heaps' law from the generalized Zipf's law, which I previously published in Russian. △ Less

Submitted 8 November, 2017; originally announced November 2017.

arXiv:1610.10001 [pdf, other]

doi 10.1145/2983323.2983815

Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

Authors: Leonid Boytsov, David Novak, Yury Malkov, Eric Nyberg

Abstract: Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-forc… ▽ More Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection is available online. △ Less

Submitted 31 October, 2016; originally announced October 2016.

arXiv:1508.05470 [pdf, ps, other]

Non-Metric Space Library Manual

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak

Abstract: This document covers a library for fast similarity (k-NN)search. It describes only search methods and distances (spaces). Details about building, installing, Python bindings can be found online:https://github.com/searchivarius/nmslib/tree/v1.8/. Even though the library contains a variety of exact metric-space access methods, our main focus is on more generic and approximate search methods, in part… ▽ More This document covers a library for fast similarity (k-NN)search. It describes only search methods and distances (spaces). Details about building, installing, Python bindings can be found online:https://github.com/searchivarius/nmslib/tree/v1.8/. Even though the library contains a variety of exact metric-space access methods, our main focus is on more generic and approximate search methods, in particular, on methods for non-metric spaces. NMSLIB is possibly the first library with a principled support for non-metric space searching. △ Less

Submitted 6 June, 2019; v1 submitted 22 August, 2015; originally announced August 2015.

Comments: Methodology paper

arXiv:1506.03163 [pdf, other]

Permutation Search Methods are Efficient, Yet Faster Search is Possible

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Eric Nyberg

Abstract: We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original poi… ▽ More We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original points. Thus, it should be possible to efficiently retrieve most true nearest neighbors by examining only a tiny subset of data points whose permutations are similar to the permutation of a query. We further test this assumption by carrying out an extensive experimental evaluation where permutation methods are pitted against state-of-the art benchmarks (the multi-probe LSH, the VP-tree, and proximity-graph based retrieval) on a variety of realistically large data set from the image and textual domain. The focus is on the high-accuracy retrieval methods for generic spaces. Additionally, we assume that both data and indices are stored in main memory. We find permutation methods to be reasonably efficient and describe a setup where these methods are most useful. To ease reproducibility, we make our software and data sets publicly available. △ Less

Submitted 31 October, 2016; v1 submitted 10 June, 2015; originally announced June 2015.

arXiv:1401.6399 [pdf, other]

doi 10.1002/spe.2326

SIMD Compression and the Intersection of Sorted Integers

Authors: Daniel Lemire, Leonid Boytsov, Nathan Kurz

Abstract: Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent proces… ▽ More Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Gallo** algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach. △ Less

Submitted 20 April, 2020; v1 submitted 24 January, 2014; originally announced January 2014.

Journal ref: Software: Practice and Experience Volume 46, Issue 6, pages 723-749, June 2016

arXiv:1209.2137 [pdf, other]

doi 10.1002/spe.2203

Decoding billions of integers per second through vectorization

Authors: Daniel Lemire, Leonid Boytsov

Abstract: In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar natu… ▽ More In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding. △ Less

Submitted 30 January, 2021; v1 submitted 10 September, 2012; originally announced September 2012.

Comments: For software, see https://github.com/lemire/FastPFor, For data, see http://boytsov.info/datasets/clueweb09gap/

Journal ref: Software: Practice & Experience 45 (1), 2015

Showing 1–19 of 19 results for author: Boytsov, L