Showing 1–19 of 19 results for author: Boytsov, L

  1. arXiv:2404.04487  [pdf, other]

    cs.CL

    KazQAD: Kazakh Open-Domain Question Answering Dataset

    Authors: Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

    Abstract: We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, an…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: To appear in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

  2. arXiv:2402.17018  [pdf, other]

    cs.LG cs.AI cs.CV

    A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection

    Authors: Leonid Boytsov, Ameya Joshi, Filipe Condessa

    Abstract: We tested front-end enhanced neural models where a frozen classifier was prepended by a differentiable and fully convolutional model with a skip connection. By training them using a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks including APGD and FAB-T attacks from the AutoAttack…

    Submitted 26 February, 2024; originally announced February 2024.

  3. arXiv:2301.02998  [pdf, other]

    cs.IR cs.AI cs.CL

    InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers

    Authors: Leonid Boytsov, Preksha Patel, Vivek Sourabh, Riddhi Nisar, Sayani Kundu, Ramya Ramanathan, Eric Nyberg

    Abstract: We carried out a reproducibility study of InPars, which is a method for unsupervised training of neural rankers (Bonifacio et al., 2022). As a by-product, we developed InPars-light, which is a simple-yet-effective modification of InPars. Unlike InPars, InPars-light uses 7x-100x smaller ranking models and only a freely available language model BLOOM, which -- as we found out -- produced more accura…

    Submitted 20 February, 2024; v1 submitted 8 January, 2023; originally announced January 2023.

  4. arXiv:2207.01262  [pdf, other]

    cs.IR cs.CL

    Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding

    Authors: Leonid Boytsov, David Akinpelu, Tianyi Lin, Fangwei Gao, Yutian Zhao, Jeffrey Huang, Nipun Katyal, Eric Nyberg

    Abstract: We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with a simple FirstP baseline, which applies the same model to the truncated input (at most 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated both the zero-shot transferred and fine-tuned models. On MS MARCO, TREC DLs, an…

    Submitted 16 June, 2024; v1 submitted 4 July, 2022; originally announced July 2022.

  5. arXiv:2205.06154  [pdf, other]

    cs.LG cs.CV

    Smooth-Reduce: Leveraging Patches for Improved Certified Robustness

    Authors: Ameya Joshi, Minh Pham, Minsu Cho, Leonid Boytsov, Filipe Condessa, J. Zico Kolter, Chinmay Hegde

    Abstract: Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved…

    Submitted 12 May, 2022; originally announced May 2022.

  6. The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer

    Authors: Pavel Efimov, Leonid Boytsov, Elena Arslanova, Pavel Braslavski

    Abstract: Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European langu…

    Submitted 31 October, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

    Comments: Presented at ECIR 2023

  7. A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models

    Authors: Iurii Mokrii, Leonid Boytsov, Pavel Braslavski

    Abstract: Due to high annotation costs, making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In co…

    Submitted 21 November, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Journal ref: SIGIR 2021 (44th International ACM SIGIR Conference on Research and Development in Information Retrieval)

  8. arXiv:2102.06815  [pdf, ps, other]

    cs.CL cs.IR

    Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits

    Authors: Leonid Boytsov, Zico Kolter

    Abstract: We study the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. We use the neural Model1 as an aggregator layer applied to context-free or contextualized query/document embeddings. This new approach to design a neural ranking system has benefits for effectiveness, efficiency, and interpretability. Speci…

    Submitted 17 March, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

    Journal ref: ECIR 2021 (The 43rd European Conference on Information Retrieval)

  9. arXiv:2012.08020  [pdf, ps, other]

    cs.CL cs.IR

    Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard

    Authors: Leonid Boytsov

    Abstract: This short document describes a traditional IR system that achieved MRR@100 equal to 0.298 on the MS MARCO Document Ranking leaderboard (on 2020-12-06). Although inferior to most BERT-based models, it outperformed several neural runs (as well as all non-neural ones), including two submissions that used a large pretrained Transformer model for re-ranking. We provide software and data to reproduce o…

    Submitted 17 March, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

  10. arXiv:2010.14848  [pdf, other]

    cs.IR

    Flexible retrieval with NMSLIB and FlexNeuART

    Authors: Leonid Boytsov, Eric Nyberg

    Abstract: Our objective is to introduce to the NLP community an existing k-NN search library NMSLIB, a new retrieval toolkit FlexNeuART, as well as their integration capabilities. NMSLIB, while being one of the fastest k-NN search libraries, is quite generic and supports a variety of distance/similarity functions. Because the library relies on the distance-based structure-agnostic algorithms, it can be further…

    Submitted 17 November, 2020; v1 submitted 28 October, 2020; originally announced October 2020.

    Journal ref: 2nd EMNLP Workshop for Natural Language Processing Open Source Software (NLP-OSS), 2020

  11. SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis

    Authors: Pavel Efimov, Andrey Chertok, Leonid Boytsov, Pavel Braslavski

    Abstract: SberQuAD -- a large-scale analog of Stanford SQuAD in the Russian language -- is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results.

    Submitted 2 May, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

  12. Pruning Algorithms for Low-Dimensional Non-metric k-NN Search: A Case Study

    Authors: Leonid Boytsov, Eric Nyberg

    Abstract: We focus on low-dimensional non-metric search, where tree-based approaches permit efficient and accurate retrieval while having short indexing time. These methods rely on space partitioning and require a pruning rule to avoid visiting unpromising parts. We consider two known data-driven approaches to extend these rules to non-metric spaces: TriGen and a piece-wise linear approximation of the pruni…

    Submitted 8 October, 2019; originally announced October 2019.

  13. Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs

    Authors: Leonid Boytsov, Eric Nyberg

    Abstract: We demonstrate that a graph-based search algorithm -- relying on the construction of an approximate neighborhood graph -- can directly work with challenging non-metric and/or non-symmetric distances without resorting to metric-space mapping and/or distance symmetrization, which, in turn, lead to substantial performance degradation. Although the straightforward metrization and symmetrization is usually i…

    Submitted 8 October, 2019; originally announced October 2019.

  14. arXiv:1711.03066  [pdf, other]

    cs.IR

    A Simple Derivation of the Heap's Law from the Generalized Zipf's Law

    Authors: Leonid Boytsov

    Abstract: I reproduce a rather simple formal derivation of the Heaps' law from the generalized Zipf's law, which I previously published in Russian.

    Submitted 8 November, 2017; originally announced November 2017.

  15. Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Authors: Leonid Boytsov, David Novak, Yury Malkov, Eric Nyberg

    Abstract: Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-forc…

    Submitted 31 October, 2016; originally announced October 2016.

  16. arXiv:1508.05470  [pdf, ps, other]

    cs.MS cs.IR

    Non-Metric Space Library Manual

    Authors: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak

    Abstract: This document covers a library for fast similarity (k-NN) search. It describes only search methods and distances (spaces). Details about building, installing, and Python bindings can be found online: https://github.com/searchivarius/nmslib/tree/v1.8/. Even though the library contains a variety of exact metric-space access methods, our main focus is on more generic and approximate search methods, in part…

    Submitted 6 June, 2019; v1 submitted 22 August, 2015; originally announced August 2015.

    Comments: Methodology paper

  17. arXiv:1506.03163  [pdf, other]

    cs.LG cs.DB cs.DS

    Permutation Search Methods are Efficient, Yet Faster Search is Possible

    Authors: Bilegsaikhan Naidan, Leonid Boytsov, Eric Nyberg

    Abstract: We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original poi…

    Submitted 31 October, 2016; v1 submitted 10 June, 2015; originally announced June 2015.

  18. arXiv:1401.6399  [pdf, other]

    cs.IR cs.DB cs.PF

    SIMD Compression and the Intersection of Sorted Integers

    Authors: Daniel Lemire, Leonid Boytsov, Nathan Kurz

    Abstract: Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent proces…

    Submitted 20 April, 2020; v1 submitted 24 January, 2014; originally announced January 2014.

    Journal ref: Software: Practice and Experience Volume 46, Issue 6, pages 723-749, June 2016

  19. arXiv:1209.2137  [pdf, other]

    cs.IR cs.DB

    Decoding billions of integers per second through vectorization

    Authors: Daniel Lemire, Leonid Boytsov

    Abstract: In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar natu…

    Submitted 30 January, 2021; v1 submitted 10 September, 2012; originally announced September 2012.

    Comments: For software, see https://github.com/lemire/FastPFor; for data, see http://boytsov.info/datasets/clueweb09gap/

    Journal ref: Software: Practice & Experience 45 (1), 2015