Showing 1–19 of 19 results for author: Ferragina, P

  1. arXiv:2407.00734   

    cs.DS

    Balanced Learned Sort: a new learned model for fast and balanced item bucketing

    Authors: Paolo Ferragina, Mattia Odorisio

    Abstract: This paper aims to better understand the strengths and limitations of adopting learned-based approaches in the sequential sorting of numerical data, via two main research steps. First, we study different learned models for distribution-based sorting, starting from some known ones (i.e., two-layer RMI or simple linear models) and then introducing some novel models that either improve the two-layer RMI o…

    Submitted 2 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

    Comments: We need to make the experiments more robust

  2. arXiv:2311.15380  [pdf, other]

    cs.DS cs.DB

    Grafite: Taming Adversarial Queries with Optimal Range Filters

    Authors: Marco Costa, Paolo Ferragina, Giorgio Vinciguerra

    Abstract: Range filters allow checking whether a query range intersects a given set of keys with a chance of returning a false positive answer, thus generalising the functionality of Bloom filters from point to range queries. Existing practical range filters have addressed this problem heuristically, resulting in high false positive rates and query times when dealing with adversarial inputs, such as in the…

    Submitted 19 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

    Comments: Accepted for publication in Proceedings of the ACM on Management of Data (SIGMOD 2024)

    Journal ref: Proceedings of the ACM on Management of Data, Volume 2, Issue 1 (2024), Article No. 3, pp 1-23

  3. arXiv:2310.18419  [pdf, ps, other]

    cs.IT cs.DS physics.data-an

    On nonlinear compression costs: when Shannon meets Rényi

    Authors: Andrea Somazzi, Paolo Ferragina, Diego Garlaschelli

    Abstract: Shannon entropy is the shortest average codeword length a lossless compressor can achieve by encoding i.i.d. symbols. However, there are cases in which the objective is to minimize the \textit{exponential} average codeword length, i.e. when the cost of encoding/decoding scales exponentially with the length of codewords. The optimum is reached by all strategies that map each symbol $x_i$ generated…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: 22 pages, 9 figures

    MSC Class: 68P30; 94A29; 94A17 ACM Class: E.4; H.1.1

    Journal ref: IEEE Access, vol. 12, pp. 77750-77763, 2024

  4. arXiv:2304.11012  [pdf, other]

    cs.DS

    Learned Monotone Minimal Perfect Hashing

    Authors: Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, Giorgio Vinciguerra

    Abstract: A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of…

    Submitted 30 August, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  5. arXiv:2203.14540  [pdf, other]

    cs.DS

    Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

    Authors: Paolo Ferragina, Travis Gagie, Dominik Köppl, Giovanni Manzini, Gonzalo Navarro, Manuel Striani, Francesco Tosoni

    Abstract: As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments sho…

    Submitted 30 March, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

  6. arXiv:2105.04293  [pdf, other]

    cs.HC cs.AI cs.IR

    An interactive dashboard for searching and comparing soccer performance scores

    Authors: Paolo Cintia, Giovanni Mauro, Luca Pappalardo, Paolo Ferragina

    Abstract: The performance of soccer players is one of the most discussed aspects by many actors in the soccer industry: from supporters to journalists, from coaches to talent scouts. Unfortunately, the dashboards available online provide no effective way to compare the evolution of the performance of players or to find players behaving similarly on the field. This paper describes the design of a web dashboard t…

    Submitted 11 May, 2021; v1 submitted 16 April, 2021; originally announced May 2021.

    Comments: 4 pages, 6 figures

  7. arXiv:2004.05222  [pdf]

    cs.CY cs.SI

    Give more data, awareness and control to individual citizens, and they will help COVID-19 containment

    Authors: Mirco Nanni, Gennady Andrienko, Albert-László Barabási, Chiara Boldrini, Francesco Bonchi, Ciro Cattuto, Francesca Chiaromonte, Giovanni Comandé, Marco Conti, Mark Coté, Frank Dignum, Virginia Dignum, Josep Domingo-Ferrer, Paolo Ferragina, Fosca Giannotti, Riccardo Guidotti, Dirk Helbing, Kimmo Kaski, Janos Kertesz, Sune Lehmann, Bruno Lepri, Paul Lukowicz, Stan Matwin, David Megías Jiménez, Anna Monreale, et al. (14 additional authors not shown)

    Abstract: The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in phase 2 of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being proposed for large scale adoption by many countri…

    Submitted 16 April, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

    Comments: Revised text. Additional authors

    Journal ref: Transactions on Data Privacy 13(1): 61-66 (2020), http://www.tdp.cat/issues16/abs.a389a20.php

  8. arXiv:1910.06169  [pdf, ps, other]

    cs.DS cs.DB cs.IR cs.LG

    The PGM-index: a multicriteria, compressed and learned approach to data indexing

    Authors: Paolo Ferragina, Giorgio Vinciguerra

    Abstract: The recent introduction of learned indexes has shaken the foundations of the decades-old field of indexing data structures. Combining, or even replacing, classic design elements such as B-tree nodes with machine learning models has proven to give outstanding improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, thus they l…

    Submitted 14 October, 2019; originally announced October 2019.

    Comments: We remark to the reader that this paper is an extended and improved version of our previous paper titled "Superseding traditional indexes by orchestrating learning and geometry" (arXiv:1903.00507)

    ACM Class: E.1; E.4; I.2.6

    Journal ref: PVLDB, 13(8): 1162-1175, 2020

  9. arXiv:1903.00507  [pdf]

    cs.DS

    Superseding traditional indexes by orchestrating learning and geometry

    Authors: Giorgio Vinciguerra, Paolo Ferragina, Michele Miccinesi

    Abstract: We design the first learned index that solves the dictionary problem with time and space complexity provably better than classic data structures for hierarchical memories, such as B-trees, and modern learned indexes. We call our solution the Piecewise Geometric Model index (PGM-index) because it turns the indexing of a sequence of keys into the coverage of a sequence of 2D-points via linear models…

    Submitted 9 March, 2019; v1 submitted 1 March, 2019; originally announced March 2019.

    ACM Class: E.1; E.4; I.2.6

  10. WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking

    Authors: Paolo Cifariello, Paolo Ferragina, Marco Ponza

    Abstract: We present WISER, a new semantic search engine for expert finding in academia. Our system is unsupervised and it jointly combines classical language modeling techniques, based on textual evidence, with the Wikipedia Knowledge Graph, via entity linking. WISER indexes each academic author through a novel profiling technique which models her expertise with a small, labeled and weighted graph drawn fr…

    Submitted 10 June, 2019; v1 submitted 10 May, 2018; originally announced May 2018.

    Journal ref: Information Systems, Elsevier (2019)

  11. arXiv:1804.03580  [pdf, other]

    cs.IR

    SWAT: A System for Detecting Salient Wikipedia Entities in Texts

    Authors: Marco Ponza, Paolo Ferragina, Francesco Piccinno

    Abstract: We study the problem of entity salience by proposing the design and implementation of SWAT, a system that identifies the salient Wikipedia entities occurring in an input document. SWAT consists of several modules that are able to detect and classify on-the-fly Wikipedia entities as salient or not, based on a large number of syntactic, semantic and latent features properly extracted via a supervise…

    Submitted 16 May, 2019; v1 submitted 10 April, 2018; originally announced April 2018.

    Journal ref: Computational Intelligence, Wiley-Blackwell Publishing (2019)

  12. arXiv:1802.04987  [pdf, other]

    stat.AP cs.AI

    PlayeRank: data-driven performance evaluation and player ranking in soccer via a machine learning approach

    Authors: Luca Pappalardo, Paolo Cintia, Paolo Ferragina, Emanuele Massucco, Dino Pedreschi, Fosca Giannotti

    Abstract: The problem of evaluating the performance of soccer players is attracting the interest of many companies and the scientific community, thanks to the availability of massive data capturing all the events generated during a match (e.g., tackles, passes, shots, etc.). Unfortunately, there is no consolidated and widely accepted metric for measuring performance quality in all of its facets. In this pap…

    Submitted 25 January, 2019; v1 submitted 14 February, 2018; originally announced February 2018.

    Journal ref: PlayeRank: Data-driven Performance Evaluation and Player Ranking in Soccer via a Machine Learning Approach. ACM Trans. Intell. Syst. Technol. 10, 5, Article 59 (September 2019), 27 pages

  13. arXiv:1307.3872  [pdf, other]

    cs.IT cs.DS

    Bicriteria data compression

    Authors: Andrea Farruggia, Paolo Ferragina, Antonio Frangioni, Rossano Venturini

    Abstract: The advent of massive datasets (and the consequent design of high-performing distributed storage systems) has reignited the interest of the scientific and engineering community towards the design of lossless data compressors which achieve effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompres…

    Submitted 15 July, 2013; originally announced July 2013.

  14. arXiv:1006.3498  [pdf, other]

    cs.IR

    Fast and accurate annotation of short texts with Wikipedia pages

    Authors: Paolo Ferragina, Ugo Scaiella

    Abstract: We address the problem of cross-referencing text fragments with Wikipedia pages, in a way that synonymy and polysemy issues are resolved accurately and efficiently. We take inspiration from a recent flow of work [Cucerzan 2007, Mihalcea and Csomai 2007, Milne and Witten 2008, Chakrabarti et al 2009], and extend their scenario from the annotation of long documents to the annotation of short texts,…

    Submitted 28 July, 2010; v1 submitted 17 June, 2010; originally announced June 2010.

  15. arXiv:0909.4341  [pdf, ps, other]

    cs.DS

    Lightweight Data Indexing and Compression in External Memory

    Authors: Paolo Ferragina, Travis Gagie, Giovanni Manzini

    Abstract: In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size $n$, they use only $n$ bits of disk working space while all previous approaches use $\Theta(n \log n)$ bits of disk working space. Moreover, our algorithms access disk data…

    Submitted 24 September, 2009; originally announced September 2009.

  16. arXiv:0906.4692  [pdf, ps, other]

    cs.DS cs.IT

    On optimally partitioning a text to improve its compression

    Authors: Paolo Ferragina, Igor Nitto, Rossano Venturini

    Abstract: In this paper we investigate the problem of partitioning an input string T in such a way that individually compressing its parts via a base-compressor C yields a compressed output that is shorter than applying C over the entire T at once. This problem was introduced in the context of table compression, and then further elaborated and extended to strings and trees. Unfortunately, the literature off…

    Submitted 25 June, 2009; originally announced June 2009.

  17. arXiv:0802.0835  [pdf, ps, other]

    cs.DS cs.IT

    Bit-Optimal Lempel-Ziv compression

    Authors: Paolo Ferragina, Igor Nitto, Rossano Venturini

    Abstract: One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 40 years ago. This compression scheme is known as "dictionary-based compression" and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords which are actually pointers to a dictionary of phrases built as the string is processed. Surpri…

    Submitted 6 February, 2008; originally announced February 2008.

  18. arXiv:0801.2378  [pdf, ps, other]

    cs.DS cs.IR

    String algorithms and data structures

    Authors: Paolo Ferragina

    Abstract: The string-matching field has grown to such a complicated stage that various issues come into play when studying it: data structure and algorithmic design, database principles, compression techniques, architectural features, cache and prefetching policies. The expertise nowadays required to design good string data structures and algorithms is therefore transversal to many computer science fields…

    Submitted 15 January, 2008; originally announced January 2008.

  19. arXiv:0712.3360  [pdf, ps, other]

    cs.DS

    Compressed Text Indexes: From Theory to Practice!

    Authors: Paolo Ferragina, Rodrigo Gonzalez, Gonzalo Navarro, Rossano Venturini

    Abstract: A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical…

    Submitted 20 December, 2007; originally announced December 2007.

    ACM Class: F.2.2; H.2.1; H.3.2; H.3.3