-
On the Reproducibility of Experiments of Indexing Repetitive Document Collections
Authors:
Antonio Fariña,
Miguel A. Martínez-Prieto,
Francisco Claude,
Gonzalo Navarro,
Juan J. Lastra-Díaz,
Nicola Prezza,
Diego Seco
Abstract:
This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes, which exploit the fact that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, we also provide uiHRDC as a reproducibility package.
Submitted 26 December, 2019;
originally announced December 2019.
-
A Grammar-based Compressed Representation of 3D Trajectories
Authors:
Nieves R. Brisaboa,
Adrián Gómez-Brandón,
Miguel A. Martínez-Prieto,
José R. Paramá
Abstract:
Much research has been published about trajectory management on the ground or at sea, but the compression and indexing of flight trajectories have been less explored. However, air traffic management is a challenge because airspace is becoming more and more congested, and large flight data collections must be preserved and exploited for varied purposes. This paper proposes 3DGraCT, a new method for representing these flight trajectories. It extends the GraCT compact data structure to cope with a third dimension (altitude), while retaining its space/time complexities. 3DGraCT improves the space requirements of traditional spatio-temporal data structures by two orders of magnitude, while remaining competitive for the considered types of queries and even leading the comparison for one of them.
Submitted 28 December, 2018;
originally announced December 2018.
-
Universal Indexes for Highly Repetitive Document Collections
Authors:
Francisco Claude,
Antonio Fariña,
Miguel A. Martínez-Prieto,
Gonzalo Navarro
Abstract:
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space.
We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
Submitted 23 May, 2016; v1 submitted 29 April, 2016;
originally announced April 2016.
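As a rough illustration of the idea in the abstract above, the following Python sketch run-length compresses a differential (gap-encoded) inverted list. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `to_gaps`, `run_length`, and `decode` and the sample posting list are hypothetical, and the example assumes that a word surviving across consecutive near-copy versions produces long runs of equal gaps.

```python
from itertools import groupby


def to_gaps(postings):
    """Turn a sorted posting list into its differential (gap) form."""
    gaps, prev = [], 0
    for p in postings:
        gaps.append(p - prev)
        prev = p
    return gaps


def run_length(gaps):
    """Collapse each maximal run of equal gaps into a (gap, count) pair."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]


def decode(runs):
    """Rebuild the original posting list from the (gap, count) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the word occurs in documents 3-8 and 20-23,
# as might happen when it persists across consecutive versions.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = to_gaps(postings)
runs = run_length(gaps)
print(gaps)  # [3, 1, 1, 1, 1, 1, 12, 1, 1, 1]
print(runs)  # [(3, 1), (1, 5), (12, 1), (1, 3)]
```

Plain gap-encoding would store all ten gaps, whereas the run-length form stores four pairs; the longer the runs of consecutive versions, the larger the saving.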
-
Generalized Biwords for Bitext Compression and Translation Spotting
Authors:
Felipe Sánchez-Martínez,
Rafael C. Carrasco,
Miguel A. Martínez-Prieto,
Joaquin Adiego
Abstract:
Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be more efficiently compressed if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords ---pairs of parallel words with a high probability of co-occurrence--- that can be used as an intermediate representation in the compression process. However, the simple biword approach described in the literature can only exploit one-to-one word alignments and cannot tackle the reordering of words. We therefore introduce a generalization of biwords which can describe multi-word expressions and reorderings. We also describe some methods for the binary compression of generalized biword sequences, and compare their performance when different schemes are applied to the extraction of the biword sequence. In addition, we show that this generalization of biwords allows for the implementation of an efficient algorithm to search the compressed bitext for words or text segments in one of the texts and retrieve their counterpart translations in the other text ---an application usually referred to as translation spotting--- with only minor modifications to the compression algorithm.
Submitted 18 January, 2014;
originally announced January 2014.
-
Compressed Vertical Partitioning for Full-In-Memory RDF Management
Authors:
Sandra Álvarez-García,
Nieves R. Brisaboa,
Javier D. Fernández,
Miguel A. Martínez-Prieto,
Gonzalo Navarro
Abstract:
The Web of Data has been gaining momentum, which has led to the publication of ever more semi-structured datasets following the RDF model, based on atomic triple units of subject, predicate, and object. Although it is a simple model, compression methods become necessary because datasets keep growing and various scalability issues arise around their organization and storage. This requirement is more restrictive in RDF stores because efficient SPARQL resolution on the compressed RDF datasets is also required.
This article introduces a novel RDF indexing technique (called k2-triples) supporting efficient SPARQL resolution in compressed space. k2-triples uses the predicate to vertically partition the dataset into disjoint subsets of pairs (subject, object), one per predicate. These subsets are represented as binary matrices in which 1-bits mean that the corresponding triple exists in the dataset. This model results in very sparse matrices, which are efficiently compressed using k2-trees. We enhance this model with two compact indexes listing the predicates related to each different subject and object, in order to address the specific weaknesses of vertically partitioned representations. The resulting technique not only achieves by far the most compressed representations, but also the best overall performance for RDF retrieval in our experiments. Our approach uses up to 10 times less space than a state-of-the-art baseline, and outperforms it by several orders of magnitude on the most basic query patterns. In addition, we optimize traditional join algorithms on k2-triples and define a novel one leveraging its specific features. Our experimental results show that our technique outperforms traditional vertical partitioning for join resolution, reporting the best numbers for joins in which the non-joined nodes are provided, and being competitive in the majority of the cases.
Submitted 21 October, 2013; v1 submitted 18 October, 2013;
originally announced October 2013.
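The vertical partitioning described in the abstract above can be sketched in a few lines of Python. This is only an illustrative sketch, not the k2-triples implementation: each per-predicate binary matrix is modelled as a plain set of (subject, object) cells holding a 1-bit, the k2-tree compression step is omitted entirely, and the names `vertical_partition`, `predicates_by_subject`, and `resolve_subject` as well as the integer-ID triples are hypothetical.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, object) pairs by predicate.

    Each set stands in for one sparse binary matrix: a pair (s, o) in
    partitions[p] plays the role of a 1-bit at row s, column o.
    """
    partitions = defaultdict(set)
    for s, p, o in triples:
        partitions[p].add((s, o))
    return partitions


def predicates_by_subject(triples):
    """Auxiliary index listing the predicates related to each subject."""
    index = defaultdict(set)
    for s, p, _ in triples:
        index[s].add(p)
    return index


def resolve_subject(partitions, sp_index, s):
    """Answer the pattern (s, ?p, ?o), visiting only predicates used by s."""
    results = []
    for p in sorted(sp_index.get(s, ())):
        for subj, obj in sorted(partitions[p]):
            if subj == s:
                results.append((p, obj))
    return results


# Hypothetical triples, already mapped to integer IDs (subject, predicate, object).
triples = [(1, 1, 2), (1, 2, 3), (2, 1, 3), (3, 3, 1)]
partitions = vertical_partition(triples)
sp_index = predicates_by_subject(triples)
print(resolve_subject(partitions, sp_index, 1))  # [(1, 2), (2, 3)]
```

Without the subject-to-predicates index, a pattern with an unbound predicate would have to probe every partition; the index restricts the search to the partitions that can actually contain the subject, which is the weakness of vertical partitioning that the abstract says the two extra indexes address.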
-
Compressed k2-Triples for Full-In-Memory RDF Engines
Authors:
Sandra Álvarez-García,
Nieves R. Brisaboa,
Javier D. Fernández,
Miguel A. Martínez-Prieto
Abstract:
The current "data deluge" has flooded the Web of Data with very large RDF datasets. They are hosted and queried through SPARQL endpoints which act as nodes of a semantic net built on the principles of the Linked Data project. Although this is a realistic philosophy for global data publishing, its query performance is diminished when the RDF engines (behind the endpoints) manage these huge datasets. Their indexes cannot be fully loaded in main memory, hence these systems need to perform slow disk accesses to solve SPARQL queries. This paper addresses this problem with a compact indexed RDF structure (called k2-triples) that applies compact k2-tree structures to the well-known vertical-partitioning technique. It obtains an ultra-compressed representation of large RDF graphs and allows SPARQL queries to be performed fully in memory without decompression. We show that k2-triples clearly outperforms the state of the art in compressibility and traditional vertical partitioning in query resolution, while remaining very competitive with multi-index solutions.
Submitted 19 May, 2011;
originally announced May 2011.
-
An Empirical Study of Real-World SPARQL Queries
Authors:
Mario Arias,
Javier D. Fernández,
Miguel A. Martínez-Prieto,
Pablo de la Fuente
Abstract:
Understanding how users tailor their SPARQL queries is crucial when designing query evaluation engines or fine-tuning RDF stores with performance in mind. In this paper we analyze 3 million real-world SPARQL queries extracted from logs of the DBPedia and SWDF public endpoints. We aim to find which language elements are the most used, from both syntactical and structural perspectives, paying special attention to triple patterns and joins, since they are some of the most expensive SPARQL operations in the evaluation phase. We have determined that most of the queries are simple and include few triple patterns and joins, with Subject-Subject, Subject-Object, and Object-Object being the most common join types. The graph patterns are usually star-shaped and, although triple pattern chains exist, they are generally short.
Submitted 25 March, 2011;
originally announced March 2011.
-
Compressed String Dictionaries
Authors:
Nieves R. Brisaboa,
Rodrigo Cánovas,
Miguel A. Martínez-Prieto,
Gonzalo Navarro
Abstract:
The problem of storing a set of strings --- a string dictionary --- in compact form appears naturally in many cases. While classically it has represented a small part of the whole data to be processed (e.g., for Natural Language processing or for indexing text collections), more recent applications in Web engines, Web mining, RDF graphs, Internet routing, Bioinformatics, and many others, make use of very large string dictionaries, whose size is a significant fraction of the whole data. Thus novel approaches to compress them efficiently are necessary. In this paper we experimentally compare the time and space performance of some existing alternatives, as well as new ones we propose. We show that space reductions of up to 20% of the original size of the strings are possible while supporting fast dictionary searches.
Submitted 28 January, 2011;
originally announced January 2011.