Search | arXiv e-print repository

Evaluating Regular Path Queries on Compressed Adjacency Matrices

Authors: Diego Arroyuelo, Adrián Gómez-Brandón, Gonzalo Navarro

Abstract: Regular Path Queries (RPQs), which are essentially regular expressions to be matched against the labels of paths in labeled graphs, are at the core of graph database query languages like SPARQL. A way to solve RPQs is to translate them into a sequence of operations on the adjacency matrices of each label. We design and implement a Boolean algebra on sparse matrix representations and, as an application, use them to handle RPQs. Our baseline representation uses the same space as the previously most compact index for RPQs and outperforms it on the hardest types of queries -- those where both RPQ endpoints are unspecified. Our more succinct structure, based on $k^2$-trees, is 4 times smaller than any existing representation that handles RPQs, and still solves complex RPQs in a few seconds. Our new sparse-matrix-based representations dominate a good portion of the space/time tradeoff map, being outperformed only by representations that use much more space. They are also of independent interest beyond solving RPQs.

Submitted 23 April, 2024; v1 submitted 27 July, 2023; originally announced July 2023.
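
The idea in the abstract above, translating an RPQ into operations on per-label adjacency matrices, can be illustrated with a minimal sketch. It assumes each sparse Boolean matrix is stored as a plain Python set of (row, column) pairs; the function names and the toy graph are hypothetical and are not the compressed representations evaluated in the paper.

```python
# Sketch only: a sparse Boolean matrix is a set of (row, col) pairs.
# All names here are illustrative assumptions, not the paper's API.

def union(A, B):
    """Boolean sum: paths matching either sub-expression (alternation)."""
    return A | B

def product(A, B):
    """Boolean product: a path in A followed by a path in B (concatenation)."""
    rows_of_B = {}
    for i, j in B:
        rows_of_B.setdefault(i, set()).add(j)
    return {(i, k) for i, j in A for k in rows_of_B.get(j, ())}

def closure(A):
    """Transitive closure A+, computed by expanding a frontier until no new pairs appear."""
    result = set(A)
    frontier = set(A)
    while frontier:
        frontier = product(frontier, A) - result
        result |= frontier
    return result

def star(A, n):
    """Reflexive-transitive closure A* over nodes 0..n-1 (Kleene star)."""
    return closure(A) | {(i, i) for i in range(n)}

# Toy labeled graph on 4 nodes: one matrix per edge label.
label_a = {(0, 1)}
label_b = {(1, 2), (2, 3)}

# The RPQ a/b* with both endpoints unspecified: all (source, target) pairs.
answer = product(label_a, star(label_b, 4))
print(sorted(answer))
```

Each regular-expression operator maps to one matrix operation, so a query is evaluated bottom-up over its syntax tree.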

arXiv:2305.14461 [pdf, other]

Engineering Rank/Select Data Structures for Large-Alphabet Strings

Authors: Diego Arroyuelo, Gabriel Carmona, Héctor Larrañaga, Francisco Riveros, Carlos Eugenio Rojas-Morales, Erick Sepúlveda

Abstract: Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduce several challenges that are not witnessed in small-alphabet strings. This paper studies the efficient implementation of one of the most effective approaches for dealing with large-alphabet strings, namely the \emph{alphabet-partitioning} approach. The main contribution is a compressed data structure that supports the fundamental operations $rank$ and $select$ efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach. In particular, the time for operation $select$ can be improved by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), the representation of run-length compressed strings, and the distributed-computation processing of $rank$ and $select$ operations. In the particular case of run-length compressed strings, our experiments on the Burrows-Wheeler transform of highly-repetitive texts indicate that by using only about 0.98--1.09 times the space of state-of-the-art RLFM-indexes (depending on the text), the process of counting the number of occurrences of a pattern in a text can be carried out 1.23--2.33 times faster.

Submitted 1 May, 2024; v1 submitted 23 May, 2023; originally announced May 2023.
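
For readers unfamiliar with the $rank$ and $select$ operations named in the abstract above, the following sketch shows their semantics on a sequence over an arbitrary alphabet. It is an uncompressed reference implementation built on per-symbol position lists and is not the alphabet-partitioning structure studied in the paper; the class name and the indexing conventions (0-based positions, 1-based occurrence counts) are assumptions made for illustration.

```python
from bisect import bisect_left

class NaiveRankSelect:
    """Uncompressed reference for rank/select; illustrative only."""

    def __init__(self, seq):
        # For each symbol, keep the sorted list of positions where it occurs.
        self.positions = {}
        for i, symbol in enumerate(seq):
            self.positions.setdefault(symbol, []).append(i)

    def rank(self, symbol, i):
        """Number of occurrences of symbol in seq[0:i]."""
        return bisect_left(self.positions.get(symbol, []), i)

    def select(self, symbol, j):
        """Position of the j-th occurrence of symbol (j starts at 1)."""
        occurrences = self.positions.get(symbol, [])
        if not 1 <= j <= len(occurrences):
            raise IndexError("symbol has no such occurrence")
        return occurrences[j - 1]

rs = NaiveRankSelect([5, 9, 5, 7, 5])
print(rs.rank(5, 3), rs.select(5, 3))
```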

arXiv:2212.00946 [pdf, other]

Trie-Compressed Intersectable Sets

Authors: Diego Arroyuelo, Juan Pablo Castillo

Abstract: We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set $S \subseteq [0{..}u)$ of $n$ elements can be represented using compressed space while supporting $k$-way intersections in adaptive $O(kδ\lg{\!(u/δ)})$ time, $δ$ being the alternation measure introduced by Barbay and Kenyon. Our experimental results suggest that our approaches are competitive in practice, outperforming the most efficient alternatives (Partitioned Elias-Fano indexes, Roaring Bitmaps, and Recursive Universe Partitioning (RUP)) in several scenarios, offering in general relevant space-time trade-offs.

Submitted 1 December, 2022; originally announced December 2022.
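
As a rough illustration of intersecting sets stored as tries, the sketch below builds an uncompressed binary trie per set (one level per bit of the element) and intersects $k$ tries by descending only into branches present in all of them. This is an assumption-laden toy, using nested dictionaries and hypothetical function names; it does not reproduce the compressed representation or the adaptive time bound claimed in the abstract above.

```python
def build_trie(values, bits):
    """Binary trie as nested dicts; each value contributes a root-to-leaf path of `bits` edges."""
    root = {}
    for v in values:
        node = root
        for b in range(bits - 1, -1, -1):
            node = node.setdefault((v >> b) & 1, {})
    return root

def intersect(tries, bits, prefix=0):
    """Return, in increasing order, the values present in every trie."""
    if bits == 0:
        return [prefix]
    out = []
    for bit in (0, 1):
        children = [t.get(bit) for t in tries]
        # Descend only if this branch exists in all tries; otherwise the
        # whole subtree is skipped without inspecting its elements.
        if all(child is not None for child in children):
            out.extend(intersect(children, bits - 1, (prefix << 1) | bit))
    return out

sets = [{1, 3, 5, 7}, {3, 4, 5}, {0, 3, 5, 6}]
tries = [build_trie(s, 3) for s in sets]
print(intersect(tries, 3))
```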

arXiv:2111.04556 [pdf, other]

Time- and Space-Efficient Regular Path Queries on Graphs

Authors: Diego Arroyuelo, Aidan Hogan, Gonzalo Navarro, Javiel Rojas-Ledesma

Abstract: We introduce a time- and space-efficient technique to solve regular path queries over labeled graphs. We combine a bit-parallel simulation of the Glushkov automaton of the regular expression with the ring index introduced by Arroyuelo et al., exploiting its wavelet tree representation of the triples in order to efficiently reach the states of the product graph that are relevant for the query. Our query algorithm is able to simultaneously process several automaton states, as well as several graph nodes/labels. Our experimental results show that our representation uses 3-5 times less space than the alternatives in the literature, while generally outperforming them in query times (1.67 times faster than the next best).

Submitted 8 November, 2021; originally announced November 2021.

arXiv:2111.01540 [pdf, other]

MillenniumDB: A Persistent, Open-Source, Graph Database

Authors: Domagoj Vrgoc, Carlos Rojas, Renzo Angles, Marcelo Arenas, Diego Arroyuelo, Carlos Buil Aranda, Aidan Hogan, Gonzalo Navarro, Cristian Riveros, Juan Romero

Abstract: In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.

Submitted 2 November, 2021; originally announced November 2021.

arXiv:1911.08971 [pdf, other]

doi 10.1007/978-3-030-32686-9_30

Faster Dynamic Compressed d-ary Relations

Authors: Diego Arroyuelo, Guillermo de Bernardo, Travis Gagie, Gonzalo Navarro

Abstract: The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.

Submitted 20 November, 2019; originally announced November 2019.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

Journal ref: Proc. SPIRE 2019
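
The link between a $k^2$-tree and a trie on Morton codes, mentioned in the abstract above, can be seen in a small sketch for the case $k = 2$, $d = 2$. The Morton code of a point interleaves the bits of its row and column; reading the code two bits at a time from the most significant end gives the child index chosen at each level of the tree. The function names and the nested-dictionary trie are hypothetical and only illustrate the correspondence, not the dynamic trie of the paper.

```python
def morton_code(row, col, levels):
    """Interleave the `levels` low bits of row and col, row bit first."""
    code = 0
    for level in range(levels - 1, -1, -1):
        r = (row >> level) & 1
        c = (col >> level) & 1
        code = (code << 2) | (r << 1) | c
    return code

def trie_path(row, col, levels):
    """Child index (0..3) taken at each level, from the root down."""
    code = morton_code(row, col, levels)
    return [(code >> (2 * level)) & 3 for level in range(levels - 1, -1, -1)]

def insert(trie, row, col, levels):
    """Insert a point into a nested-dict trie keyed by child index."""
    node = trie
    for child in trie_path(row, col, levels):
        node = node.setdefault(child, {})

def contains(trie, row, col, levels):
    """Check whether a point was inserted by following its path."""
    node = trie
    for child in trie_path(row, col, levels):
        if child not in node:
            return False
        node = node[child]
    return True

# A 4x4 matrix needs 2 levels when k = 2.
trie = {}
insert(trie, 2, 3, 2)
print(morton_code(2, 3, 2), trie_path(2, 3, 2), contains(trie, 2, 3, 2))
```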

Showing 1–6 of 6 results for author: Arroyuelo, D