-
Evaluating Regular Path Queries on Compressed Adjacency Matrices
Authors:
Diego Arroyuelo,
Adrián Gómez-Brandón,
Gonzalo Navarro
Abstract:
Regular Path Queries (RPQs), which are essentially regular expressions to be matched against the labels of paths in labeled graphs, are at the core of graph database query languages like SPARQL. A way to solve RPQs is to translate them into a sequence of operations on the adjacency matrices of each label. We design and implement a Boolean algebra on sparse matrix representations and, as an applica…
▽ More
Regular Path Queries (RPQs), which are essentially regular expressions to be matched against the labels of paths in labeled graphs, are at the core of graph database query languages like SPARQL. A way to solve RPQs is to translate them into a sequence of operations on the adjacency matrices of each label. We design and implement a Boolean algebra on sparse matrix representations and, as an application, use them to handle RPQs. Our baseline representation uses the same space as the previously most compact index for RPQs and outperforms it on the hardest types of queries -- those where both RPQ endpoints are unspecified. Our more succinct structure, based on $k^2$-trees, is 4 times smaller than any existing representation that handles RPQs, and still solves complex RPQs in a few seconds. Our new sparse-matrix-based representations dominate a good portion of the space/time tradeoff map, being outperformed only by representations that use much more space. They are also of independent interest beyond solving RPQs.
△ Less
Submitted 23 April, 2024; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Engineering Rank/Select Data Structures for Large-Alphabet Strings
Authors:
Diego Arroyuelo,
Gabriel Carmona,
Héctor Larrañaga,
Francisco Riveros,
Carlos Eugenio Rojas-Morales,
Erick Sepúlveda
Abstract:
Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduces several challenges that are not witnessed in small-alphabets strings. This paper studies the efficient implementation of one of the most effective approaches for dealing with large-alphabet strings, namely the \emph{al…
▽ More
Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduces several challenges that are not witnessed in small-alphabets strings. This paper studies the efficient implementation of one of the most effective approaches for dealing with large-alphabet strings, namely the \emph{alphabet-partitioning} approach. The main contribution is a compressed data structure that supports the fundamental operations $rank$ and $select$ efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach. In particular, the time for operation $select$ can be improved by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), the representation of run-length compressed strings, and the distributed-computation processing of $rank$ and $select$ operations. In the particular case of run-length compressed strings, our experiments on the Burrows-Wheeler transform of highly-repetitive texts indicate that by using only about 0.98--1.09 times the space of state-of-the-art RLFM-indexes (depending on the text), the process of counting the number of occurrences of a pattern in a text can be carried out 1.23--2.33 times faster.
△ Less
Submitted 1 May, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Trie-Compressed Intersectable Sets
Authors:
Diego Arroyuelo,
Juan Pablo Castillo
Abstract:
We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set $S \subseteq [0{..}u)$ of $n$ elements can be represented using compressed space while supporting $k$-way intersections in adaptive $O(kδ\lg{\!(u/δ)})$ time, $δ$ being the alternation measure introduced by Barbay and Kenyon. Our experimental results sugg…
▽ More
We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set $S \subseteq [0{..}u)$ of $n$ elements can be represented using compressed space while supporting $k$-way intersections in adaptive $O(kδ\lg{\!(u/δ)})$ time, $δ$ being the alternation measure introduced by Barbay and Kenyon. Our experimental results suggest that our approaches are competitive in practice, outperforming the most efficient alternatives (Partitioned Elias-Fano indexes, Roaring Bitmaps, and Recursive Universe Partitioning (RUP)) in several scenarios, offering in general relevant space-time trade-offs.
△ Less
Submitted 1 December, 2022;
originally announced December 2022.
-
Time- and Space-Efficient Regular Path Queries on Graphs
Authors:
Diego Arroyuelo,
Aidan Hogan,
Gonzalo Navarro,
Javiel Rojas-Ledesma
Abstract:
We introduce a time- and space-efficient technique to solve regularpath queries over labeled graphs. We combine a bit-parallel simula-tion of the Glushkov automaton of the regular expression with thering index introduced by Arroyuelo et al., exploiting its wavelettree representation of the triples in order to efficiently reach thestates of the product graph that are relevant for the query. Ourquer…
▽ More
We introduce a time- and space-efficient technique to solve regularpath queries over labeled graphs. We combine a bit-parallel simula-tion of the Glushkov automaton of the regular expression with thering index introduced by Arroyuelo et al., exploiting its wavelettree representation of the triples in order to efficiently reach thestates of the product graph that are relevant for the query. Ourquery algorithm is able to simultaneously process several automa-ton states, as well as several graph nodes/labels. Our experimentalresults show that our representation uses 3-5 times less space thanthe alternatives in the literature, while generally outperformingthem in query times (1.67 times faster than the next best).
△ Less
Submitted 8 November, 2021;
originally announced November 2021.
-
MillenniumDB: A Persistent, Open-Source, Graph Database
Authors:
Domagoj Vrgoc,
Carlos Rojas,
Renzo Angles,
Marcelo Arenas,
Diego Arroyuelo,
Carlos Buil Aranda,
Aidan Hogan,
Gonzalo Navarro,
Cristian Riveros,
Juan Romero
Abstract:
In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data manage…
▽ More
In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.
△ Less
Submitted 2 November, 2021;
originally announced November 2021.
-
Faster Dynamic Compressed d-ary Relations
Authors:
Diego Arroyuelo,
Guillermo de Bernardo,
Travis Gagie,
Gonzalo Navarro
Abstract:
The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector…
▽ More
The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
△ Less
Submitted 20 November, 2019;
originally announced November 2019.