Search | arXiv e-print repository

Path-based Algebraic Foundations of Graph Query Languages

Authors: Renzo Angles, Angela Bonifati, Roberto García, Domagoj Vrgoč

Abstract: Graph databases are gaining momentum thanks to the flexibility and expressiveness of their data model and query languages. A standardization activity driven by the ISO/IEC standardization body is also ongoing and has already conducted to the specification of the first versions of two standard graph query languages, namely SQL/PGQ and GQL, respectively in 2023 and 2024. Apart from the standards, th… ▽ More Graph databases are gaining momentum thanks to the flexibility and expressiveness of their data model and query languages. A standardization activity driven by the ISO/IEC standardization body is also ongoing and has already conducted to the specification of the first versions of two standard graph query languages, namely SQL/PGQ and GQL, respectively in 2023 and 2024. Apart from the standards, there exists a panoply of concrete graph query languages in commercial and open-source graph databases, each of which exhibits different features and modes. In this paper, we tackle the heterogeneity problem of graph query languages by laying the foundations of a unifying path-oriented algebraic framework. Such a theoretical framework is currently missing in the graph databases landscape, thus impeding a lingua franca in which different graph query language implementations can be expressed and cross-compared. Our framework gives a blueprint for correct implementation of graph queries of different expressiveness. It allows to overcome the boundaries of current versions of standard query languages, thus paving the way to future extensions including query composability. It also allows, when the path-based semantics is stripped off, to express classical Codd's relational algebra enhanced with a recursive operator, thus proving its utility for a wide range of queries in database management systems. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: Under review

arXiv:2306.02194 [pdf, other]

PathFinder: A unified approach for handling paths in graph query languages

Authors: Benjamín Farías, Wim Martens, Carlos Rojas, Domagoj Vrgoč

Abstract: Path queries are a core feature of modern graph query languages such as Cypher, SQL/PGQ, and GQL. These languages provide a rich set of features for matching paths, such as restricting to certain path modes (shortest, simple, trail) and constraining the edge labels along the path by a regular expression. In this paper we present PathFinder, a unifying approach for dealing with path queries in all… ▽ More Path queries are a core feature of modern graph query languages such as Cypher, SQL/PGQ, and GQL. These languages provide a rich set of features for matching paths, such as restricting to certain path modes (shortest, simple, trail) and constraining the edge labels along the path by a regular expression. In this paper we present PathFinder, a unifying approach for dealing with path queries in all these query languages. PathFinder leverages a compact representation of the (potentially exponential number of) paths that can match a given query, extends it with pipelined execution, and supports all commonly used path modes. In the paper we describe the algorithmic backbone of PathFinder, provide a reference implementation, and test it over a large set of real-world queries and datasets. Our results show that PathFinder exhibits very stable behavior, even on large data and complex queries, and its performance is an order of magnitude better than that of many modern graph engines. △ Less

Submitted 13 March, 2024; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2211.10962 [pdf, ps, other]

doi 10.1145/3589778

PG-Schema: Schemas for Property Graphs

Authors: Renzo Angles, Angela Bonifati, Stefania Dumbrava, George Fletcher, Alastair Green, Jan Hidders, Bei Li, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Stefan Plantikow, Ognjen Savković, Michael Schmidt, Juan Sequeda, Sławek Staworko, Dominik Tomaszuk, Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, Dušan Živković

Abstract: Property graphs have reached a high level of maturity, witnessed by multiple robust graph database systems as well as the ongoing ISO standardization effort aiming at creating a new standard Graph Query Language (GQL). Yet, despite documented demand, schema support is limited both in existing systems and in the first version of the GQL Standard. It is anticipated that the second version of the GQL… ▽ More Property graphs have reached a high level of maturity, witnessed by multiple robust graph database systems as well as the ongoing ISO standardization effort aiming at creating a new standard Graph Query Language (GQL). Yet, despite documented demand, schema support is limited both in existing systems and in the first version of the GQL Standard. It is anticipated that the second version of the GQL Standard will include a rich DDL. Aiming to inspire the development of GQL and enhance the capabilities of graph database systems, we propose PG-Schema, a simple yet powerful formalism for specifying property graph schemas. It features PG-Types with flexible type definitions supporting multi-inheritance, as well as expressive constraints based on the recently proposed PG-Keys formalism. We provide the formal syntax and semantics of PG-Schema, which meet principled design requirements grounded in contemporary property graph management scenarios, and offer a detailed comparison of its features with those of existing schema languages and graph database systems. △ Less

Submitted 8 July, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

Comments: 26 pages

Journal ref: Proc. ACM Manag. Data (2023)

arXiv:2210.16580 [pdf, ps, other]

GPC: A Pattern Calculus for Property Graphs

Authors: Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, Domagoj Vrgoč

Abstract: The development of practical query languages for graph databases runs well ahead of the underlying theory. The ISO committee in charge of database query languages is currently develo** a new standard called Graph Query Language (GQL) as well as an extension of the SQL Standard for querying property graphs represented by a relational schema, called SQL/PGQ. The main component of both is the patte… ▽ More The development of practical query languages for graph databases runs well ahead of the underlying theory. The ISO committee in charge of database query languages is currently develo** a new standard called Graph Query Language (GQL) as well as an extension of the SQL Standard for querying property graphs represented by a relational schema, called SQL/PGQ. The main component of both is the pattern matching facility, which is shared by the two standards. In many aspects, it goes well beyond RPQs, CRPQs, and similar queries on which the research community has focused for years. Our main contribution is to distill the lengthy standard specification into a simple Graph Pattern Calculus (GPC) that reflects all the key pattern matching features of GQL and SQL/PGQ, and at the same time lends itself to rigorous theoretical investigation. We describe the syntax and semantics of GPC, along with the ty** rules that ensure its expressions are well-defined, and state some basic properties of the language. With this paper we provide the community a tool to embark on a study of query languages that will soon be widely adopted by industry. △ Less

Submitted 29 October, 2022; originally announced October 2022.

arXiv:2207.13541 [pdf, ps, other]

Representing Paths in Graph Database Pattern Matching

Authors: Wim Martens, Matthias Niewerth, Tina Popp, Stijn Vansummeren, Domagoj Vrgoc

Abstract: Modern graph database query languages such as GQL, SQL/PGQ, and their academic predecessor G-Core promote paths to first-class citizens in the sense that paths that match regular path queries can be returned to the user. This brings a number of challenges in terms of efficiency, caused by the fact that graphs can have a huge amount of paths between a given node pair. We introduce the concept of… ▽ More Modern graph database query languages such as GQL, SQL/PGQ, and their academic predecessor G-Core promote paths to first-class citizens in the sense that paths that match regular path queries can be returned to the user. This brings a number of challenges in terms of efficiency, caused by the fact that graphs can have a huge amount of paths between a given node pair. We introduce the concept of path multiset representations (PMRs), which can represent multisets of paths in an exponentially succinct manner. After exploring fundamental problems such as minimization and equivalence testing of PMRs, we explore how their use can lead to significant time and space savings when executing query plans. We show that, from a computational complexity point of view, PMRs seem especially well-suited for representing results of regular path queries and extensions thereof involving counting, random sampling, unions, and joins. △ Less

Submitted 27 July, 2022; originally announced July 2022.

arXiv:2204.11137 [pdf, other]

Evaluating regular path queries under the all-shortest paths semantics

Authors: Domagoj Vrgoč

Abstract: The purpose of this report is to explain how the textbook breadth-first search algorithm (BFS) can be modified in order to also create a compact representation of all shortest paths connecting a single source node to all the nodes reachable from it. From this representation, all these paths can also be efficiently enumerated. We then apply this algorithm to solve a similar problem in edge labelled… ▽ More The purpose of this report is to explain how the textbook breadth-first search algorithm (BFS) can be modified in order to also create a compact representation of all shortest paths connecting a single source node to all the nodes reachable from it. From this representation, all these paths can also be efficiently enumerated. We then apply this algorithm to solve a similar problem in edge labelled graphs, where paths also have an additional restriction that their edge labels form a word belonging to a regular language. Namely, we solve the problem of evaluating regular path queries (RPQs) under the all-shortest paths semantics. △ Less

Submitted 2 February, 2023; v1 submitted 23 April, 2022; originally announced April 2022.

arXiv:2112.06217 [pdf, ps, other]

Graph Pattern Matching in GQL and SQL/PGQ

Authors: Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Hannes Voigt, Oskar van Rest, Domagoj Vrgoč, Mingxi Wu, Fred Zemke

Abstract: As graph databases become widespread, JTC1 -- the committee in joint charge of information technology standards for the International Organization for Standardization (ISO), and International Electrotechnical Commission (IEC) -- has approved a project to create GQL, a standard property graph query language. This complements a project to extend SQL with a new part, SQL/PGQ, which specifies how to d… ▽ More As graph databases become widespread, JTC1 -- the committee in joint charge of information technology standards for the International Organization for Standardization (ISO), and International Electrotechnical Commission (IEC) -- has approved a project to create GQL, a standard property graph query language. This complements a project to extend SQL with a new part, SQL/PGQ, which specifies how to define graph views over an SQL tabular schema, and to run read-only queries against them. Both projects have been assigned to the ISO/IEC JTC1 SC32 working group for Database Languages, WG3, which continues to maintain and enhance SQL as a whole. This common responsibility helps enforce a policy that the identical core of both PGQ and GQL is a graph pattern matching sub-language, here termed GPML. The WG3 design process is also analyzed by an academic working group, part of the Linked Data Benchmark Council (LDBC), whose task is to produce a formal semantics of these graph data languages, which complements their standard specifications. This paper, written by members of WG3 and LDBC, presents the key elements of the GPML of SQL/PGQ and GQL in advance of the publication of these new standards. △ Less

Submitted 12 December, 2021; originally announced December 2021.

ACM Class: H.2.3

arXiv:2111.01540 [pdf, other]

MillenniumDB: A Persistent, Open-Source, Graph Database

Authors: Domagoj Vrgoc, Carlos Rojas, Renzo Angles, Marcelo Arenas, Diego Arroyuelo, Carlos Buil Aranda, Aidan Hogan, Gonzalo Navarro, Cristian Riveros, Juan Romero

Abstract: In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data manage… ▽ More In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features. △ Less

Submitted 2 November, 2021; originally announced November 2021.

arXiv:2010.13717 [pdf, other]

Expressive power of linear algebra query languages

Authors: Floris Geerts, Thomas Muñoz, Cristian Riveros, Domagoj Vrgoč

Abstract: Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow loo** for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of r… ▽ More Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow loo** for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations. △ Less

Submitted 26 October, 2020; originally announced October 2020.

arXiv:1803.05277 [pdf, ps, other]

Constant delay algorithms for regular document spanners

Authors: Fernando Florenzano, Cristian Riveros, Martin Ugarte, Stijn Vansummeren, Domagoj Vrgoc

Abstract: Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have go… ▽ More Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine grained analysis of the classes of document spanners that support efficient enumeration of their results. △ Less

Submitted 14 March, 2018; originally announced March 2018.

arXiv:1707.00827 [pdf, ps, other]

Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity

Authors: Francisco Maturana, Cristian Riveros, Domagoj Vrgoč

Abstract: Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy… ▽ More Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose to extend information extraction languages with the ability to use map**s, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information, extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of these languages, focusing on the query enumeration problem, as well as satisfiability and containment. △ Less

Submitted 29 December, 2017; v1 submitted 4 July, 2017; originally announced July 2017.

arXiv:1701.06454 [pdf, other]

Evaluating navigational RDF queries over the Web

Authors: Jorge Baier, Dietrich Daroch, Juan Reutter, Domagoj Vrgoč

Abstract: Semantic Web, and its underlying data format RDF, lend themselves naturally to navigational querying due to their graph-like structure. This is particularly evident when considering RDF data on the Web, where various separately published datasets reference each other and form a giant graph known as the Web of Linked Data. And while navigational queries over singular RDF datasets are supported thro… ▽ More Semantic Web, and its underlying data format RDF, lend themselves naturally to navigational querying due to their graph-like structure. This is particularly evident when considering RDF data on the Web, where various separately published datasets reference each other and form a giant graph known as the Web of Linked Data. And while navigational queries over singular RDF datasets are supported through SPARQL property paths, not much is known about evaluating them over Linked Data. In this paper we propose a method for evaluating property path queries over the Web based on the classical AI search algorithm A*, show its optimality in the open world setting of the Web, and test it using real world queries which access a variety of RDF datasets available online. △ Less

Submitted 14 March, 2017; v1 submitted 23 January, 2017; originally announced January 2017.

arXiv:1701.02221 [pdf, ps, other]

JSON: data model, query languages and schema specification

Authors: Pierre Bourhis, Juan L. Reutter, Fernando Suárez, Domagoj Vrgoč

Abstract: Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there are no agreement upon theoretical framework for dealing with JSON. There- fore in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight que… ▽ More Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there are no agreement upon theoretical framework for dealing with JSON. There- fore in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language allowing us to navigate through JSON documents. We also introduce a logic capturing the schema proposal for JSON and study the complexity of basic computational tasks associated with these two formalisms. △ Less

Submitted 9 January, 2017; originally announced January 2017.

arXiv:1610.06264 [pdf, ps, other]

Foundations of Modern Query Languages for Graph Databases

Authors: Renzo Angles, Marcelo Arenas, Pablo Barcelo, Aidan Hogan, Juan Reutter, Domagoj Vrgoc

Abstract: We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges; and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start wit… ▽ More We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges; and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start with graph patterns, in which a graph-structured query is matched against the data. Thereafter we discuss navigational expressions, in which patterns can be matched recursively against the graph to navigate paths of arbitrary length; we give an overview of what kinds of expressions have been proposed, and how they can be combined with graph patterns. We also discuss several semantics under which queries using the previous features can be evaluated, what effects the selection of features and semantics has on complexity, and offer examples of such features in three modern languages that are used to query graphs: SPARQL, Cypher and Gremlin. We conclude by discussing the importance of formalisation for graph query languages; a summary of what is known about SPARQL, Cypher and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area. △ Less

Submitted 15 June, 2017; v1 submitted 19 October, 2016; originally announced October 2016.

Showing 1–14 of 14 results for author: Vrgoč, D