Search | arXiv e-print repository

Link Prediction with Relational Hypergraphs

Authors: Xingyue Huang, Miguel Romero Orth, Pablo Barceló, Michael M. Bronstein, İsmail İlkan Ceylan

Abstract: Link prediction with knowledge graphs has been thoroughly studied in graph machine learning, leading to a rich landscape of graph neural network architectures with successful applications. Nonetheless, it remains challenging to transfer the success of these architectures to relational hypergraphs, where the task of link prediction is over $k$-ary relations, which is substantially harder than link… ▽ More Link prediction with knowledge graphs has been thoroughly studied in graph machine learning, leading to a rich landscape of graph neural network architectures with successful applications. Nonetheless, it remains challenging to transfer the success of these architectures to relational hypergraphs, where the task of link prediction is over $k$-ary relations, which is substantially harder than link prediction with knowledge graphs. In this paper, we propose a framework for link prediction with relational hypergraphs, unlocking applications of graph neural networks to fully relational structures. Theoretically, we conduct a thorough analysis of the expressive power of the resulting model architectures via corresponding relational Weisfeiler-Leman algorithms and also via logical expressiveness. Empirically, we validate the power of the proposed model architectures on various relational hypergraph benchmarks. The resulting model architectures substantially outperform every baseline for inductive link prediction, and lead to state-of-the-art results for transductive link prediction. △ Less

Submitted 23 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2310.11636 [pdf, other]

A Uniform Language to Explain Decision Trees

Authors: Marcelo Arenas, Pablo Barcelo, Diego Bustamante, Jose Caraball, Bernardo Subercaseaux

Abstract: The formal XAI community has studied a plethora of interpretability queries aiming to understand the classifications made by decision trees. However, a more uniform understanding of what questions we can hope to answer about these models, traditionally deemed to be easily interpretable, has remained elusive. In an initial attempt to understand uniform languages for interpretability, Arenas et al.… ▽ More The formal XAI community has studied a plethora of interpretability queries aiming to understand the classifications made by decision trees. However, a more uniform understanding of what questions we can hope to answer about these models, traditionally deemed to be easily interpretable, has remained elusive. In an initial attempt to understand uniform languages for interpretability, Arenas et al. (2021) proposed FOIL, a logic for explaining black-box ML models, and showed that it can express a variety of interpretability queries. However, we show that FOIL is limited in two important senses: (i) it is not expressive enough to capture some crucial queries, and (ii) its model agnostic nature results in a high computational complexity for decision trees. In this paper, we carefully craft two fragments of first-order logic that allow for efficiently interpreting decision trees: Q-DT-FOIL and its optimization variant OPT-DT-FOIL. We show that our proposed logics can express not only a variety of interpretability queries considered by previous literature, but also elegantly allows users to specify different objectives the sought explanations should optimize for. Using finite model-theoretic techniques, we show that the different ingredients of Q-DT-FOIL are necessary for its expressiveness, and yet that queries in Q-DT-FOIL can be evaluated with a polynomial number of queries to a SAT solver, as well as their optimization versions in OPT-DT-FOIL. Besides our theoretical results, we provide a SAT-based implementation of the evaluation for OPT-DT-FOIL that is performant on industry-size decision trees. △ Less

Submitted 21 May, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

arXiv:2310.04598 [pdf, other]

A Neuro-Symbolic Framework for Answering Graph Pattern Queries in Knowledge Graphs

Authors: Tamara Cucumides, Daniel Daza, Pablo Barceló, Michael Cochez, Floris Geerts, Juan L Reutter, Miguel Romero

Abstract: The challenge of answering graph queries over incomplete knowledge graphs is gaining significant attention in the machine learning community. Neuro-symbolic models have emerged as a promising approach, combining good performance with high interpretability. These models utilize trained architectures to execute atomic queries and integrate modules that mimic symbolic query operators. However, most n… ▽ More The challenge of answering graph queries over incomplete knowledge graphs is gaining significant attention in the machine learning community. Neuro-symbolic models have emerged as a promising approach, combining good performance with high interpretability. These models utilize trained architectures to execute atomic queries and integrate modules that mimic symbolic query operators. However, most neuro-symbolic query processors are constrained to tree-like graph pattern queries. These queries admit a bottom-up execution with constant values or anchors at the leaves and the target variable at the root. While expressive, tree-like queries fail to capture critical properties in knowledge graphs, such as the existence of multiple edges between entities or the presence of triangles. We introduce a framework for answering arbitrary graph pattern queries over incomplete knowledge graphs, encompassing both cyclic queries and tree-like queries with existentially quantified leaves. These classes of queries are vital for practical applications but are beyond the scope of most current neuro-symbolic models. Our approach employs an approximation scheme that facilitates acyclic traversals for cyclic patterns, thereby embedding additional symbolic bias into the query execution process. Our experimental evaluation demonstrates that our framework performs competitively on three datasets, effectively handling cyclic queries through our approximation strategy. Additionally, it maintains the performance of existing neuro-symbolic models on anchored tree-like queries and extends their capabilities to queries with existentially quantified variables. △ Less

Submitted 5 June, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.03817 [pdf, ps, other]

Logical Languages Accepted by Transformer Encoders with Hard Attention

Authors: Pablo Barcelo, Alexander Kozachinskiy, Anthony Widjaja Lin, Vladimir Podolskii

Abstract: We contribute to the study of formal languages that can be recognized by transformer encoders. We focus on two self-attention mechanisms: (1) UHAT (Unique Hard Attention Transformers) and (2) AHAT (Average Hard Attention Transformers). UHAT encoders are known to recognize only languages inside the circuit complexity class ${\sf AC}^0$, i.e., accepted by a family of poly-sized and depth-bounded boo… ▽ More We contribute to the study of formal languages that can be recognized by transformer encoders. We focus on two self-attention mechanisms: (1) UHAT (Unique Hard Attention Transformers) and (2) AHAT (Average Hard Attention Transformers). UHAT encoders are known to recognize only languages inside the circuit complexity class ${\sf AC}^0$, i.e., accepted by a family of poly-sized and depth-bounded boolean circuits with unbounded fan-ins. On the other hand, AHAT encoders can recognize languages outside ${\sf AC}^0$), but their expressive power still lies within the bigger circuit complexity class ${\sf TC}^0$, i.e., ${\sf AC}^0$-circuits extended by majority gates. We first show a negative result that there is an ${\sf AC}^0$-language that cannot be recognized by an UHAT encoder. On the positive side, we show that UHAT encoders can recognize a rich fragment of ${\sf AC}^0$-languages, namely, all languages definable in first-order logic with arbitrary unary numerical predicates. This logic, includes, for example, all regular languages from ${\sf AC}^0$. We then show that AHAT encoders can recognize all languages of our logic even when we enrich it with counting terms. We apply these results to derive new results on the expressive power of UHAT and AHAT up to permutation of letters (a.k.a. Parikh images). △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.17053 [pdf, ps, other]

On the Power of the Weisfeiler-Leman Test for Graph Motif Parameters

Authors: Matthias Lanzinger, Pablo Barceló

Abstract: Seminal research in the field of graph neural networks (GNNs) has revealed a direct correspondence between the expressive capabilities of GNNs and the $k$-dimensional Weisfeiler-Leman ($k$WL) test, a widely-recognized method for verifying graph isomorphism. This connection has reignited interest in comprehending the specific graph properties effectively distinguishable by the $k$WL test. A central… ▽ More Seminal research in the field of graph neural networks (GNNs) has revealed a direct correspondence between the expressive capabilities of GNNs and the $k$-dimensional Weisfeiler-Leman ($k$WL) test, a widely-recognized method for verifying graph isomorphism. This connection has reignited interest in comprehending the specific graph properties effectively distinguishable by the $k$WL test. A central focus of research in this field revolves around determining the least dimensionality $k$, for which $k$WL can discern graphs with different number of occurrences of a pattern graph $P$. We refer to such a least $k$ as the WL-dimension of this pattern counting problem. This inquiry traditionally delves into two distinct counting problems related to patterns: subgraph counting and induced subgraph counting. Intriguingly, despite their initial appearance as separate challenges with seemingly divergent approaches, both of these problems are interconnected components of a more comprehensive problem: "graph motif parameters". In this paper, we provide a precise characterization of the WL-dimension of labeled graph motif parameters. As specific instances of this result, we obtain characterizations of the WL-dimension of the subgraph counting and induced subgraph counting problem for every labeled pattern $P$. We additionally demonstrate that in cases where the $k$WL test distinguishes between graphs with varying occurrences of a pattern $P$, the exact number of occurrences of $P$ can be computed uniformly using only local information of the last layer of a corresponding GNN. We finally delve into the challenge of recognizing the WL-dimension of various graph parameters. We give a polynomial time algorithm for determining the WL-dimension of the subgraph counting problem for given pattern $P$, answering an open question from previous work. △ Less

Submitted 28 March, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

arXiv:2305.08727 [pdf, other]

Separating Automatic Relations

Authors: Pablo Barceló, Diego Figueira, Rémi Morvan

Abstract: We study the separability problem for automatic relations (i.e., relations on finite words definable by synchronous automata) in terms of recognizable relations (i.e., finite unions of products of regular languages). This problem takes as input two automatic relations $R$ and $R'$, and asks if there exists a recognizable relation $S$ that contains $R$ and does not intersect $R'$. We show this prob… ▽ More We study the separability problem for automatic relations (i.e., relations on finite words definable by synchronous automata) in terms of recognizable relations (i.e., finite unions of products of regular languages). This problem takes as input two automatic relations $R$ and $R'$, and asks if there exists a recognizable relation $S$ that contains $R$ and does not intersect $R'$. We show this problem to be undecidable when the number of products allowed in the recognizable relation is fixed. In particular, checking if there exists a recognizable relation $S$ with at most $k$ products of regular languages that separates $R$ from $R'$ is undecidable, for each fixed $k \geq 2$. Our proofs reveal tight connections, of independent interest, between the separability problem and the finite coloring problem for automatic graphs, where colors are regular languages. △ Less

Submitted 2 August, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: Long version of a paper accepted at MFCS 2023

arXiv:2303.12853 [pdf, ps, other]

Three iterations of $(d-1)$-WL test distinguish non isometric clouds of $d$-dimensional points

Authors: Valentino Delle Rose, Alexander Kozachinskiy, Cristóbal Rojas, Mircea Petrache, Pablo Barceló

Abstract: The Weisfeiler--Lehman (WL) test is a fundamental iterative algorithm for checking isomorphism of graphs. It has also been observed that it underlies the design of several graph neural network architectures, whose capabilities and performance can be understood in terms of the expressive power of this test. Motivated by recent developments in machine learning applications to datasets involving thre… ▽ More The Weisfeiler--Lehman (WL) test is a fundamental iterative algorithm for checking isomorphism of graphs. It has also been observed that it underlies the design of several graph neural network architectures, whose capabilities and performance can be understood in terms of the expressive power of this test. Motivated by recent developments in machine learning applications to datasets involving three-dimensional objects, we study when the WL test is {\em complete} for clouds of euclidean points represented by complete distance graphs, i.e., when it can distinguish, up to isometry, any arbitrary such cloud. Our main result states that the $(d-1)$-dimensional WL test is complete for point clouds in $d$-dimensional Euclidean space, for any $d\ge 2$, and that only three iterations of the test suffice. Our result is tight for $d = 2, 3$. We also observe that the $d$-dimensional WL test only requires one iteration to achieve completeness. △ Less

Submitted 28 March, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: 14 pages

MSC Class: 05C60; 68R12; 68R10 ACM Class: F.2.2; G.2.2

arXiv:2302.02209 [pdf, ps, other]

A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge Graphs

Authors: Xingyue Huang, Miguel Romero Orth, İsmail İlkan Ceylan, Pablo Barceló

Abstract: Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains incomplete in the context of knowledge graphs. Our goal is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining to the pro… ▽ More Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains incomplete in the context of knowledge graphs. Our goal is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining to the prominent task of link prediction. Our analysis entails a unifying perspective on seemingly unrelated models and unlocks a series of other models. The expressive power of various models is characterized via a corresponding relational Weisfeiler-Leman algorithm. This analysis is extended to provide a precise logical characterization of the class of functions captured by a class of graph neural networks. The theoretical findings presented in this paper explain the benefits of some widely employed practical design choices, which are validated empirically. △ Less

Submitted 26 October, 2023; v1 submitted 4 February, 2023; originally announced February 2023.

Comments: Proceedings of the Thirty-Seventh Annual Conference on Advances in Neural Information Processing Systems (NeurIPS 2023). Code available at: https://github.com/HxyScotthuang/CMPNN

arXiv:2211.17113 [pdf, other]

Weisfeiler and Leman Go Relational

Authors: Pablo Barcelo, Mikhail Galkin, Christopher Morris, Miguel Romero Orth

Abstract: Knowledge graphs, modeling multi-relational data, improve numerous applications such as question answering or graph logical reasoning. Many graph neural networks for such data emerged recently, often outperforming shallow architectures. However, the design of such multi-relational graph neural networks is ad-hoc, driven mainly by intuition and empirical insights. Up to now, their expressivity, the… ▽ More Knowledge graphs, modeling multi-relational data, improve numerous applications such as question answering or graph logical reasoning. Many graph neural networks for such data emerged recently, often outperforming shallow architectures. However, the design of such multi-relational graph neural networks is ad-hoc, driven mainly by intuition and empirical insights. Up to now, their expressivity, their relation to each other, and their (practical) learning performance is poorly understood. Here, we initiate the study of deriving a more principled understanding of multi-relational graph neural networks. Namely, we investigate the limitations in the expressive power of the well-known Relational GCN and Compositional GCN architectures and shed some light on their practical learning performance. By aligning both architectures with a suitable version of the Weisfeiler-Leman test, we establish under which conditions both models have the same expressive power in distinguishing non-isomorphic (multi-relational) graphs or vertices with different structural roles. Further, by leveraging recent progress in designing expressive graph neural networks, we introduce the $k$-RN architecture that provably overcomes the expressiveness limitations of the above two architectures. Empirically, we confirm our theoretical findings in a vertex classification setting over small and large multi-relational graphs. △ Less

Submitted 30 November, 2022; originally announced November 2022.

Comments: Learning on Graphs Conference 2022. arXiv admin note: text overlap with arXiv:2206.11168

arXiv:2211.02144 [pdf, ps, other]

No Agreement Without Loss: Learning and Social Choice in Peer Review

Authors: Pablo Barceló, Mauricio Duarte, Cristóbal Rojas, Tomasz Steifer

Abstract: In peer review systems, reviewers are often asked to evaluate various features of submissions, such as technical quality or novelty. A score is given to each of the predefined features and based on these the reviewer has to provide an overall quantitative recommendation. It may be assumed that each reviewer has her own map** from the set of features to a recommendation, and that different review… ▽ More In peer review systems, reviewers are often asked to evaluate various features of submissions, such as technical quality or novelty. A score is given to each of the predefined features and based on these the reviewer has to provide an overall quantitative recommendation. It may be assumed that each reviewer has her own map** from the set of features to a recommendation, and that different reviewers have different map**s in mind. This introduces an element of arbitrariness known as commensuration bias. In this paper we discuss a framework, introduced by Noothigattu, Shah and Procaccia, and then applied by the organizers of the AAAI 2022 conference. Noothigattu, Shah and Procaccia proposed to aggregate reviewer's map** by minimizing certain loss functions, and studied axiomatic properties of this approach, in the sense of social choice theory. We challenge several of the results and assumptions used in their work and report a number of negative results. On the one hand, we study a trade-off between some of the axioms proposed and the ability of the method to properly capture agreements of the majority of reviewers. On the other hand, we show that drop** a certain unrealistic assumption has dramatic effects, including causing the method to be discontinuous. △ Less

Submitted 3 August, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: accepted for ECAI 2023

MSC Class: 91B14

arXiv:2207.12213 [pdf, other]

On Computing Probabilistic Explanations for Decision Trees

Authors: Marcelo Arenas, Pablo Barceló, Miguel Romero, Bernardo Subercaseaux

Abstract: Formal XAI (explainable AI) is a growing area that focuses on computing explanations with mathematical guarantees for the decisions made by ML models. Inside formal XAI, one of the most studied cases is that of explaining the choices taken by decision trees, as they are traditionally deemed as one of the most interpretable classes of models. Recent work has focused on studying the computation of "… ▽ More Formal XAI (explainable AI) is a growing area that focuses on computing explanations with mathematical guarantees for the decisions made by ML models. Inside formal XAI, one of the most studied cases is that of explaining the choices taken by decision trees, as they are traditionally deemed as one of the most interpretable classes of models. Recent work has focused on studying the computation of "sufficient reasons", a kind of explanation in which given a decision tree $T$ and an instance $x$, one explains the decision $T(x)$ by providing a subset $y$ of the features of $x$ such that for any other instance $z$ compatible with $y$, it holds that $T(z) = T(x)$, intuitively meaning that the features in $y$ are already enough to fully justify the classification of $x$ by $T$. It has been argued, however, that sufficient reasons constitute a restrictive notion of explanation, and thus the community has started to study their probabilistic counterpart, in which one requires that the probability of $T(z) = T(x)$ must be at least some value $δ\in (0, 1]$, where $z$ is a random instance that is compatible with $y$. Our paper settles the computational complexity of $δ$-sufficient-reasons over decision trees, showing that both (1) finding $δ$-sufficient-reasons that are minimal in size, and (2) finding $δ$-sufficient-reasons that are minimal inclusion-wise, do not admit polynomial-time algorithms (unless P=NP). This is in stark contrast with the deterministic case ($δ= 1$) where inclusion-wise minimal sufficient-reasons are easy to compute. By doing this, we answer two open problems originally raised by Izza et al. On the positive side, we identify structural restrictions of decision trees that make the problem tractable, and show how SAT solvers might be able to tackle these problems in practical settings. △ Less

Submitted 30 June, 2022; originally announced July 2022.

Comments: Preliminary version. Abstract slightly trimmed due to arxiv character limit

arXiv:2110.02376 [pdf, other]

Foundations of Symbolic Languages for Model Interpretability

Authors: Marcelo Arenas, Daniel Baez, Pablo Barceló, Jorge Pérez, Bernardo Subercaseaux

Abstract: Several queries and scores have recently been proposed to explain individual predictions over ML models. Given the need for flexible, reliable, and easy-to-apply interpretability methods for ML models, we foresee the need for develo** declarative languages to naturally specify different explainability queries. We do this in a principled way by rooting such a language in a logic, called FOIL, tha… ▽ More Several queries and scores have recently been proposed to explain individual predictions over ML models. Given the need for flexible, reliable, and easy-to-apply interpretability methods for ML models, we foresee the need for develo** declarative languages to naturally specify different explainability queries. We do this in a principled way by rooting such a language in a logic, called FOIL, that allows for expressing many simple but important explainability queries, and might serve as a core for more expressive interpretability languages. We study the computational complexity of FOIL queries over two classes of ML models often deemed to be easily interpretable: decision trees and OBDDs. Since the number of possible inputs for an ML model is exponential in its dimension, the tractability of the FOIL evaluation problem is delicate but can be achieved by either restricting the structure of the models or the fragment of FOIL being evaluated. We also present a prototype implementation of FOIL wrapped in a high-level declarative language and perform experiments showing that such a language can be used in practice. △ Less

Submitted 14 November, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

Comments: Accepted as Spotlight for NeurIPS'2021

arXiv:2106.06707 [pdf, other]

Graph Neural Networks with Local Graph Parameters

Authors: Pablo Barceló, Floris Geerts, Juan Reutter, Maksimilian Ryschkov

Abstract: Various recent proposals increase the distinguishing power of Graph Neural Networks GNNs by propagating features between $k$-tuples of vertices. The distinguishing power of these "higher-order'' GNNs is known to be bounded by the $k$-dimensional Weisfeiler-Leman (WL) test, yet their $\mathcal O(n^k)$ memory requirements limit their applicability. Other proposals infuse GNNs with local higher-order… ▽ More Various recent proposals increase the distinguishing power of Graph Neural Networks GNNs by propagating features between $k$-tuples of vertices. The distinguishing power of these "higher-order'' GNNs is known to be bounded by the $k$-dimensional Weisfeiler-Leman (WL) test, yet their $\mathcal O(n^k)$ memory requirements limit their applicability. Other proposals infuse GNNs with local higher-order graph structural information from the start, hereby inheriting the desirable $\mathcal O(n)$ memory requirement from GNNs at the cost of a one-time, possibly non-linear, preprocessing step. We propose local graph parameter enabled GNNs as a framework for studying the latter kind of approaches and precisely characterize their distinguishing power, in terms of a variant of the WL test, and in terms of the graph structural properties that they can take into account. Local graph parameters can be added to any GNN architecture, and are cheap to compute. In terms of expressive power, our proposal lies in the middle of GNNs and their higher-order counterparts. Further, we propose several techniques to aide in choosing the right local graph parameters. Our results connect GNNs with deep results in finite model theory and finite variable logics. Our experimental evaluation shows that adding local graph parameters often has a positive effect for a variety of GNNs, datasets and graph learning tasks. △ Less

Submitted 12 June, 2021; originally announced June 2021.

arXiv:2104.08015 [pdf, other]

On the Complexity of SHAP-Score-Based Explanations: Tractability via Knowledge Compilation and Non-Approximability Results

Authors: Marcelo Arenas, Pablo Barceló, Leopoldo Bertossi, Mikaël Monet

Abstract: In Machine Learning, the $\mathsf{SHAP}$-score is a version of the Shapley value that is used to explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is an intractable problem, we prove a strong positive result stating that the $\mathsf{SHAP}$-score can be computed in polynomial time over deterministic and decom… ▽ More In Machine Learning, the $\mathsf{SHAP}$-score is a version of the Shapley value that is used to explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is an intractable problem, we prove a strong positive result stating that the $\mathsf{SHAP}$-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits are studied in the field of Knowledge Compilation and generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees and Ordered Binary Decision Diagrams (OBDDs). We also establish the computational limits of the SHAP-score by observing that computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider. It also implies that computing $\mathsf{SHAP}$-scores is intractable as well over the class of propositional formulas in DNF. Based on this negative result, we look for the existence of fully-polynomial randomized approximation schemes (FPRAS) for computing $\mathsf{SHAP}$-scores over such class. In contrast to the model counting problem for DNF formulas, which admits an FPRAS, we prove that no such FPRAS exists for the computation of $\mathsf{SHAP}$-scores. Surprisingly, this negative result holds even for the class of monotone formulas in DNF. These techniques can be further extended to prove another strong negative result: Under widely believed complexity assumptions, there is no polynomial-time algorithm that checks, given a monotone DNF formula $\varphi$ and features $x,y$, whether the $\mathsf{SHAP}$-score of $x$ in $\varphi$ is smaller than the $\mathsf{SHAP}$-score of $y$ in $\varphi$. △ Less

Submitted 30 March, 2023; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Up to the formatting, this is the exact content of the paper in Journal of Machine Learning Research (JMLR)

arXiv:2011.09314 [pdf, ps, other]

First-Order Rewritability of Frontier-Guarded Ontology-Mediated Queries

Authors: Pablo Barcelo, Gerald Berger, Carsten Lutz, Andreas Pieris

Abstract: We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexi… ▽ More We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexity bound, it provides a transparent solution based on widely known tools. The second approach relies on a sophisticated automata model, known as cost automata. This allows us to show that our problem is 2ExpTime-complete. In both approaches, we provide semantic characterizations of FO-rewritability that are of independent interest. △ Less

Submitted 18 November, 2020; originally announced November 2020.

arXiv:2011.06330 [pdf, other]

The Complexity of Counting Problems over Incomplete Databases

Authors: Marcelo Arenas, Pablo Barceló, Mikaël Monet

Abstract: We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query $q$, we consider the following two problems: given as input an incomplete database $D$, (a) re… ▽ More We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query $q$, we consider the following two problems: given as input an incomplete database $D$, (a) return the number of completions of $D$ that satisfy $q$; or (b) return the number of valuations of the nulls of $D$ yielding a completion that satisfies $q$. We obtain dichotomies between \#P-hardness and polynomial-time computability for these problems when $q$ is a self-join-free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in $D$ (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations: for instance, while the latter is always in \#P, we prove that the former is not in \#P under some widely believed theoretical complexity assumption. Moreover, we find that both (1) and (2) can reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial-time randomized approximation scheme (FPRAS), in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes. △ Less

Submitted 28 April, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: 51 pages, including 43 pages of main text. Extended version of arXiv:1912.11064. Up to the stylesheet, page/environment numbering, minor formatting, and publisher-induced changes, this is the exact content of the paper in ACM Transactions on Computational Logic

ACM Class: H.2

arXiv:2010.12265 [pdf, other]

Model Interpretability through the Lens of Computational Complexity

Authors: Pablo Barceló, Mikaël Monet, Jorge Pérez, Bernardo Subercaseaux

Abstract: In spite of several claims stating that some models are more interpretable than others -- e.g., "linear models are more interpretable than deep neural networks" -- we still lack a principled notion of interpretability to formally compare among different classes of models. We make a step towards such a notion by studying whether folklore interpretability claims have a correlate in terms of computat… ▽ More In spite of several claims stating that some models are more interpretable than others -- e.g., "linear models are more interpretable than deep neural networks" -- we still lack a principled notion of interpretability to formally compare among different classes of models. We make a step towards such a notion by studying whether folklore interpretability claims have a correlate in terms of computational complexity theory. We focus on local post-hoc explainability queries that, intuitively, attempt to answer why individual inputs are classified in a certain way by a given model. In a nutshell, we say that a class $\mathcal{C}_1$ of models is more interpretable than another class $\mathcal{C}_2$, if the computational complexity of answering post-hoc queries for models in $\mathcal{C}_2$ is higher than for those in $\mathcal{C}_1$. We prove that this notion provides a good theoretical counterpart to current beliefs on the interpretability of models; in particular, we show that under our definition and assuming standard complexity-theoretical assumptions (such as P$\neq$NP), both linear and tree-based models are strictly more interpretable than neural networks. Our complexity analysis, however, does not provide a clear-cut difference between linear and tree-based models, as we obtain different results depending on the particular post-hoc explanations considered. Finally, by applying a finer complexity analysis based on parameterized complexity, we are able to prove a theoretical result suggesting that shallow neural networks are more interpretable than deeper ones. △ Less

Submitted 12 November, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: 36 pages, including 9 pages of main text. This is the arXiv version of the NeurIPS'2020 paper. Except from minor differences that could be introduced by the publisher, the only difference should be the addition of the appendix, which contains all the proofs that do not appear in the main text

arXiv:2007.14045 [pdf, ps, other]

The Tractability of SHAP-Score-Based Explanations over Deterministic and Decomposable Boolean Circuits

Authors: Marcelo Arenas, Pablo Barceló Leopoldo Bertossi, Mikaël Monet

Abstract: Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractab… ▽ More Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, it has recently been claimed that the SHAP-score can be computed in polynomial time over the class of decision trees. In this paper, we provide a proof of a stronger result over Boolean models: the SHAP-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits, also known as tractable Boolean circuits, generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). We also establish the computational limits of the notion of SHAP-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the SHAP-score intractable (namely, #P-hard). △ Less

Submitted 3 April, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: 17 pages, including 8 pages of main text. arXiv version of the AAAI'21 conference paper. Except from the addition of the technical appendix, the content is the same as the AAAI one

arXiv:2003.07800 [pdf, ps, other]

When is Ontology-Mediated Querying Efficient?

Authors: Pablo Barcelo, Cristina Feier, Carsten Lutz, Andreas Pieris

Abstract: In ontology-mediated querying, description logic (DL) ontologies are used to enrich incomplete data with domain knowledge which results in more complete answers to queries. However, the evaluation of ontology-mediated queries (OMQs) over relational databases is computationally hard. This raises the question when OMQ evaluation is efficient, in the sense of being tractable in combined complexity or… ▽ More In ontology-mediated querying, description logic (DL) ontologies are used to enrich incomplete data with domain knowledge which results in more complete answers to queries. However, the evaluation of ontology-mediated queries (OMQs) over relational databases is computationally hard. This raises the question when OMQ evaluation is efficient, in the sense of being tractable in combined complexity or fixed-parameter tractable. We study this question for a range of ontology-mediated query languages based on several important and widely-used DLs, using unions of conjunctive queries as the actual queries. For the DL ELHI extended with the bottom concept, we provide a characterization of the classes of OMQs that are fixed-parameter tractable. For its fragment EL extended with domain and range restrictions and the bottom concept (which restricts the use of inverse roles), we provide a characterization of the classes of OMQs that are tractable in combined complexity. Both results are in terms of equivalence to OMQs of bounded tree width and rest on a reasonable assumption from parameterized complexity theory. They are similar in spirit to Grohe's seminal characterization of the tractable classes of conjunctive queries over relational databases. We further study the complexity of the meta problem of deciding whether a given OMQ is equivalent to an OMQ of bounded tree width, providing several completeness results that range from NP to 2ExpTime, depending on the DL used. We also consider the DL-Lite family of DLs, including members that admit functional roles. △ Less

Submitted 29 July, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

arXiv:1912.12442 [pdf, ps, other]

The Limits of Efficiency for Open- and Closed-World Query Evaluation Under Guarded TGDs

Authors: Pablo Barcelo, Victor Dalmau, Cristina Feier, Carsten Lutz, Andreas Pieris

Abstract: Ontology-mediated querying and querying in the presence of constraints are two key database problems where tuple-generating dependencies (TGDs) play a central role. In ontology-mediated querying, TGDs can formalize the ontology and thus derive additional facts from the given data, while in querying in the presence of constraints, they restrict the set of admissible databases. In this work, we stud… ▽ More Ontology-mediated querying and querying in the presence of constraints are two key database problems where tuple-generating dependencies (TGDs) play a central role. In ontology-mediated querying, TGDs can formalize the ontology and thus derive additional facts from the given data, while in querying in the presence of constraints, they restrict the set of admissible databases. In this work, we study the limits of efficient query evaluation in the context of the above two problems, focussing on guarded and frontier-guarded TGDs and on UCQs as the actual queries. We show that a class of ontology-mediated queries (OMQs) based on guarded TGDs can be evaluated in FPT iff the OMQs in the class are equivalent to OMQs in which the actual query has bounded treewidth, up to some reasonable assumptions. For querying in the presence of constraints, we consider classes of constraint-query specifications (CQSs) that bundle a set of constraints with an actual query. We show a dichotomy result for CQSs based on guarded TGDs that parallels the one for OMQs except that, additionally, FPT coincides with PTime combined complexity. The proof is based on a novel connection between OMQ and CQS evaluation. Using a direct proof, we also show a similar dichotomy result, again up to some reasonable assumptions, for CQSs based on frontier-guarded TGDs with a bounded number of atoms in TGD heads. Our results on CQSs can be viewed as extensions of Grohe's well-known characterization of the tractable classes of CQs (without constraints). Like Grohe's characterization, all the above results assume that the arity of relation symbols is bounded by a constant. We also study the associated meta problems, i.e., whether a given OMQ or CQS is equivalent to one in which the actual query has bounded treewidth. △ Less

Submitted 28 December, 2019; originally announced December 2019.

arXiv:1912.11064 [pdf, other]

doi 10.1145/3375395.3387656

Counting Problems over Incomplete Databases

Authors: Marcelo Arenas, Pablo Barceló, Mikaël Monet

Abstract: We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query $q$, we consider the following two problems: given as input an incomplete database $D$, (a) re… ▽ More We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query $q$, we consider the following two problems: given as input an incomplete database $D$, (a) return the number of completions of $D$ that satisfy $q$; or (b) return or the number of valuations of the nulls of $D$ yielding a completion that satisfies $q$. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when $q$ is a self-join--free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in $D$ (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations (for instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption). Moreover, we find that both (1) and (2) reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial randomized approximation scheme, in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes. △ Less

Submitted 26 March, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

Comments: 29 pages, including 12 pages of main text. This is the arXiv version of the PODS'20 paper. Except from minor differences that could be introduced by the publisher, the only difference should be the addition of the appendix, which contains all the proofs that do not appear in the main text

arXiv:1909.11693 [pdf, other]

On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra

Authors: Pablo Barceló, Nelson Higuera, Jorge Pérez, Bernardo Subercaseaux

Abstract: We study the expressive power of the LARA language -- a recently proposed unified model for expressing relational and linear algebra operations -- both in terms of traditional database query languages and some analytic tasks often performed in machine learning pipelines. We start by showing LARA to be expressive complete with respect to first-order logic with aggregation. Since LARA is parameteriz… ▽ More We study the expressive power of the LARA language -- a recently proposed unified model for expressing relational and linear algebra operations -- both in terms of traditional database query languages and some analytic tasks often performed in machine learning pipelines. We start by showing LARA to be expressive complete with respect to first-order logic with aggregation. Since LARA is parameterized by a set of user-defined functions which allow to transform values in tables, the exact expressive power of the language depends on how these functions are defined. We distinguish two main cases depending on the level of genericity queries are enforced to satisfy. Under strong genericity assumptions the language cannot express matrix convolution, a very important operation in current machine learning operations. This language is also local, and thus cannot express operations such as matrix inverse that exhibit a recursive behavior. For expressing convolution, one can relax the genericity requirement by adding an underlying linear order on the domain. This, however, destroys locality and turns the expressive power of the language much more difficult to understand. In particular, although under complexity assumptions the resulting language can still not express matrix inverse, a proof of this fact without such assumptions seems challenging to obtain. △ Less

Submitted 25 September, 2019; originally announced September 2019.

arXiv:1904.00934 [pdf, ps, other]

A More General Theory of Static Approximations for Conjunctive Queries

Authors: Pablo Barceló, Miguel Romero, Thomas Zeume

Abstract: Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. Approximating a hard CQ by a query from such a fragment can thus allow for an efficient approximate evaluation. While underapproximations (i.e., approximations that return correct answers only) are well-understood, the dual notion of overapproximations (i.e, approximations that return c… ▽ More Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. Approximating a hard CQ by a query from such a fragment can thus allow for an efficient approximate evaluation. While underapproximations (i.e., approximations that return correct answers only) are well-understood, the dual notion of overapproximations (i.e, approximations that return complete - but not necessarily sound - answers), and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations has been open. This article establishes a connection between overapproximations and existential pebble games that allows for studying such problems systematically. Building on this connection, it is shown that the evaluation and identification problem for overapproximations can be solved in polynomial time. While the general existence problem remains open, the problem is shown to be decidable in 2EXPTIME over the class of acyclic CQs and in PTIME for Boolean CQs over binary schemata. Additionally we propose a more liberal notion of overapproximations to remedy the known shortcoming that queries might not have an overapproximation, and study how queries can be overapproximated in the presence of tuple generating and equality generating dependencies. The techniques are then extended to symmetric difference approximations and used to provide several complexity results for the identification, existence, and evaluation problem for this type of approximations. △ Less

Submitted 1 April, 2019; originally announced April 2019.

arXiv:1904.00850 [pdf, other]

Boundedness of Conjunctive Regular Path Queries

Authors: Pablo Barceló, Diego Figueira, Miguel Romero

Abstract: We study the boundedness problem for unions of conjunctive regular path queries with inverses (UC2RPQs). This is the problem of, given a UC2RPQ, checking whether it is equivalent to a union of conjunctive queries (UCQ). We show the problem to be ExpSpace-complete, thus coinciding with the complexity of containment for UC2RPQs. As a corollary, when a UC2RPQ is bounded, it is equivalent to a UCQ of… ▽ More We study the boundedness problem for unions of conjunctive regular path queries with inverses (UC2RPQs). This is the problem of, given a UC2RPQ, checking whether it is equivalent to a union of conjunctive queries (UCQ). We show the problem to be ExpSpace-complete, thus coinciding with the complexity of containment for UC2RPQs. As a corollary, when a UC2RPQ is bounded, it is equivalent to a UCQ of at most triple-exponential size, and in fact we show that this bound is optimal. We also study better behaved classes of UC2RPQs, namely acyclic UC2RPQs of bounded thickness, and strongly connected UCRPQs, whose boundedness problem are, respectively, PSpace-complete and $Π^p_2$-complete. Most upper bounds exploit results on limitedness for distance automata, in particular extending the model with alternation and two-wayness, which may be of independent interest. △ Less

Submitted 1 April, 2019; originally announced April 2019.

arXiv:1903.00728 [pdf, other]

Monadic Decomposability of Regular Relations

Authors: Pablo Barcelo, Chih-Duo Hong, Xuan-Bach Le, Anthony W. Lin, Reino Niskanen

Abstract: Monadic decomposibility --- the ability to determine whether a formula in a given logical theory can be decomposed into a boolean combination of monadic formulas --- is a powerful tool for devising a decision procedure for a given logical theory. In this paper, we revisit a classical decision problem in automata theory: given a regular (a.k.a. synchronized rational) relation, determine whether it… ▽ More Monadic decomposibility --- the ability to determine whether a formula in a given logical theory can be decomposed into a boolean combination of monadic formulas --- is a powerful tool for devising a decision procedure for a given logical theory. In this paper, we revisit a classical decision problem in automata theory: given a regular (a.k.a. synchronized rational) relation, determine whether it is recognizable, i.e., it has a monadic decomposition (that is, a representation as a boolean combination of cartesian products of regular languages). Regular relations are expressive formalisms which, using an appropriate string encoding, can capture relations definable in Presburger Arithmetic. In fact, their expressive power coincide with relations definable in a universal automatic structure; equivalently, those definable by finite set interpretations in WS1S (Weak Second Order Theory of One Successor). Determining whether a regular relation admits a recognizable relation was known to be decidable (and in exponential time for binary relations), but its precise complexity still hitherto remains open. Our main contribution is to fully settle the complexity of this decision problem by develo** new techniques employing infinite Ramsey theory. The complexity for DFA (resp. NFA) representations of regular relations is shown to be NLOGSPACE-complete (resp. \PSPACE-complete). △ Less

Submitted 8 May, 2019; v1 submitted 2 March, 2019; originally announced March 2019.

Comments: Full version of the paper accepted to ICALP 2019

arXiv:1901.03429 [pdf, other]

On the Turing Completeness of Modern Neural Network Architectures

Authors: Jorge Pérez, Javier Marinković, Pablo Barceló

Abstract: Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Trans… ▽ More Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results. △ Less

Submitted 10 January, 2019; originally announced January 2019.

Comments: ICLR2019 camera ready

arXiv:1712.01550 [pdf, other]

G-CORE: A Core for Future Graph Query Languages

Authors: Renzo Angles, Marcelo Arenas, Pablo Barceló, Peter Boncz, George H. L. Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, Juan Sequeda, Oskar van Rest, Hannes Voigt

Abstract: We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning, that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class… ▽ More We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning, that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity. △ Less

Submitted 6 December, 2017; v1 submitted 5 December, 2017; originally announced December 2017.

arXiv:1703.07994 [pdf, ps, other]

Containment for Rule-Based Ontology-Mediated Queries

Authors: Pablo Barcelo, Gerald Berger, Andreas Pieris

Abstract: Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability for the problem of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. In this work, we study the containment problem for OM… ▽ More Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability for the problem of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. In this work, we study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand, which in turn allow us to delimitate its practical applicability. We also apply our techniques to pinpoint the complexity of problems associated with two emerging applications of OMQ containment: distribution over components and UCQ rewritability of OMQs. △ Less

Submitted 18 April, 2017; v1 submitted 23 March, 2017; originally announced March 2017.

arXiv:1701.09007 [pdf, other]

Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151)

Authors: Serge Abiteboul, Marcelo Arenas, Pablo Barceló, Meghyn Bienvenu, Diego Calvanese, Claire David, Richard Hull, Eyke Hüllermeier, Benny Kimelfeld, Leonid Libkin, Wim Martens, Tova Milo, Filip Murlak, Frank Neven, Magdalena Ortiz, Thomas Schwentick, Julia Stoyanovich, Jianwen Su, Dan Suciu, Victor Vianu, Ke Yi

Abstract: In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (ICDT). The mission of this workshop was to identify… ▽ More In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (ICDT). The mission of this workshop was to identify and explore some of the most important research directions that have high relevance to society and to Computer Science today, and where the PDM community has the potential to make significant contributions. This report describes the family of research directions that the workshop focused on from three perspectives: potential practical relevance, results already obtained, and research questions that appear surmountable in the short and medium term. △ Less

Submitted 31 January, 2017; originally announced January 2017.

arXiv:1610.06264 [pdf, ps, other]

Foundations of Modern Query Languages for Graph Databases

Authors: Renzo Angles, Marcelo Arenas, Pablo Barcelo, Aidan Hogan, Juan Reutter, Domagoj Vrgoc

Abstract: We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges; and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start wit… ▽ More We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges; and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start with graph patterns, in which a graph-structured query is matched against the data. Thereafter we discuss navigational expressions, in which patterns can be matched recursively against the graph to navigate paths of arbitrary length; we give an overview of what kinds of expressions have been proposed, and how they can be combined with graph patterns. We also discuss several semantics under which queries using the previous features can be evaluated, what effects the selection of features and semantics has on complexity, and offer examples of such features in three modern languages that are used to query graphs: SPARQL, Cypher and Gremlin. We conclude by discussing the importance of formalisation for graph query languages; a summary of what is known about SPARQL, Cypher and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area. △ Less

Submitted 15 June, 2017; v1 submitted 19 October, 2016; originally announced October 2016.

arXiv:1606.01206 [pdf, other]

The complexity of reverse engineering problems for conjunctive queries

Authors: Pablo Barcelo, Miguel Romero

Abstract: Reverse engineering problems for conjunctive queries (CQs), such as query by example (QBE) or definability, take a set of user examples and convert them into an explanatory CQ. Despite their importance, the complexity of these problems is prohibitively high (coNEXPTIME-complete). We isolate their two main sources of complexity and propose relaxations of them that reduce the complexity while having… ▽ More Reverse engineering problems for conjunctive queries (CQs), such as query by example (QBE) or definability, take a set of user examples and convert them into an explanatory CQ. Despite their importance, the complexity of these problems is prohibitively high (coNEXPTIME-complete). We isolate their two main sources of complexity and propose relaxations of them that reduce the complexity while having meaningful theoretical interpretations. The first relaxation is based on the idea of using existential pebble games for approximating homomorphism tests. We show that this characterizes QBE/definability for CQs up to treewidth $k$, while reducing the complexity to EXPTIME. As a side result, we obtain that the complexity of the QBE/definability problems for CQs of treewidth $k$ is EXPTIME-complete for each $k \geq 1$. The second relaxation is based on the idea of "desynchronizing" direct products, which characterizes QBE/definability for unions of CQs and reduces the complexity to coNP. The combination of these two relaxations yields tractability for QBE and characterizes it in terms of unions of CQs of treewidth at most $k$. We also study the complexity of these problems for conjunctive regular path queries over graph databases, showing them to be no more difficult than for CQs. △ Less

Submitted 7 July, 2016; v1 submitted 3 June, 2016; originally announced June 2016.

arXiv:1603.04309 [pdf, ps, other]

doi 10.2168/LMCS-12(1:9)2016

Order-Invariant Types and Their Applications

Authors: Pablo Barcelo, Leonid Libkin

Abstract: Our goal is to show that the standard model-theoretic concept of types can be applied in the study of order-invariant properties, i.e., properties definable in a logic in the presence of an auxiliary order relation, but not actually dependent on that order relation. This is somewhat surprising since order-invariant properties are more of a combinatorial rather than a logical object. We provide tw… ▽ More Our goal is to show that the standard model-theoretic concept of types can be applied in the study of order-invariant properties, i.e., properties definable in a logic in the presence of an auxiliary order relation, but not actually dependent on that order relation. This is somewhat surprising since order-invariant properties are more of a combinatorial rather than a logical object. We provide two applications of this notion. One is a proof, from the basic principles, of a theorem by Courcelle stating that over trees, order-invariant MSO properties are expressible in MSO with counting quantifiers. The other is an analog of the Feferman-Vaught theorem for order-invariant properties. △ Less

Submitted 30 March, 2016; v1 submitted 14 March, 2016; originally announced March 2016.

Journal ref: Logical Methods in Computer Science, Volume 12, Issue 1 (April 1, 2016) lmcs:1632

arXiv:1602.01366 [pdf, ps, other]

Semantic Acyclicity Under Constraints

Authors: Pablo Barcelo, Georg Gottlob, Andreas Pieris

Abstract: A conjunctive query (CQ) is semantically acyclic if it is equivalent to an acyclic one. Semantic acyclicity has been studied in the constraint-free case, and deciding whether a query enjoys this property is NP-complete. However, in case the database is subject to constraints such as tuple-generating dependencies (tgds) that can express, e.g., inclusion dependencies, or equality-generating dependen… ▽ More A conjunctive query (CQ) is semantically acyclic if it is equivalent to an acyclic one. Semantic acyclicity has been studied in the constraint-free case, and deciding whether a query enjoys this property is NP-complete. However, in case the database is subject to constraints such as tuple-generating dependencies (tgds) that can express, e.g., inclusion dependencies, or equality-generating dependencies (egds) that capture, e.g., functional dependencies, a CQ may turn out to be semantically acyclic under the constraints while not semantically acyclic in general. This opens avenues to new query optimization techniques. In this paper we initiate and develop the theory of semantic acyclicity under constraints. More precisely, we study the following natural problem: Given a CQ and a set of constraints, is the query semantically acyclic under the constraints, or, in other words, is the query equivalent to an acyclic one over all those databases that satisfy the set of constraints? We show that, contrary to what one might expect, decidability of CQ containment is a necessary but not sufficient condition for the decidability of semantic acyclicity. In particular, we show that semantic acyclicity is undecidable in the presence of full tgds (i.e., Datalog rules). In view of this fact, we focus on the main classes of tgds for which CQ containment is decidable, and do not capture the class of full tgds, namely guarded, non-recursive and sticky tgds. For these classes we show that semantic acyclicity is decidable, and its complexity coincides with the complexity of CQ containment. In the case of egds, we show that for keys over unary and binary predicates semantic acyclicity is decidable (NP-complete). We finally consider the problem of evaluating a semantically acyclic query over a database that satisfies a set of constraints; for guarded tgds and functional dependencies this problem is tractable. △ Less

Submitted 3 June, 2016; v1 submitted 3 February, 2016; originally announced February 2016.

arXiv:1511.01633 [pdf, ps, other]

String Solving with Word Equations and Transducers: Towards a Logic for Analysing Mutation XSS (Full Version)

Authors: Anthony W. Lin, Pablo Barcelo

Abstract: We study the fundamental issue of decidability of satisfiability over string logics with concatenations and finite-state transducers as atomic operations. Although restricting to one type of operations yields decidability, little is known about the decidability of their combined theory, which is especially relevant when analysing security vulnerabilities of dynamic web pages in a more realistic br… ▽ More We study the fundamental issue of decidability of satisfiability over string logics with concatenations and finite-state transducers as atomic operations. Although restricting to one type of operations yields decidability, little is known about the decidability of their combined theory, which is especially relevant when analysing security vulnerabilities of dynamic web pages in a more realistic browser model. On the one hand, word equations (string logic with concatenations) cannot precisely capture sanitisation functions (e.g. htmlescape) and implicit browser transductions (e.g. innerHTML mutations). On the other hand, transducers suffer from the reverse problem of being able to model sanitisation functions and browser transductions, but not string concatenations. Naively combining word equations and transducers easily leads to an undecidable logic. Our main contribution is to show that the "straight-line fragment" of the logic is decidable (complexity ranges from PSPACE to EXPSPACE). The fragment can express the program logics of straight-line string-manipulating programs with concatenations and transductions as atomic operations, which arise when performing bounded model checking or dynamic symbolic executions. We demonstrate that the logic can naturally express constraints required for analysing mutation XSS in web applications. Finally, the logic remains decidable in the presence of length, letter-counting, regular, indexOf, and disequality constraints. △ Less

Submitted 5 November, 2015; originally announced November 2015.

Comments: Full version of POPL'16 paper

ACM Class: F.3.1

arXiv:1507.07911 [pdf, ps, other]

doi 10.2168/LMCS-11(4:1)2015

Expressive Path Queries on Graph with Data

Authors: Pablo Barcelo, Gaelle Fontaine, Anthony Widjaja Lin

Abstract: Graph data models have recently become popular owing to their applications, e.g., in social networks and the semantic web. Typical navigational query languages over graph databases - such as Conjunctive Regular Path Queries (CRPQs) - cannot express relevant properties of the interaction between the underlying data and the topology. Two languages have been recently proposed to overcome this problem… ▽ More Graph data models have recently become popular owing to their applications, e.g., in social networks and the semantic web. Typical navigational query languages over graph databases - such as Conjunctive Regular Path Queries (CRPQs) - cannot express relevant properties of the interaction between the underlying data and the topology. Two languages have been recently proposed to overcome this problem: walk logic (WL) and regular expressions with memory (REM). In this paper, we begin by investigating fundamental properties of WL and REM, i.e., complexity of evaluation problems and expressive power. We first show that the data complexity of WL is nonelementary, which rules out its practicality. On the other hand, while REM has low data complexity, we point out that many natural data/topology properties of graphs expressible in WL cannot be expressed in REM. To this end, we propose register logic, an extension of REM, which we show to be able to express many natural graph properties expressible in WL, while at the same time preserving the elementariness of data complexity of REMs. It is also incomparable to WL in terms of expressive power. △ Less

Submitted 5 October, 2015; v1 submitted 28 July, 2015; originally announced July 2015.

Comments: 39 pages

Journal ref: Logical Methods in Computer Science, Volume 11, Issue 4 (October 5, 2015) lmcs:1602

arXiv:1304.4150 [pdf, other]

doi 10.2168/LMCS-9(3:1)2013

Graph Logics with Rational Relations

Authors: Pablo Barcelo, Diego Figueira, Leonid Libkin

Abstract: We investigate some basic questions about the interaction of regular and rational relations on words. The primary motivation comes from the study of logics for querying graph topology, which have recently found numerous applications. Such logics use conditions on paths expressed by regular languages and relations, but they often need to be extended by rational relations such as subword or subseque… ▽ More We investigate some basic questions about the interaction of regular and rational relations on words. The primary motivation comes from the study of logics for querying graph topology, which have recently found numerous applications. Such logics use conditions on paths expressed by regular languages and relations, but they often need to be extended by rational relations such as subword or subsequence. Evaluating formulae in such extended graph logics boils down to checking nonemptiness of the intersection of rational relations with regular or recognizable relations (or, more generally, to the generalized intersection problem, asking whether some projections of a regular relation have a nonempty intersection with a given rational relation). We prove that for several basic and commonly used rational relations, the intersection problem with regular relations is either undecidable (e.g., for subword or suffix, and some generalizations), or decidable with non-primitive-recursive complexity (e.g., for subsequence and its generalizations). These results are used to rule out many classes of graph logics that freely combine regular and rational relations, as well as to provide the simplest problem related to verifying lossy channel systems that has non-primitive-recursive complexity. We then prove a dichotomy result for logics combining regular conditions on individual paths and rational relations on paths, by showing that the syntactic form of formulae classifies them into either efficiently checkable or undecidable cases. We also give examples of rational relations for which such logics are decidable even without syntactic restrictions. △ Less

Submitted 2 July, 2013; v1 submitted 15 April, 2013; originally announced April 2013.

Journal ref: Logical Methods in Computer Science, Volume 9, Issue 3 (July 4, 2013) lmcs:664

arXiv:1107.0577 [pdf, ps, other]

Parameterized Regular Expressions and their Languages

Authors: Pablo Barceló, Leonid Libkin, Juan Reutter

Abstract: We study regular expressions that use variables, or parameters, which are interpreted as alphabet letters. We consider two classes of languages denoted by such expressions: under the possibility semantics, a word belongs to the language if it is denoted by some regular expression obtained by replacing variables with letters; under the certainly semantics, the word must be denoted by every such exp… ▽ More We study regular expressions that use variables, or parameters, which are interpreted as alphabet letters. We consider two classes of languages denoted by such expressions: under the possibility semantics, a word belongs to the language if it is denoted by some regular expression obtained by replacing variables with letters; under the certainly semantics, the word must be denoted by every such expression. Such languages are regular, and we show that they naturally arise in several applications such as querying graph databases and program analysis. As the main contribution of the paper, we provide a complete characterization of the complexity of the main computational problems related to such languages: nonemptiness, universality, containment, membership, as well as the problem of constructing NFAs capturing such languages. We also look at the extension when domains of variables could be arbitrary regular languages, and show that under the certainty semantics, languages remain regular and the complexity of the main computational problems does not change. △ Less

Submitted 4 July, 2011; originally announced July 2011.

arXiv:0811.0537 [pdf, ps, other]

doi 10.2168/LMCS-4(4:11)2008

First-Order and Temporal Logics for Nested Words

Authors: Rajeev Alur, Marcelo Arenas, Pablo Barcelo, Kousha Etessami, Neil Immerman, Leonid Libkin

Abstract: Nested words are a structured model of execution paths in procedural programs, reflecting their call and return nesting structure. Finite nested words also capture the structure of parse trees and other tree-structured data, such as XML. We provide new temporal logics for finite and infinite nested words, which are natural extensions of LTL, and prove that these logics are first-order expressivel… ▽ More Nested words are a structured model of execution paths in procedural programs, reflecting their call and return nesting structure. Finite nested words also capture the structure of parse trees and other tree-structured data, such as XML. We provide new temporal logics for finite and infinite nested words, which are natural extensions of LTL, and prove that these logics are first-order expressively-complete. One of them is based on adding a "within" modality, evaluating a formula on a subword, to a logic CaRet previously studied in the context of verifying properties of recursive state machines (RSMs). The other logic, NWTL, is based on the notion of a summary path that uses both the linear and nesting structures. For NWTL we show that satisfiability is EXPTIME-complete, and that model-checking can be done in time polynomial in the size of the RSM model and exponential in the size of the NWTL formula (and is also EXPTIME-complete). Finally, we prove that first-order logic over nested words has the three-variable property, and we present a temporal logic for nested words which is complete for the two-variable fragment of first-order. △ Less

Submitted 3 March, 2011; v1 submitted 4 November, 2008; originally announced November 2008.

Comments: revised and corrected version of Mar 03, 2011

ACM Class: F.1.1, F.3.1, F.4.1

Journal ref: Logical Methods in Computer Science, Volume 4, Issue 4 (November 25, 2008) lmcs:782

Showing 1–38 of 38 results for author: Barcelo, P