Cardinality Estimation over Knowledge Graphs with Embeddings and Graph Neural Networks

[Uncaptioned image] Tim Schwabe
Technical University of Munich
TUM School of Computation, Information and Technology, Germany
&[Uncaptioned image] Maribel Acosta
Technical University of Munich
TUM School of Computation, Information and Technology, Germany
Work partially conducted at Ruhr University Bochum.
Abstract

Cardinality Estimation over Knowledge Graphs (KG) is crucial for query optimization, yet remains a challenging task due to the semi-structured nature and complex correlations of data in typical KGs. In this work, we propose GNCE, a novel approach that leverages knowledge graph embeddings and Graph Neural Networks (GNN) to accurately predict the cardinality of conjunctive queries over KGs. GNCE first creates semantically meaningful embeddings for all entities in the KG, which are then used to learn a representation of a query using a GNN to estimate the cardinality of the query. We evaluate GNCE on several KGs in terms of q-Error and demonstrate that it outperforms state-of-the-art approaches based on sampling, summaries, and (machine) learning in terms of estimation accuracy while also having a low execution time and few parameters. Additionally, we show that GNCE performs similarly well on real-world queries and can inductively generalize to unseen entities, making it suitable for use in dynamic query processing scenarios. Our proposed approach has the potential to significantly improve query optimization and related applications that rely on accurate cardinality estimates of conjunctive queries.

Keywords Cardinality Estimation  \cdot Knowledge Graphs  \cdot Machine Learning

1 Introduction

Knowledge Graphs (KGs) allow for representing interconnected data, using semi-structured data models like the Resource Description Framework (RDF) Cyganiak et al. (2014). In KGs, statements are represented in the form of a graph, where labeled nodes typically correspond to entities, and directed, labeled edges are connections between the nodes. Nowadays, KGs are fueling diverse applications like question-answering, recommender systems, and semantic databases Noy et al. (2019).

Similar to other database technologies, the data encoded in a KG can be queried using a structured query language like SPARQL Prud’hommeaux and Seaborne (2008). In particular, conjunctive queries over KGs can be formulated also as graphs, where nodes and edges can be either constants (i.e., values matched over the KG) or variables (i.e., placeholders). The computation of answers for a given conjunctive query is carried out as a graph pattern-matching process, where the query cardinality corresponds to the number of subgraphs in the KG that are homomorphic to the query graph. To efficiently execute a (conjunctive) query over a KG, query engines implement optimization techniques that rely on cardinality (alt. selectivity) estimates of sub-queries. Cardinality here is defined as the number of answers produced by an executed query. The optimizer relies on these estimates to determine the order in which the sub-queries are evaluated over the KG to minimize the number of intermediate results, thus, speeding up the query execution. One of the challenges in query optimization over KGs is that conjunctive queries typically contain many joins; in this context, a join occurs when subgraphs of the query graph are connected by the same variable. For example, each pair of edges connected to the same entity can be seen as a join. Another challenge is imposed by the semi-structured nature of KGs, which can lead to skewed data distributions and irregular correlations between entities and predicates. These are difficult to capture accurately in traditional statistical summaries Neumann and Moerkotte (2011); Stefanoni et al. (2018) for different classes of graph queries.

To overcome the limitations of statistical summaries, recent approaches based on Machine Learning (ML) have been proposed to the problem of query cardinality estimation Davitkova et al. (2022); Zhao et al. (2021). These models rely on a Neural Network (NN) or Graph Neural Network (GNN) to learn the underlying data distributions in a KG. For this, the models are trained in a supervised manner using sampled query graphs and their corresponding cardinalities. While it has been shown that these approaches are able to capture arbitrary correlations more effectively than other summaries, they still present several drawbacks. First, during learning, these approaches do not leverage individual characteristics of entities and predicates in the KG, i.e., the initial representations are given by vectors that lack meaning or relevant information about the elements. Second, some of these approaches do not perform well in inductive cases, i.e., when entities occur in queries that were not seen by the model or, in an extreme case, queries that consist of only unseen entities. This form of inductive setting occurs when training on query samples that do not cover all entities in the KG. Third, the training of the state-of-the-art models requires the optimization of a large number of parameters, which means that a large number of training samples are required to perform accurate predictions. This considerably hinders the scalability of state-of-the-art ML approaches to large KGs.

In this work, we present GNCE, Graph Neural Cardinality Estimation, a solution to mitigate the drawbacks of the state of the art. GNCE can be used to predict the cardinality of conjunctive queries over RDF graphs formulated in SPARQL. GNCE is based on a Graph Neural Network (GNN) as this is an expressive model Xu et al. (2019) that can naturally capture the graph structure of the query. Furthermore, our approach exploits recent advances in knowledge graph embeddings (KGE) Rossi et al. (2021) to provide semantically meaningful features of the query elements to the GNN. KGE are low-dimensional vectors that represent entities and predicates that capture latent semantic information about the KG. KGE provide two key advantages to GNCE: to learn data correlations from compact yet meaningful representations, and to generalize to entities unseen in training queries. Lastly, as both the meaning and structure of the query graph are largely covered by the KGE and the GNN, GNCE can afford to have a relatively simple architecture in terms of the number of learnable parameters. This makes GNCE applicable to large KGs thus overcoming the third limitation of current solutions. In summary, the contributions of this work are:

  1. 1.

    We present a novel supervised model, termed GNCE, for query cardinality estimation over KGs that effectively combines Graph Neural Networks and KG embeddings.

  2. 2.

    Our solution is robust in the inductive case, i.e., queries involving unseen entities can be effectively processed by the model.

  3. 3.

    We assemble a fast and lightweight ML architecture that is parameter- and data-efficient, i.e., GNCE can scale up to large KGs.

  4. 4.

    We show that GNCE on average outperforms the state of the art using benchmarks from the literature.

2 Preliminaries

2.1 Knowledge Graphs

A knowledge graph (KG) is defined as a directed graph, where nodes and edges are identified with labels Hogan et al. (2022). In addition, nodes can be assigned to classes111In the literature, sometimes classes are also referred to as labels but these should not be confused with the labels/identifiers., which correspond to groups of nodes that share some characteristics. To represent KGs, there are several data models. In this work, we follow the definition of the Resource Description Framework (RDF) Cyganiak et al. (2014), where every statement is modeled as a triple. For simplicity, the terms defined in RDF (i.e., IRIs, blank nodes222The following definitions assume skolemised blank nodes, i.e., transformed into IRIs., and literals) are treated as labels in the knowledge graph and represented as a universe of terms 𝒰𝒰\mathcal{U}caligraphic_U.

Definition 1 (Knowledge Graph).

A knowledge graph 𝒢𝒢\mathcal{G}caligraphic_G is a directed, labeled graph, represented as a set of triples, where each triple t=(s,p,o)𝑡𝑠𝑝𝑜t=(s,p,o)italic_t = ( italic_s , italic_p , italic_o ) represents a labeled edge between nodes in the graph. In each triple t𝑡titalic_t, s𝑠sitalic_s is the subject, p𝑝pitalic_p is the predicate, and o𝑜oitalic_o is the object, and s,p,o𝒰𝑠𝑝𝑜𝒰s,p,o\in\mathcal{U}italic_s , italic_p , italic_o ∈ caligraphic_U, where 𝒰𝒰\mathcal{U}caligraphic_U is a set. The elements of a triple are referred to as atoms.

Questions over KGs can be formulated as structured queries. We focus, in this work, on conjunctive queries which can be defined also as a graph where atoms are either instantiated to (bound) terms, or variables that act as placeholders that can be matched over the KG.

Definition 2 (Query Graph).

Consider the set of variables 𝒱𝒱\mathcal{V}caligraphic_V such that 𝒱𝒰=𝒱𝒰\mathcal{V}\cap\mathcal{U}=\emptysetcaligraphic_V ∩ caligraphic_U = ∅. A query graph Q𝑄Qitalic_Q is a directed, labeled graph represented as a set of triple patterns, where each triple pattern tp=(sq,pq,oq)𝑡𝑝subscript𝑠𝑞subscript𝑝𝑞subscript𝑜𝑞tp=(s_{q},p_{q},o_{q})italic_t italic_p = ( italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) represents a labeled edge between nodes in the query graph and sq,pq,oq𝒰𝒱subscript𝑠𝑞subscript𝑝𝑞subscript𝑜𝑞𝒰𝒱s_{q},p_{q},o_{q}\in\mathcal{U}\cup\mathcal{V}italic_s start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ∈ caligraphic_U ∪ caligraphic_V.

Answers to a query are then found by matching the corresponding query graph over the KG. This process is known as pattern matching, where atom variables are mapped to elements in the graph. A query solution is a subgraph in the KG to which the query graph is homomorphic. In particular, in KGs with bound atoms (i.e., no variables in the KG), this is equivalent to simply replacing variables in the query and determining whether the resulting graph is a subgraph in the KG.

Definition 3 (Query Solution).

Given a knowledge graph 𝒢𝒢\mathcal{G}caligraphic_G and a query graph Q𝑄Qitalic_Q, consider a map** μ:𝒱𝒰:𝜇maps-to𝒱𝒰\mu:\mathcal{V}\mapsto\mathcal{U}italic_μ : caligraphic_V ↦ caligraphic_U. We denote μ(Q)𝜇𝑄\mu(Q)italic_μ ( italic_Q ) as the resulting set of triples obtained by applying μ(v)𝜇𝑣\mu(v)italic_μ ( italic_v ) for each variable v𝑣vitalic_v in the triple patterns of Q𝑄Qitalic_Q. If μ(Q)𝒢𝜇𝑄𝒢\mu(Q)\subseteq\mathcal{G}italic_μ ( italic_Q ) ⊆ caligraphic_G, then μ(Q)𝜇𝑄\mu(Q)italic_μ ( italic_Q ) is a subgraph of 𝒢𝒢\mathcal{G}caligraphic_G and called query solution to Q𝑄Qitalic_Q over 𝒢𝒢\mathcal{G}caligraphic_G.

The cardinality is the number of solutions to a query:

Definition 4 (Query Cardinality).

The cardinality of a conjunctive query Q𝑄Qitalic_Q over 𝒢𝒢\mathcal{G}caligraphic_G, denoted Q𝒢subscriptnorm𝑄𝒢||Q||_{\mathcal{G}}| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT, is the number of different solutions as of Definition 3:

Q𝒢=|{μ:μ(Q)𝒢vars(Q)=dom(μ)}|subscriptnorm𝑄𝒢conditional-set𝜇𝜇𝑄𝒢𝑣𝑎𝑟𝑠𝑄𝑑𝑜𝑚𝜇\displaystyle||Q||_{\mathcal{G}}=|\{\mu:\mu(Q)\subseteq\mathcal{G}\ \land vars% (Q)=dom(\mu)\}|| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT = | { italic_μ : italic_μ ( italic_Q ) ⊆ caligraphic_G ∧ italic_v italic_a italic_r italic_s ( italic_Q ) = italic_d italic_o italic_m ( italic_μ ) } |

vars(Q)𝑣𝑎𝑟𝑠𝑄vars(Q)italic_v italic_a italic_r italic_s ( italic_Q ) represents the variables that occur in Q𝑄Qitalic_Q, and dom(μ)𝑑𝑜𝑚𝜇dom(\mu)italic_d italic_o italic_m ( italic_μ ) denotes the set of variables in μ𝜇\muitalic_μ.

2.2 Knowledge Graph Embeddings

Knowledge graph embeddings (KGE) are low-dimensional vectors of the atoms (entities and predicates) in a KG. These vectors encode latent semantic information about the atoms that is not explicitly available in the KG. For this reason, KGE usually achieve superior results in downstream tasks Rossi et al. (2021), e.g., link prediction, entity classification, clustering, etc., in comparison to approaches that only rely on the (explicit) statements and the structure of the KG.

To compute KGE, existing models for representation learning over KGs comprise an embedding space, a scoring function f𝑓fitalic_f, and a suitable loss function l𝑙litalic_l. The embedding space is typically a vector space over real or complex numbers, with dimensions ranging from dozens to a few hundred. The scoring function f𝑓fitalic_f assigns a score to a triple (s,p,o)𝑠𝑝𝑜(s,p,o)( italic_s , italic_p , italic_o ) where a larger score means that the triple is more probable in a KG 𝒢𝒢\mathcal{G}caligraphic_G under the model, while lower scores indicate a lower probability. The loss function l𝑙litalic_l is used to actually fit the model and drives it to assign high scores to true triples – statements that are assumed correct in the KG – and low scores to false triples – statements that do not and should not exist in the KG. An overview of state-of-the-art KGE is provided by Rossi et al. Rossi et al. (2021).

Due to differences in their components and learning process, different KGE models are able to capture different latent aspects of KGs, which makes certain KGE more suitable for specific tasks. For instance, some KGE models (e.g., translational models Rossi et al. (2021)) learn from individual triples in the KG, which are well suited for link prediction. Other models (e.g., RDF2Vec Ristoski and Paulheim (2016) or Ridle Weller and Acosta (2021)) can learn from larger structures like subgraphs, which are suited for tasks that account for the semantic similarity and relatedness of atoms like entity type prediction or clustering. In this work, we are interested in the latter group of KGE models, as learning from subgraphs allows for capturing the correlations of atoms in the KG effectively. This is achieved with RDF2Vec Ristoski and Paulheim (2016) which implements the skip-gram model Mikolov et al. (2013) based on random walks of atoms in a KG. A random walk over a KG is generated by traversing the graph starting from an arbitrary node, following an edge to another node, and so on. Given a random walk of atoms w1,,wWsubscript𝑤1subscript𝑤𝑊w_{1},...,w_{W}italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_W end_POSTSUBSCRIPT (which includes nodes and predicates) in a KG of length W𝑊Witalic_W, the objective of RDF2Vec is to maximize the following log probability (i.e. minimize its negative, which can be seen as the loss function in KGE):

1Wt=1Wcjc,j0logp(wt+j|wt)1𝑊superscriptsubscript𝑡1𝑊subscriptformulae-sequence𝑐𝑗𝑐𝑗0𝑝conditionalsubscript𝑤𝑡𝑗subscript𝑤𝑡\frac{1}{W}\sum_{t=1}^{W}\sum_{-c\leq j\leq c,j\neq 0}\log p(w_{t+j}|w_{t})divide start_ARG 1 end_ARG start_ARG italic_W end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT - italic_c ≤ italic_j ≤ italic_c , italic_j ≠ 0 end_POSTSUBSCRIPT roman_log italic_p ( italic_w start_POSTSUBSCRIPT italic_t + italic_j end_POSTSUBSCRIPT | italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) (1)

That is, given a target atom wtsubscript𝑤𝑡w_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, RDF2Vec tries to maximize the probability of all atoms (summation over j𝑗jitalic_j) that are at most c𝑐citalic_c hops away in the walk. RDF2Vec does that for every atom in the random walk (summation over t𝑡titalic_t). The conditional probability of an atom wosubscript𝑤𝑜w_{o}italic_w start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT given another atom wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which can be seen as the analog to the scoring function in other KGE methods, is given as:

p(wo|wi)=exp(vwovwi)w=1|VE|exp(vwvwi)𝑝conditionalsubscript𝑤𝑜subscript𝑤𝑖subscriptsuperscript𝑣topsubscript𝑤𝑜subscript𝑣subscript𝑤𝑖superscriptsubscript𝑤1𝑉𝐸subscriptsuperscript𝑣top𝑤subscript𝑣subscript𝑤𝑖p(w_{o}|w_{i})=\frac{\exp(v^{\prime\top}_{w_{o}}v_{w_{i}})}{\sum_{w=1}^{|V\cup E% |}\exp(v^{\prime\top}_{w}v_{w_{i}})}italic_p ( italic_w start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT | italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = divide start_ARG roman_exp ( italic_v start_POSTSUPERSCRIPT ′ ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_w = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | italic_V ∪ italic_E | end_POSTSUPERSCRIPT roman_exp ( italic_v start_POSTSUPERSCRIPT ′ ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_ARG (2)

Here, vwjsubscript𝑣subscript𝑤𝑗v_{w_{j}}italic_v start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes the input vector (embedding) of atom wjsubscript𝑤𝑗w_{j}italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, while vwjsubscriptsuperscript𝑣subscript𝑤𝑗v^{\prime}_{w_{j}}italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes the so called output vector of wjsubscript𝑤𝑗w_{j}italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Those have the same dimension as the input vector but are not used as the final embeddings. They are introduced to make the model more expressive and have different representations for target atoms and context atoms. As we can see, the unnormalized conditional probability is estimated as the exponent of the dot product between the input and output vectors of both atoms. This dot product gets maximized if the angle between the two vectors is zero. That is the case if they are the same vector or lay on the same 1-D subspace. Thus, the optimization of Eq. (1) drives embeddings of atoms with the same context to be close to each other. That is especially true for atoms that occur together in the graph, i.e. when the joint probability of observing them is high. Note that a normalized probability is obtained by dividing by the sum of all possible probabilities (often termed partition function). That sum is difficult to calculate and in practice approximated Ristoski and Paulheim (2016).

The important takeaway point here is that the embeddings generated by RDF2Vec encode the atom’s relatedness and correlate with the joint probability of observing them. At first glance, it might appear as if the embeddings encode the sampled paths, but since a single entity participates in many walks, rich contextual information from the neighboring nodes and edges is encoded. Our hypothesis is thus that RDF2Vec embeddings are suitable for atom representation in estimating query cardinalities over KGs.

2.3 Graph Neural Networks

A Graph Neural Network (GNN) is a special form of Neural Network (NN) that is tailored to graph-structured data. The distinct feature of GNNs is that they are invariant with respect to a permutation of the nodes of the input graph Bronstein et al. (2021). This means that the output of the model is the same even if the atoms in the graph are reordered. Permutation invariance makes GNNs parameter and data efficient Bronstein et al. (2021). The model is parameter efficient as it needs fewer learnable parameters because it is not necessary to learn different permutations of the same graph structure. It is also data efficient as the model needs a relatively low number of examples to perform well. One typical task GNNs are used for is graph-level prediction. Here, given a graph as input, the GNN predicts quantities that hold for the whole graph. E.g. if a molecular graph is toxic or a social network is racist. Important for us, predicting the cardinality given a query graph can be considered a graph-level prediction.

GNNs work under the message-passing framework, where in every layer k𝑘kitalic_k in the GNN the input representations of the nodes are updated based on messages from all connected neighbors. The most generic message-passing function for a node is given by Team (2022):

xi(k)=γ(k)(xi(k1),j𝒩(i)ϕ(k)(xi(k1),xj(k1),ej,i))superscriptsubscript𝑥𝑖𝑘superscript𝛾𝑘superscriptsubscript𝑥𝑖𝑘1subscript𝑗𝒩𝑖superscriptitalic-ϕ𝑘superscriptsubscript𝑥𝑖𝑘1superscriptsubscript𝑥𝑗𝑘1subscript𝑒𝑗𝑖x_{i}^{(k)}=\gamma^{(k)}\left(x_{i}^{(k-1)},\square_{j\in\mathcal{N}(i)}\phi^{% (k)}\left(x_{i}^{(k-1)},x_{j}^{(k-1)},e_{j,i}\right)\right)italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = italic_γ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT , □ start_POSTSUBSCRIPT italic_j ∈ caligraphic_N ( italic_i ) end_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT , italic_e start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT ) ) (3)

Here, γ𝛾\gammaitalic_γ and ϕitalic-ϕ\phiitalic_ϕ are differentiable functions, and \square is a differentiable as well as permutation invariant function. 𝒩isubscript𝒩𝑖\mathcal{N}_{i}caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the set of direct neighboring nodes of the i𝑖iitalic_i-th node. For a node xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, its previous representation denoted xi(k1)superscriptsubscript𝑥𝑖𝑘1x_{i}^{(k-1)}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT gets non-linearly transformed to a new representation xi(k)superscriptsubscript𝑥𝑖𝑘x_{i}^{(k)}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT by combining xi(k1)superscriptsubscript𝑥𝑖𝑘1x_{i}^{(k-1)}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT with neighboring nodes’ representations as well as the edges between them (the messages). In order to perform graph classification, the individual node features of all nodes need to be combined into a fixed-length vector. For that, a function that is permutation-invariant and works with a varying number of nodes in the graph is suitable. Practically used examples here are max-pooling, mean-pooling, or sum-pooling. The resulting vector represents the whole graph and can be used to predict the desired graph-level quantities Wu et al. (2019).

The initial representation of a node, xi0superscriptsubscript𝑥𝑖0x_{i}^{0}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT, depends on the use case. Possible options are one-hot or binary encoding to denote the id of the node or more informative features. E.g. if the node represents a person in a social graph, one dimension could represent the height of the person, another the age, etc. More meaningful features are superior as they provide crucial information to the GNN to solve the desired problem. They also free the GNN from learning internal representations of the nodes, as would be the case for one-hot or binary encoding (as the feature does not carry any meaning except for the identity of the node) where any semantics of the node would need to be learned from the data and integrated into the model. As KGE provide semantic information about the given entity, they are a promising option as initial node features.

3 Related Work

Cardinality estimation is a longstanding research problem in relational databases and proposed solutions have been taken as starting points for graph-structured data like KGs. Traditional approaches rely on summaries of the dataset but, more recently, learning-based solutions have been proposed. In the following, we discuss relevant work in both relational and graph-structured data.

Approaches for Relational Databases

Traditionally, cardinality estimation is based on sampling and histograms. Histograms can theoretically provide the full joint probability over attributes and join predicates, however, they are computationally prohibitive, since they grow exponentially with the number of attributes. One therefore often employs independence and uniformity assumptions over the data, which leads to very high cardinality estimation errors. Those can be reduced by computing an upper bound on the cardinality estimate Deeds et al. (2023). Other approaches to overcome this, e.g., sampling approaches Cormode et al. (2011); Vengerov et al. (2150); Li et al. (2016), have been devised that do not make any independence assumptions about the data. By evaluating the query over a sample from the data, the full joint probability is evaluated on that sample. As the sample size approaches the size of the real data, the true cardinality over the data is approached. Yet, in practice, the sample size is usually much smaller than the dataset to compute the estimates in a reasonable time. Therefore, the accuracy of the estimate highly depends on the sampling technique as well as the data distribution, and the sampling might need to run long in order to generate a good estimate. Approaches in the context of sampling are, e.g., Correlated Sampling Vengerov et al. (2150), Wanderjoin Li et al. (2016), and Join Sampling with Upper Bounds (jsub) Zhao et al. (2018). Correlated Sampling Vengerov et al. (2150) improves over simple Bernoulli sampling by sampling according to hashed values of the attributes, thus, reducing the variance of the estimate. Wanderjoin Li et al. (2016) represents the query and database as a query graph and data graph and estimates cardinalities by performing random walks on the data graph in coherence with the query graph. Join Sampling with Upper Bounds performs sampling similar to Wanderjoin but extends the estimate to provide an upper bound of the cardinality. The methods are summarized in Park et al. (2020). For a comprehensive introduction to sampling and histogram approaches for cardinality estimation, consider the survey by Cormode et al. Cormode et al. (2011).

To circumvent the exponential explosion of compute- and memory requirements by statistical summaries, learning approaches have been proposed. Sun et al. Sun et al. (2021) group these approaches into the so-called data models and query models.

Data models first aim at predicting the probability of the data and then estimate the query cardinality by sampling from the learned probability distribution. Some examples of these solutions are based on Bayesian Networks Getoor et al. (2001); Tzoumas et al. (2011); Halford et al. (2019). All these approaches aim at representing the joint probability by factorizing it into a product of smaller conditional probabilities Halford et al. (2019). The problem is that, in practice, the factorization can require capturing joint correlations of many attributes in the data, thus, growing the network size also exponentially. For this reason, in practice, Bayesian Networks still make independence assumptions to a great extent Halford et al. (2019). To overcome these limitations, Sum Product Networks implemented in DeepDB Hilprecht et al. (2020) are used as graphical models to approximate the data joint probability. Compared to Bayesian Networks, the memory of Sum Product Networks grows polynomial w.r.t. the database size.

Query models, instead, aim at predicting the cardinality given a query. Most of these approaches are based on Neural Networks (NNs). For example, Kipf et al. Kipf et al. (2019) present Multi-Set Convolutional Networks (MSCN), which represents queries by using table, join, and predicate sets. Woltmann et al. Woltmann et al. (2019) also propose an NN-based approach but this focuses on local parts of the data. The solution by Zhao et al. Zhao et al. (2022) is based on NN Gaussian Processes (NNGP) to handle uncertainty to improve the accuracy of predictions. While the latter solutions are closer to our work, these NNs are tailored to queries over relations and cannot directly be applied to graphs.

Approaches for Graph-Structured Data

Early works on cardinality estimation for graphs take inspiration from solutions for relational databases by using histograms and assuming attribute independence Stocker et al. (2008); Shironoshita et al. (2007); Neumann and Weikum (2009). More recent approaches try to model correlations between structures of the query. Neumann et al. Neumann and Moerkotte (2011) propose Characteristic Sets (CSET), which are synopses that count the number of entities with the same set of predicates. CSET provides exact cardinality estimates for star-shaped DISTINCT queries where all predicates are instantiated and objects are unbounded (variables). For other classes of queries, Neumann et al. Neumann and Moerkotte (2011) decompose the query into star-shaped subqueries, calculate their cardinality using CSET, and estimate the final cardinality by assuming independence between the sub-queries. SUMRDF Stefanoni et al. (2018) is also based on graph summarization and computes typed summaries where nodes that belong to the same class333In RDF KGs, the class of a node is provided with the predicate rdf:type. and have similar predicate distributions are grouped together. In addition to summaries, sampling-based approaches have also been proposed for KGs. For example, impr Chen and Lui (2017) estimates the query cardinality through random walks on the data that match the query graph. Lastly, approaches based on Bayesian Networks have also been proposed for KGs Huang and Liu (2011). All the solutions aforementioned are tailored to RDF KGs. However, similarly to their relational counterparts, the structures computed with these approaches can quickly grow w.r.t. the dataset size or are only accurate for specific classes of queries.

Table 1: Characteristics of the learned approaches
Approach ML Model Atom Representation Model Size Embeddings Perm. Inv. Assumptions Inductivity (Empirical)
LMKG Davitkova et al. (2022) NN Binary Encoding O(|V|+|E|)𝑂𝑉𝐸O(|V|+|E|)italic_O ( | italic_V | + | italic_E | ) - Entities, Query Shapes
LSS Zhao et al. (2021) GNN ProNE Zhang et al. (2019) Constant |V|𝑉|V|| italic_V | Typed entities, undirected graph Entities, Query Shapes
GNCE (Ours) GNN RDF2Vec Ristoski and Paulheim (2016) Constant |V|+|E|𝑉𝐸|V|+|E|| italic_V | + | italic_E | Entities, Query Shapes

Most similar to our work are LMKG Davitkova et al. (2022) and LSS Zhao et al. (2021), which are approaches based on Machine Learning (ML). Table 1 presents an overview of these solutions, which are discussed in detail next.

LMKG Davitkova et al. (2022) is a supervised model444There is another version of LMKG with unsupervised learning, however, its performance was inferior in comparison to its supervised counterpart. Therefore, we focus only on the supervised setting of LMKG. for learning cardinalities using Neural Networks (NNs) consisting of several nonlinear fully connected layers. The representation of the KG and queries in LMKG works as follows. First, each atom in the KG is assigned an id and transformed into a binary vector representation of that id. Then, a graph query with n𝑛nitalic_n nodes and e𝑒eitalic_e edges is represented with an adjacency tensor as well as featurization matrices to relate atoms in the query to atoms in the KG. In practice, LMKG assumes that the adjacency tensors and featurization matrices of all queries have the same dimensions, meaning that n𝑛nitalic_n and e𝑒eitalic_e are set to the size of the largest query in a workload; this makes the resulting model possibly large. A possible solution is to train several models for different query sizes Davitkova et al. (2022), yet, this induces a managing overhead. After the representations are computed, the input to the NN is the concatenation of the flattened query representations. Because of this, LMKG is not permutation invariant, as the same query where triple patterns are provided in a different order yields different representations. That means that LMKG needs a higher number of parameters and training data points to capture these permutations and produce accurate predictions.

LSS Zhao et al. (2021) applies a Graph Neural Network (GNN) to the graph query. For each node in a query, LSS computes the tree induced by a l𝑙litalic_l-hop BFS (Breadth-First Search) starting at that node. Then, the GNN is applied to all those trees; the results are aggregated by an attention mechanism to finally predict the cardinality. While LSS has been empirically proven to be effective for cardinality estimations over (regular) graphs, the initial representation of query nodes in LSS relies on two aspects that make it less suitable for KGs. First, LSS computes the initial representation of the nodes in the graph query with ProNE Zhang et al. (2019) embeddings, which are tailored to graphs and not KGs, as they do not consider typed and directed edges. Second, each atom is represented with the embedding that results from summing up the embeddings of their corresponding classes in the KG. A problem with the summed class representation is that queries with the same shape but different entities that belong to the same classes have the same representation. This is not ideal for cardinality estimation as the selectivity of these queries can greatly vary due to the different entities mentioned in the query. Additionally, LSS employs a message-passing function that treats the graph as undirected. Furthermore, the attention mechanism introduces a large number of parameters (due to the 𝐐,𝐊,𝐕𝐐𝐊𝐕\mathbf{Q,K,V}bold_Q , bold_K , bold_V matrices and MLP of an attention head Vaswani et al. (2017)), making the resulting model large (with over 2 million parameters in our experiments). LSS also supports active learning, which primarily offers accelerated training. By focusing on poorly performing queries, the model can improve its performance without the need for exhaustive training over the entire dataset. Yet, active learning is orthogonal to the architecture and could be applied to our model identically.

NeurSC Wang et al. (2022) applies several GNNs to both the query graph as well as subgraphs of the data graph. The latter provides additional information, but it is more costly since the graph needs to be sampled for every query. Furthermore, NeurSC is tailored to undirected graphs without edge labels, which is not suitable for KGs. For this reason, we consider that the closest works to ours based on ML are LMKG and LSS. Furthermore, LMKG and LSS, as well as our method GNCE, do not make any independence or uniformity assumptions about the data, as the underlying NN can in principle approximate any skewed distribution.

Lastly, in query processing over KGs, there are related research problems that have been investigated, e.g., worst-case join order optimization Hogan et al. (2019) or the prediction of query execution performance Zhang et al. (2018); Hasan and Gandon (2014); Casals et al. . However, these problems greatly differ in nature and proposed solutions to the ones tackled in this paper, therefore, they are considered out of the scope of this work.

4 Our Approach

We model the task of cardinality estimation over a Knowledge Graph (KG) as a regression process, where we directly use a predictive model to estimate the cardinality of a given query graph. Concretely, we focus on conjunctive queries formulated in SPARQL over RDF KGs.

Our model f𝑓fitalic_f aims to predict the cardinality of a query Q𝑄Qitalic_Q over the Knowledge Graph 𝒢𝒢\mathcal{G}caligraphic_G. This prediction is based on the query’s feature representation, denoted as 𝒬𝒬\mathcal{Q}caligraphic_Q (the specifics of which are detailed later in the text):

f(𝒬)=Q^𝒢Q𝒢𝑓𝒬subscript^norm𝑄𝒢subscriptnorm𝑄𝒢f(\mathcal{Q})=\widehat{||Q||}_{\mathcal{G}}\approx||Q||_{\mathcal{G}}italic_f ( caligraphic_Q ) = over^ start_ARG | | italic_Q | | end_ARG start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ≈ | | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT (4)

Here, f(𝒬)𝑓𝒬f(\mathcal{Q})italic_f ( caligraphic_Q ) yields the estimated cardinality Q^𝒢subscript^norm𝑄𝒢\widehat{||Q||}_{\mathcal{G}}over^ start_ARG | | italic_Q | | end_ARG start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT, which approximates the true cardinality Q𝒢subscriptnorm𝑄𝒢||Q||_{\mathcal{G}}| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT of the query within the graph.

Refer to caption
Figure 1: GNCE Overview. For a KG, embeddings are learned for existing atoms. Next, a GNN is trained using sampled queries, which are represented with the embeddings. Finally, the trained model estimates the cardinality of new queries.

To approximate the true cardinality as outlined in Eq. (4), we choose a neural network representation, due to its capacity for learning complex functions Maiorov and Pinkus (1999). Considering that queries over Knowledge Graphs (KGs) inherently form graph structures (cf. Definition 2), employing a Graph Neural Network (GNN) is a fitting choice. This selection ensures permutation invariance within the query graph and also enhances parameter and data efficiency.

Figure 1 provides an overview of our approach. In the initial phase, GNCE prepares the data necessary for training the machine learning model. This preparation phase, detailed in Section 4.1, involves sampling queries from 𝒢𝒢\mathcal{G}caligraphic_G, computing Knowledge Graph Embeddings, and building meaningful query representations. Subsequently, GNCE employs a GNN to learn a model from these query representations. The learning objective is to minimize the discrepancy between the model’s predicted cardinality and the actual cardinality of the queries, as discussed in Section 4.2. Upon successful training, the GNN model is capable of predicting the cardinality for new queries unseen during the training phase (cf. Section 4.3).

4.1 Data Preparation

The goal of the data preparation is to obtain meaningful representations that can be used to learn a model for cardinality estimation. The data preparation in GNCE is carried out offline and consists of (1) obtaining queries from the KG to use for training, (2) computing embeddings for atoms that occur in the queries, (3) building the featurization of each query using (1) and (2). This process is described in the following.

GNCE first obtains a training set of queries with their true cardinality from a given KG 𝒢𝒢\mathcal{G}caligraphic_G. For this, GNCE randomly samples queries from 𝒢𝒢\mathcal{G}caligraphic_G with different structures, e.g., different shapes like stars or paths, a varying number of triple patterns, and different bound terms. During this process, it is important to sample uniformly across the KG to cover the distribution of data as best as possible and not only fit some regions of likely user queries. Then, for each sampled query Q𝑄Qitalic_Q, the individual occurrences occ(x)𝑜𝑐𝑐𝑥occ(x)italic_o italic_c italic_c ( italic_x ) in 𝒢𝒢\mathcal{G}caligraphic_G of each bound atom x𝑥xitalic_x appearing in Q𝑄Qitalic_Q is calculated. That is, occ(x)𝑜𝑐𝑐𝑥occ(x)italic_o italic_c italic_c ( italic_x ) is the number of triples in 𝒢𝒢\mathcal{G}caligraphic_G that atom x𝑥xitalic_x occurs in any position, i.e., subject, predicate, or object. Those quantities will be informative to the approximation of the overall query cardinality: the higher the number of atom occurrences, the higher the probability that a subgraph of 𝒢𝒢\mathcal{G}caligraphic_G will match the query graph Q𝑄Qitalic_Q. As the occ(x)𝑜𝑐𝑐𝑥occ(x)italic_o italic_c italic_c ( italic_x ) values can be easily obtained using simple queries, it is much better to provide them to the model instead of the model having to learn them.

Next, GNCE calculates embeddings e(x)𝑒𝑥e(x)italic_e ( italic_x ) for atoms (nodes and predicates) in 𝒢𝒢\mathcal{G}caligraphic_G using RDF2Vec Ristoski and Paulheim . Since, as explained in Section 2.2, RDF2Vec embeddings encode relatedness and joint probability, we choose it as our embedding method. With RDF2Vec, nodes (including IRIs and literals) and predicates are treated as elements in the random walks, therefore, RDF2Vec is able to produce embeddings for all types of atoms in the KG. In this way, GNCE also supports literals, even though they are treated equally to IRIs (similar to all compared methods). That is, the ordinal similarity between literals in the different data types is not considered, and we leave a special treatment (e.g. Kristiadi et al. (2019); Preisner and Paulheim (2023)) for future work.

Moreover, GNCE only trains the RDF2Vec embeddings on the entities occurring in the sampled queries instead of the full KG, which greatly improves training speed for large graphs (see Portisch et al. (2020) for details). This is justified as we observed that the resulting model performance is similar when training on the whole graph. However, in a production setting, one might want to calculate embeddings for all entities, or at least for those frequently occurring in queries. Note that the process of embedding learning happens prior to training the model for cardinality estimation, and the embeddings of existing entities are fixed after this. In Section 4.3, we discuss how to handle new entities after training. We assume that all predicates that will appear during testing and later estimation on new queries are covered in the training set.

Lastly, for each sampled query Q𝑄Qitalic_Q, GNCE computes a representation 𝒬𝒬\mathcal{Q}caligraphic_Q. The representation 𝒬𝒬\mathcal{Q}caligraphic_Q consists of replacing each atom in the query graph Q𝑄Qitalic_Q with a final input feature that is meaningful for learning the cardinality model. Here, GNCE builds the final input feature i(x)𝑖𝑥i(x)italic_i ( italic_x ) for an atom x𝑥xitalic_x, which can be a bound or a variable atom. For a bound atom x𝑥xitalic_x, GNCE concatenates its corresponding embedding e(x)𝑒𝑥e(x)italic_e ( italic_x ) and the calculated occurrence occ(x)𝑜𝑐𝑐𝑥occ(x)italic_o italic_c italic_c ( italic_x ), i.e., i(x)=e(x)||occ(x)i(x)=e(x)||occ(x)italic_i ( italic_x ) = italic_e ( italic_x ) | | italic_o italic_c italic_c ( italic_x ). The case of variable atoms in the query graph requires a special vector representation. For each query individually, GNCE randomly assign an increasing numerical id to each variable (e.g. ?s1 gets id 2, ?p1 gets id 1, etc.). Then, GNCE builds the vectors i(x)𝑖𝑥i(x)italic_i ( italic_x ) for a variable atom as follows. The first dimension of the vector is set to the id. The rest of the dimensions are all set to 1111. By that, the model is informed about which atoms are variables, and which variables are the same. Since the assignment is random, the model does not falsely learn that a certain id corresponds with a certain position or label. This is different from other approaches Davitkova et al. (2022); Zhao et al. (2021), where distinct variables in a query are encoded with 0s and treated as if they were the same variable. Finally, the featurization 𝒬𝒬\mathcal{Q}caligraphic_Q is then given by i(x),x𝑖𝑥for-all𝑥i(x),\forall xitalic_i ( italic_x ) , ∀ italic_x. In that way, GNCE jointly encodes local neighborhood and similarity to other atoms, as well as statistical information about the entity.

4.2 Model Learning

In this section, we present the model definition that explains how GNCE learns from the representations 𝒬𝒬\mathcal{Q}caligraphic_Q obtained during data preparation, followed by the details of the model training that describes how GNCE implements supervised learning using true query cardinalities to perform predictions.

Refer to caption
Figure 2: Architecture of the GNN for cardinality estimation. An initial query is represented by entity- and predicate embeddings and the connectivity between those atoms. For clarity, the TPN aggregation is only shown for one node. After node aggregation, the complete query graph is represented by summing all node embeddings. Finally, the query cardinality is estimated by map** the query representation through a multilayer perceptron to a single dimension.
Model Definition.

Figure 2 shows the GNCE architecture that consists of a Graph Neural Network (GNN) that implements Message Passing Layers with a Multi-layer Perceptron (MLP).

The representation 𝒬𝒬\mathcal{Q}caligraphic_Q is the input to our GNN (k=0𝑘0k=0italic_k = 0). First, two Message Passing Layers (k=1,k=2formulae-sequence𝑘1𝑘2k=1,k=2italic_k = 1 , italic_k = 2) are applied to the input. In a preliminary study, we found that using more layers does not improve the model’s performance. Possible explanations for this are that (i) the query graphs are rather small so two steps of message-passing are sufficient, and (ii) fewer layers avoid the oversmoothing problem Chen et al. (2020), where node representations become increasingly similar, which deteriorates the model performance.

For the Message Passing Layers in the GNN, we propose a custom Message Passing function. Our function, which we term TPN Message Passing, is a modulation of the GINECONV Hu et al. (2020) message passing function defined as

xi(k)=hθ(k)((1+ϵ)xi(k1)+j𝒩(i)ReLU(xj(k1)+ej,i))superscriptsubscript𝑥𝑖𝑘superscriptsubscript𝜃𝑘1italic-ϵsuperscriptsubscript𝑥𝑖𝑘1subscript𝑗𝒩𝑖ReLUsuperscriptsubscript𝑥𝑗𝑘1subscript𝑒𝑗𝑖x_{i}^{(k)}=h_{\theta}^{(k)}\left((1+\epsilon)x_{i}^{(k-1)}+\sum_{j\in\mathcal% {N}(i)}\mathrm{ReLU}(x_{j}^{(k-1)}+e_{j,i})\right)italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = italic_h start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ( ( 1 + italic_ϵ ) italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT + ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_N ( italic_i ) end_POSTSUBSCRIPT roman_ReLU ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT + italic_e start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT ) ) (5)

We initially evaluated several different Message Passing Layers but found GINECONV to be superior to the others. This is in line with the theoretical findings by Xu et al. Xu et al. (2019), which states that GINECONV is strictly more expressive than other Message Passing Layers. It also aligns with results from LSS Zhao et al. (2021), which also uses GINECONV as the Message Passing Function.

In Eq. (5), each hθ(k)superscriptsubscript𝜃𝑘h_{\theta}^{(k)}italic_h start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT is a multilayer perceptron (MLP) with a ReLU Activation function. 𝒩(i)𝒩𝑖\mathcal{N}(i)caligraphic_N ( italic_i ) are all neighboring nodes of node i𝑖iitalic_i with corresponding edges ej,isubscript𝑒𝑗𝑖e_{j,i}italic_e start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT. The parameter ϵitalic-ϵ\epsilonitalic_ϵ is an optionally learnable parameter that defaults to zero. Therefore, this function first sums up the neighboring nodes and edges individually, then aggregates those and the current representation of the node xi(k1)superscriptsubscript𝑥𝑖𝑘1x_{i}^{(k-1)}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT again through summation to finally transform them nonlinearly through hθsubscript𝜃h_{\theta}italic_h start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT. Unfortunately, this Message Passing function treats the graph as undirected, as it aggregates the neighbors invariant under the direction of the edge. However, in cardinality estimation, the edge direction can be important information: The triples (spo)𝑠𝑝𝑜(s\rightarrow p\rightarrow o)( italic_s → italic_p → italic_o ) and (ops)𝑜𝑝𝑠(o\rightarrow p\rightarrow s)( italic_o → italic_p → italic_s ) can have vastly different cardinalities. One option to make the message passing directional is to only aggregate incoming edges at each node, i.e., instead of summing over 𝒩(i)𝒩𝑖\mathcal{N}(i)caligraphic_N ( italic_i ), the summation happens over 𝒩+(i)superscript𝒩𝑖\mathcal{N}^{+}(i)caligraphic_N start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ( italic_i ). However, we found that this yields worse results than just summing over 𝒩(i)𝒩𝑖\mathcal{N}(i)caligraphic_N ( italic_i ), presumably because it restricts information flow. Hence, we use the following customized message-passing function:

Definition 5 (TPN Message Passing).

The Triple Pattern Network (TPN) Message Passing transforms a node representation xi(k1)superscriptsubscript𝑥𝑖𝑘1x_{i}^{(k-1)}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT by separately transforming incoming messages (from 𝒩+(i)superscript𝒩𝑖\mathcal{N}^{+}(i)caligraphic_N start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ( italic_i )) and outgoing messages (from 𝒩(i)superscript𝒩𝑖\mathcal{N}^{-}(i)caligraphic_N start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ( italic_i )) like

xi(k)=hθ(k)(xi(k1)\displaystyle x_{i}^{(k)}=h_{\theta}^{(k)}\biggl{(}x_{i}^{(k-1)}\ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = italic_h start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT +j𝒩+(i)ReLU(𝐖(xi(k1)ej,ixj(k1)))+limit-fromsubscript𝑗superscript𝒩𝑖ReLU𝐖superscriptsubscript𝑥𝑖𝑘1normsuperscript𝑒𝑗𝑖superscriptsubscript𝑥𝑗𝑘1\displaystyle+\sum_{j\in\mathcal{N}^{+}(i)}\mathrm{ReLU}(\mathbf{W}(x_{i}^{(k-% 1)}||e^{j,i}||x_{j}^{(k-1)}))\ ++ ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_N start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ( italic_i ) end_POSTSUBSCRIPT roman_ReLU ( bold_W ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT | | italic_e start_POSTSUPERSCRIPT italic_j , italic_i end_POSTSUPERSCRIPT | | italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ) ) +
+j𝒩(i)ReLU(𝐖(xj(k1)||ei,j||xi(k1))))\displaystyle\vspace{-3em}+\sum_{j\in\mathcal{N}^{-}(i)}\mathrm{ReLU}(\mathbf{% W}(x_{j}^{(k-1)}||e^{i,j}||x_{i}^{(k-1)}))\biggr{)}+ ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_N start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ( italic_i ) end_POSTSUBSCRIPT roman_ReLU ( bold_W ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT | | italic_e start_POSTSUPERSCRIPT italic_i , italic_j end_POSTSUPERSCRIPT | | italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k - 1 ) end_POSTSUPERSCRIPT ) ) )

This function makes a case distinction between incoming and outgoing edges. Instead of aggregating the node and edge features through summation, the function concatenates them (||||| |) and linearly projects them to the same dimensionality as xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT using the matrix 𝐖𝐖\mathbf{W}bold_W. Note that it is not necessary to have two different matrices 𝐖+superscript𝐖\mathbf{W}^{+}bold_W start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT, 𝐖superscript𝐖\mathbf{W}^{-}bold_W start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT, since 𝐖𝐖\mathbf{W}bold_W always receives the same encoding of a triple pattern 𝐎𝐏𝐒𝐎norm𝐏𝐒\mathbf{O||P||S}bold_O | | bold_P | | bold_S, where 𝐎𝐎\mathbf{O}bold_O is the feature of the node where the edge points to, 𝐏𝐏\mathbf{P}bold_P is the edge feature between the nodes, and 𝐒𝐒\mathbf{S}bold_S is the feature of the node where the edge originates. Lastly, the MLP hΘ(k)superscriptsubscriptΘ𝑘h_{\Theta}^{(k)}italic_h start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT receives as input sums of triple pattern representations. After the graph nodes have been transformed by the two Message Passing Layers, they need to be aggregated into a fixed-length vector. For that, GNCE applies a Global Sum Aggregation to them. That is, the representations of the nodes are summed up dimension-wise, resulting in a single vector x𝑥xitalic_x with the same dimensionality as the node embeddings: x=lxl(k=2)𝑥subscript𝑙superscriptsubscript𝑥𝑙𝑘2x=\sum_{l}x_{l}^{(k=2)}italic_x = ∑ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k = 2 ) end_POSTSUPERSCRIPT.

The last step of the GNCE model is to transform this vector x𝑥xitalic_x through a final 2-layer MLP (with a ReLU activation in the hidden layer and linear activation in the final layer) into one single dimension (i.e., one value). This value represents the predicted cardinality of the query by GNCE.

Model Training.

To learn the cardinality of queries, GNCE follows a supervised approach, using the query representations 𝒬𝒬\mathcal{Q}caligraphic_Q (processed with the GNN as previously described) and the true cardinalities Q𝒢subscriptnorm𝑄𝒢||Q||_{\mathcal{G}}| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT. The goal during training is to minimize the difference between the predicted cardinality, denoted f(𝒬)𝑓𝒬f(\mathcal{Q})italic_f ( caligraphic_Q ), and the true value. For that, as a loss function, GNCE employs the Mean Squared Error (MSE) between our model output f(𝒬)𝑓𝒬f(\mathcal{Q})italic_f ( caligraphic_Q ) and the logarithm of the true cardinality Q𝒢subscriptnorm𝑄𝒢||Q||_{\mathcal{G}}| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT: =f(𝒬)log(Q𝒢)2subscriptnorm𝑓𝒬subscriptnorm𝑄𝒢2\mathcal{L}=||f(\mathcal{Q})-\log(||Q||_{\mathcal{G}})||_{2}caligraphic_L = | | italic_f ( caligraphic_Q ) - roman_log ( | | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT ) | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Hence, the model learns to predict logarithms of cardinalities, which stabilizes the training due to the large range of possible cardinalities. Although the model is evaluated in terms of q-Error, we found that using MSE as a loss function yields better results than using q-Error directly. We train the model with a batch size of 32323232 and for 50 epochs in all experiments. For optimization of the model, we use the Adam optimizer Kingma and Ba (2015) with a learning rate of e4superscript𝑒4e^{-4}italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. The model training is carried out offline555Related approaches also pre-compute statistics or samples offline., and the learned model can then be used to perform online predictions on new queries that were not part of the training sample, as described next.

4.3 Cardinality Estimation on New Queries

The learned model can now be used to predict the cardinality of new queries. This prediction is computed online, as a query Qn+1subscript𝑄𝑛1Q_{n+1}italic_Q start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT is provided to the approach. For this, GNCE identifies the bound atoms mentioned in Qn+1subscript𝑄𝑛1Q_{n+1}italic_Q start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT, and retrieves the corresponding embeddings which have been pre-computed. For variable atoms, GNCE builds their corresponding vectors, to obtain the representation 𝒬n+1subscript𝒬𝑛1\mathcal{Q}_{n+1}caligraphic_Q start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT. This representation is provided to the trained model, to compute the prediction Qn+1^𝒢subscriptnorm^subscript𝑄𝑛1𝒢||\widehat{Q_{n+1}}||_{\mathcal{G}}| | over^ start_ARG italic_Q start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_ARG | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT.

Model Learning and Deployment in Production Environments.

We now describe how GNCE can be implemented in a real-world environment. The KG 𝒢𝒢\mathcal{G}caligraphic_G is assumed to be stored in a triple store. To obtain sample queries, GNCE generates random queries that are executed over the triple store, either directly or via a SPARQL endpoint. Metadata of the queries such as the true cardinalities and the occ𝑜𝑐𝑐occitalic_o italic_c italic_c values are also obtained by GNCE querying the triple store. Then, for the atoms that occur in the sampled queries, GNCE generates random walks –also querying the triple store– which are used to learn the RDF2Vec embeddings. The pre-computed embeddings are stored in a fast-accessible storage system, e.g., a vector database. Then, GNCE trains the GNN model which is compiled and deployed for production. When a new query arrives, the relevant embeddings are retrieved from the vector database to build the query’s featurization, and the GNN directly predicts the cardinality.

An important aspect within production environments is the addition of new entities to the KG. If a new entity in a query does not yet have a corresponding embedding, GNCE generates a random embedding to represent the new entity. Theoretically, this can deteriorate the performance as the model is making prediction on an unknown entity. Still, in practice, the GNCE performance remains satisfactory (see Section 5.5). This ensures that the approach will run for any query, regardless if all embeddings already exist. Eventually, after a certain period or after a certain number of new entities are added, GNCE can learn the embeddings for new entities offline. In this process, GNCE also updates the occurrences occ(x)𝑜𝑐𝑐𝑥occ(x)italic_o italic_c italic_c ( italic_x ) for atoms related to the new entities; yet, this is a lightweight operation. Afterward, GNCE adds the new embeddings to the vector database, making them available to the deployed GNN model. Please note that all approaches that rely on pre-computed data (e.g., summaries, samples, or ML models) are also affected by changes in the KG, and they also need to update their structures to cover new entities.

5 Experiments

We empirically investigate the effectiveness of the studied approaches (§5.2, §5.3, §5.4), the generalization of the approaches to unseen entities, i.e., the inductive case (§5.5), the runtime efficiency of the approaches (§5.6), and ablation study of GNCE (§5.7), and an application of GNCE in a query optimizer (§5.8). The raw datasets, queries, results, and code are available on GitHub666https://github.com/DE-TUM/GNCE.

5.1 Experimental Setup

5.1.1 Datasets and Queries.

For datasets, we use SWDF Möller et al. (2007), LUBM Guo et al. (2005), YAGO Suchanek et al. (2008), and a subset of Wikidata Wang et al. (2021) that consists of 20 million triples.

Table 2: Dataset characteristics
KG Triples Entities Predicates Classes Typed Entities
SWDF 242256 76711 170 118 28%
LUBM 2688849 664048 18 15 66%
YAGO  58M  13M 92  189K 98%
Wikidata  21M  5M 872  19K 84%

For LUBM and SWDF, we use the same datasets as Davitkova et al. Davitkova et al. (2022), for easier comparison to previous results from the literature. For YAGO, we use the dataset provided by the benchmark tool G-CARE Park et al. (2020). LUBM is a synthetic benchmark modeling a university domain. SWDF is a small size but dense real-world dataset that models authors, papers, and conferences. Lastly, YAGO and Wikidata are large cross-domain KGs. Table 2 summarizes the statistics about these datasets. For the queries, we generated777https://github.com/DE-TUM/rdf-subgraph-sampler star and path queries as reported in Table 3. We additionally generated more complex shapes over YAGO and retrieved real queries for Wikidata (detailed in Section 5.4). The cardinality of the queries is in the range [e0,e6]superscript𝑒0superscript𝑒6[e^{0},e^{6}][ italic_e start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT ] for LUBM and SWDF, and [e0,e8]superscript𝑒0superscript𝑒8[e^{0},e^{8}][ italic_e start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT 8 end_POSTSUPERSCRIPT ] for YAGO and Wikidata. We randomly shuffled all queries and used 80% of them for training and evaluated all approaches on the remaining 20%. Prior to this, we removed any duplicates of queries in the dataset (considering isomorphic queries with different variable names).

Table 3: Queries used for training and testing
KG Star Queries Path Queries
SWDF Overall: 116645 Overall: 60739
2tp 3tp 5tp 8tp 2tp 3tp 5tp 8tp
27843 29461 29690 29651 538 19033 550 40618
LUBM Overall: 113855 Overall: 55336
2tp 3tp 5tp 8tp 2tp 3tp 4tp 8tp
25356 29127 29727 29645 17784 33047 4505 0
YAGO Overall: 91162 Overall: 87775
2tp 3tp 5tp 8tp 2tp 3tp 5tp 8tp
13968 24416 24509 28269 19378 33807 17990 16600
Wikidata Overall: 109545 Overall: 101313
2tp 3tp 5tp 8tp 2tp 3tp 5tp 8tp
29975 29936 29898 19736 29948 29951 29997 11417

5.1.2 Compared Methods.

We compare GNCE to the implementation provided by G-CARE Park et al. (2020) of summary-based approaches (SUMRDF Stefanoni et al. (2018) and Characteristic Sets (CSETNeumann and Moerkotte (2011)) and sampling-based approaches (impr Chen and Lui (2017), Wanderjoin Li et al. (2016), and jsub Zhao et al. (2018)), using the recommended default parameters, and refer to the paper Park et al. (2020) for details. We also compare to state-of-the-art learning-based approaches (LMKG Davitkova et al. (2022) and LSS Zhao et al. (2021)). For the sampling-based approaches, we report on their average performance over 30 samples (the default setting in G-CARE). For the learning-based approaches, we trained one model per dataset and query type and set up the number of parameters as recommended by the authors. The following table summarizes the number of parameters for the learning-based approaches. Note that GNCE has the lowest number of parameters, with 18x less w.r.t. LSS. Fewer parameters typically translate to a smaller memory footprint of the approach and a faster execution time. It also indicates that the model is less prone to overfitting.

Number of model parameters
KG GNCE (Ours) LMKG LSS SWDF 1.2e51.2superscript𝑒51.2\cdot e^{5}1.2 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 4.5e54.5superscript𝑒54.5\cdot e^{5}4.5 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 2.2e62.2superscript𝑒62.2\cdot e^{6}2.2 ⋅ italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT LUBM 1.2e51.2superscript𝑒51.2\cdot e^{5}1.2 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 2.9e52.9superscript𝑒52.9\cdot e^{5}2.9 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 2.2e62.2superscript𝑒62.2\cdot e^{6}2.2 ⋅ italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT YAGO 1.2e51.2superscript𝑒51.2\cdot e^{5}1.2 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 3.8e53.8superscript𝑒53.8\cdot e^{5}3.8 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 2.2e62.2superscript𝑒62.2\cdot e^{6}2.2 ⋅ italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT Wikidata 1.2e51.2superscript𝑒51.2\cdot e^{5}1.2 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 3.1e53.1superscript𝑒53.1\cdot e^{5}3.1 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 2.2e62.2superscript𝑒62.2\cdot e^{6}2.2 ⋅ italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT

5.1.3 Implementation Details

GNCE is implemented using Python 3.8 and Pytorch Geometric. For computing the RDF2Vec embeddings, we use the implementation by Vandewiele et al. Vandewiele et al. (2022). We train the RDF2Vec model for 10 epochs. For each entity, we sample 5 random walks with a maximum depth of 4 and learn the embeddings as presented in Section 2.2. The dimensionality of the embeddings is set to 100. Finally, we store the embeddings for later use by GNCE. All experiments, except for LSS on YAGO, have been conducted on a machine with 16 GB RAM, Intel Core [email protected], and an NVIDIA GeForce RTX 3050 GPU. Since LSS requires extensive amounts of RAM as it loads the whole KG into memory for training, it was trained on a machine with 512 GB RAM, an Intel Xeon Silver 4314 [email protected], and an Nvidia A100 GPU.

5.1.4 Evaluation Metric

Similar to previous works Zhao et al. (2021), we evaluate the effectiveness of the approaches in terms of the q-Error:

q-Error(Q)=max(Q𝒢Q^𝒢,Q^𝒢Q𝒢)q-Error𝑄subscriptnorm𝑄𝒢subscript^norm𝑄𝒢subscript^norm𝑄𝒢subscriptnorm𝑄𝒢\text{q-Error}(Q)=\max\left(\frac{||Q||_{\mathcal{G}}}{\widehat{||Q||}_{% \mathcal{G}}},\frac{\widehat{||Q||}_{\mathcal{G}}}{||Q||_{\mathcal{G}}}\right)q-Error ( italic_Q ) = roman_max ( divide start_ARG | | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT end_ARG start_ARG over^ start_ARG | | italic_Q | | end_ARG start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT end_ARG , divide start_ARG over^ start_ARG | | italic_Q | | end_ARG start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT end_ARG start_ARG | | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT end_ARG ) (6)

5.2 Cardinality Estimation for Star Queries

In this section, we study the effectiveness of the approaches for star queries, where the subject position is the join variable.

In Table 4, we report the mean q-Error of all approaches in each dataset.888Results for SUMRDF on YAGO and Wikidata could not be reported, as the summary building for that graph did not finish after 12 hours. Overall, we observe that the q-Error increases along with the size of the KGs.

This behavior is expected, as capturing correlations in larger datasets is more challenging, and queries with large cardinalities naturally occur more frequently in these datasets. These aspects negatively impact the estimations. The sampling and summary methods behave similarly to some extent, except for jsub which produces large errors in YAGO. CSET does not perform well in the star queries tested, as these may have bound objects, which can create overestimations in CSET.

Among the learned approaches, GNCE exhibits the best overall performance. Interestingly, LSS does not perform well in these KGs. We hypothesize that this is because the learned node representations based on classes of LSS cannot distinguish between queries with similar shapes but with different cardinalities. This behaviour of LSS is clearly observed in SWDF. While all other approaches achieve their best performance in SWDF (smallest dataset), LSS produces large q-Errors because the number of typed entities is considerably small (only 22%, cf. Table 2). In the case of untyped entities, the LSS entity representation defaults to a generic vector of zeros, which does not contain any information. In contrast, GNCE does not make any assumption about the types of entities, and can still encode meaningful representations of the entities through the KG embeddings.

Refer to caption
(a) SWDF
Refer to caption
(b) LUBM
Refer to caption
(c) YAGO
Refer to caption
(d) Wikidata
Figure 3: Boxplots of q-Errors of star queries
Refer to caption
(a) SWDF
Refer to caption
(b) LUBM
Refer to caption
(c) YAGO
Refer to caption
(d) Wikidata
Figure 4: q-Errors for star-shaped queries, grouped by the true cardinalities of the queries
Refer to caption
(a) SWDF
Refer to caption
(b) LUBM
Refer to caption
(c) YAGO
Refer to caption
(d) Wikidata
Figure 5: Boxplots of q-Errors for path queries
Refer to caption
(a) SWDF
Refer to caption
(b) LUBM
Refer to caption
(c) YAGO
Refer to caption
(d) Wikidata
Figure 6: q-Errors for path queries, grouped by the true cardinalities of the queries
Table 4: Mean q-Errors for star queries
Approach SWDF LUBM YAGO Wikidata
Sampling Wanderjoin 129.00 232.00 18713.00 34564.00
impr 141.00 237.00 2605.00 169241.00
jsub 169.00 409.00 15 e6absentsuperscript𝑒6\cdot e^{6}⋅ italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 40055.00
Summary CSET 94.00 904.00 41586.00 37310.00
SUMRDF 130.00 233.00 - -
Learned LMKG 3.00 2.00 9.00 10267.00
LSS 1187.00 283.00 502.00 4089.00
GNCE 1.42 1.22 1.96 4.20

To further understand the performance of the approaches, we analyze the distributions of q-Errors. Figure 3 shows the distribution of q-Errors as boxplots; under- and overestimations are depicted by values below and above the horizontal line (perfect q-Error), respectively. We start by looking at the sampling methods (Wanderjoin, impr, and jsub), which tend to underestimate the cardinalities, in particular impr. This behavior was already observed by Park et al. Park et al. (2020) for other datasets. The summary-based approaches also tend to produce underestimations, except for CSET in YAGO and Wikidata. A closer inspection of the results reveals that CSET largely overestimates the cardinalities for highly selective queries with bound objects. For the learned-based methods, the q-Errors are concentrated around the optimal value of 1 for GNCE and LMKG. LSS tends to overestimate slightly on large parts of the queries, but also shows severe underestimation. In comparison to LSS and LMKG, the maximum under-and overestimations of GNCE, as well as the quantiles, are more symmetric around the origin than for the other methods, indicating a more stable behavior across all datasets.

Next, we analyze the average q-Error of the approaches at different cardinality sizes (cf. Figure 4). The results show that the q-Error tends to grow with increasing query cardinality, except for GNCE and LMKG. Non-learning-based approaches exhibit good performance in highly selective queries with a result size range < 5. These queries typically have many bound atoms, for which sampling-based methods like impr and Wanderjoin might have collected (nearly) exact statistics. The only exception is CSET in YAGO, yet, these results are due to the large q-Errors produced by overestimations of CSET as previously explained. For learning-based approaches, the low number of training queries in the higher cardinality range can contribute to large errors. Still, GNCE, with its inductive bias tailored to the graph nature of queries and more informative feature representation due to embeddings, requires less training data to produce accurate predictions. For this reason, GNCE consistently has the lowest q-Error across all result sizes, with the only exception being the result size range < 5 in YAGO and Wikidata, where it is slightly outperformed by Wanderjoin.

5.3 Cardinality Estimation for Path Queries

Next, we examine the effectiveness of the approaches for path queries, with joins occurring between subjects and objects of triple patterns. Table 5 presents the mean q-Errors. Most methods have higher q-Errors in comparison to the star-shaped queries (cf. Table 4). This is consistent with previous results reported by Park et al. Park et al. (2020). Also interestingly, all methods are at least an order of magnitude worse on SWDF compared to the other datasets. This can be due to the high density of the SWDF graph, i.e., there are lots of possible paths from each starting point which makes cardinality estimation significantly more error-prone. The effectiveness of the learning-based approaches is also affected by the fewer (training) queries in the dataset (cf. Table 3).999Please note that this is not a problem of the query generator, but rather that KGs in general exhibit a small diameter Zloch et al. (2019), which makes the number of different path queries considerable smaller than the number of star queries. Overall, the learned approaches exhibit the best performance, followed by sampling-based methods, and lastly summary-based methods. Still, GNCE outperforms the state of the art in path queries by several orders of magnitude.

Table 5: Mean q-Errors for path queries
Approach SWDF LUBM YAGO Wikidata
Sampling Wanderjoin 22 e4absentsuperscript𝑒4\cdot e^{4}⋅ italic_e start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT 389.00 9730.00 67001.00
impr 73.7e473.7superscript𝑒473.7\cdot e^{4}73.7 ⋅ italic_e start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT 429.00 1722.00 146623.00
jsub 37.4e437.4superscript𝑒437.4\cdot e^{4}37.4 ⋅ italic_e start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT 414.00 10952.00 69822.00
Summary CSET 1e471superscript𝑒471\cdot e^{47}1 ⋅ italic_e start_POSTSUPERSCRIPT 47 end_POSTSUPERSCRIPT 63e1063superscript𝑒1063\cdot e^{10}63 ⋅ italic_e start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT 2e292superscript𝑒292\cdot e^{29}2 ⋅ italic_e start_POSTSUPERSCRIPT 29 end_POSTSUPERSCRIPT 9e309superscript𝑒309\cdot e^{30}9 ⋅ italic_e start_POSTSUPERSCRIPT 30 end_POSTSUPERSCRIPT
SUMRDF 2e52superscript𝑒52\cdot e^{5}2 ⋅ italic_e start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 393e6393superscript𝑒6393\cdot e^{6}393 ⋅ italic_e start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT - -
Learned LMKG 1820.00 2.00 69.00 4220.00
LSS 1187.00 179.00 1075.00 1204.00
GNCE 12.50 1.14 3.78 14.00

Next, we analyze the distribution of q-Errors in Figure 5. All the sampling methods tend to underestimate the query cardinalities, especially in SWDF as previously explained due to the density of this dataset. Furthermore, their performance greatly deteriorates for YAGO and Wikidata (in comparison to LUBM) due to the size and the known semi-structuredness of these KGs. The estimates computed by CSET are unrealistically high, a phenomenon that was already reported by Davitkova et al. Davitkova et al. (2022). These results confirm that CSET summaries do not perform well at capturing join correlations between subjects and objects, as expected. In the learning-based approaches, LSS tends to produce overestimations. This was already observed in the star queries (cf. Section 5.2), indicating that this is a recurrent behavior of LSS. In contrast, both GNCE and LMKG tend to underestimate, except for LMKG on Wikidata. Still, the spread of the errors shown by the whiskers is smaller for GNCE than LMKG.

In Figure 6, we examine the q-Errors categorized by query cardinality. Figure 6(a) omits the bars for CSET due to the extremely high values. Similar to the trend observed for star-shaped queries, we found that the q-Error increases with the cardinality of queries for all approaches except for GNCE, which exhibited a mostly consistent mean q-Error. Second, in SWDF, YAGO, and Wikidata, the sampling-based methods marginally outperform GNCE in the small cardinality range <5. Interestingly, on SWDF, for queries with cardinalities between 54superscript545^{4}5 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT and 55superscript555^{5}5 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT, LSS performs best out of all approaches in this particular case. Lastly, GNCE performs better than LMKG in the smaller cardinality range on Wikidata, while LMKG is better in the high cardinality range.

5.4 Cardinality Estimation for User and Complex Queries

To analyze the approaches’ effectiveness beyond synthetic star- and path queries, we also investigate more challenging types of queries, namely real user queries and more complex query shapes.

Real user queries were retrieved from the Wikidata SPARQL query service log101010https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en. We selected conjunctive queries that generated answers over our used Wikidata subset, yielding 11,832 queries, 90% of which had a single triple pattern. We note that this high number might be due to filtering for conjunctive queries and ones that produce answers over the subset, as other works report a lower percentage of similar-to\sim77%  Bonifati et al. (2020). The learning-based methods were initially trained on the synthetic star and path queries, then finetuned on 80% of the user queries. All methods are evaluated on the remaining 20%. LSS could only be evaluated on queries with more than one triple pattern because the provided code does not support single triple pattern queries. Table 6 shows mean q-Errors, where GNCE outperforms all other methods. Figure 7(a) shows the mean q-Error, grouped by query sizes. LMKG performs best in the single triple pattern case, while GNCE performs best in the remaining ranges. The corresponding boxplot (Figure 7(b)) shows similar results to the synthetic queries.

Refer to caption
(a) Wikidata User Queries
Refer to caption
(b) Wikidata User Queries
Refer to caption
(c) YAGO Complex
Refer to caption
(d) YAGO Complex
Figure 7: Results for complex queries on YAGO and Wikidata user queries
Table 6: Mean q-Errors for complex and user queries
Query Wanderjoin impr jsub CSET SUMRDF LMKG LSS GNCE
User 3312 21984 3426 46e1146superscript𝑒1146\cdot e^{11}46 ⋅ italic_e start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT - 268 198 13
Complex 226282 508309 227163 3e273superscript𝑒273\cdot e^{27}3 ⋅ italic_e start_POSTSUPERSCRIPT 27 end_POSTSUPERSCRIPT - 113 2519 8.4

For more complex shape queries, we generated flower and snowflake queries over the YAGO dataset. A flower consists of a path query with one star at one end. A snowflake is a path, where at both ends a star is attached. We generated a total of 72,742 complex queries with 2 to 7 triple patterns. Table 6 shows the mean q-Errors, where all methods perform worse compared to star and path queries over YAGO, although the distribution of query sizes is comparable. That shows that the complexity of the query structure directly impacts the performance of cardinality estimators. Further, GNCE once more outperforms the other approaches in terms of the mean q-Error. Figure 7 (c) and (d) detail error distributions, reiterating Wanderjoin’s good performance at low cardinality, and GNCE’s dominance at high cardinality. We further observe that the boxplot of Wanderjoin has the narrowest box around perfect predictions, i.e., it produces high-quality predictions for a large number of queries, accompanied however by large outliers (shown in the whiskers). This is consistent with the results by Park et al. Park et al. (2020).

5.5 Inductive Case

We now test the effectiveness of the approaches in the inductive case, where test queries solely consist of entities not seen during training. Note that this is the worst-case scenario of inductivity, as typically some entities might have been seen before. The inductive setting is of utmost importance for production environments, as it entails that the models can be trained using query samples containing a (proper) subset of the KG entities. On the contrary, models would require query samples that involve all the KG entities, which makes the deployment of such models rather unpractical.

We conducted this experiment on star queries for SWDF and YAGO and proceeded as follows. For each dataset, the query set was partitioned into two disjoint sets (a training- and a test), such that the entities in both sets were mutually exclusive, i.e., each query in one set did not involve any entity present in the other set (the predicates, however, were known, as is the usual setting for inductivity). Additionally, we removed all triples involving those entities from the input knowledge graph prior to building the index structures for summary- and sampling-based methods in G-CARE. In the case of GNCE and LSS, no embeddings and occurrences were provided as featurization for the entities in the test set. Instead, a random (but fixed) vector was used to represent those entities. Those procedures ensured that the entities in the disjoint test set were never seen by any of the methods, including the statistics/sampling-builder, the embedding generator, or the neural networks.

Table 7: Mean q-Error for full inductive case on star queries
Dataset: SWDF
Q𝒢subscriptnorm𝑄𝒢||Q||_{\mathcal{G}}| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT GNCE LMKG LSS Wanderjoin impr jsub CSET
<5absent5<5< 5 5.23 3.06 1.74 2.82 2.82 2.82 9.73
<52absentsuperscript52<5^{2}< 5 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 3.16 2.88 4.26 12.51 12.51 12.51 11.68
<53absentsuperscript53<5^{3}< 5 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT 3.26 7.16 15.57 54.81 54.81 54.81 37.15
<54absentsuperscript54<5^{4}< 5 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT 4.16 29.92 89.02 264.14 274.87 284.91 176.37
<55absentsuperscript55<5^{5}< 5 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 7.98 159.26 596.94 1008.72 1010.95 1011.22 538.91
<56absentsuperscript56<5^{6}< 5 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 11.23 575.22 3567.94 1649.89 1655.85 1657.48 779.65
<59absentsuperscript59<5^{9}< 5 start_POSTSUPERSCRIPT 9 end_POSTSUPERSCRIPT 335.51 840.27 65214.02 14198.33 14204.75 14198.37 -
Avg 52.93 231.11 9927.07 2455.89 2459.51 2460.30 2250.30
Dataset: YAGO
Q𝒢subscriptnorm𝑄𝒢||Q||_{\mathcal{G}}| | italic_Q | | start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT GNCE LMKG LSS Wanderjoin impr jsub CSET
<5absent5<5< 5 189.48 319.94 8.58 2.51 2.51 2.51 12133.80
<52absentsuperscript52<5^{2}< 5 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 229.71 138.35 3.94 11.60 13.55 13.55 13244.83
<53absentsuperscript53<5^{3}< 5 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT 73.98 47.19 8.10 33.96 50.27 42.74 25952.18
<54absentsuperscript54<5^{4}< 5 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT 38.18 145.65 41.76 185.62 262.63 218.33 13092.83
<55absentsuperscript55<5^{5}< 5 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT 44.89 595.18 196.23 628.17 1036.19 1012.65 64086.77
<56absentsuperscript56<5^{6}< 5 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT 86.65 2585.38 917.44 4249.70 6114.76 4277.01 119876.0
<59absentsuperscript59<5^{9}< 5 start_POSTSUPERSCRIPT 9 end_POSTSUPERSCRIPT 280.94 49462.4 19016.3 141838.6 171508.1 142063.0 145802.6
Avg 134.83 7613.44 2884.61 20992.88 25569.72 21089.98 56312.71

Then, the three learned approaches, GNCE, LMKG, LSS, are trained on the training set (with embeddings for GNCE and LSS), and evaluated on the test set (without embeddings and occurrences). The methods from G-CARE are evaluated on the test set using the statistics built with the KG excluding the entities in this test set.

The mean q-Errors are presented in Table 7. The results show that either GNCE, LMKG, LSS or Wanderjoin perform the best for specific query sizes. Compared to the non-inductive case in Figure 4(a), all methods experience a performance deterioration, which is expected since statistics about the entities are missing. Overall, GNCE performs the best.

In summary, these results show that GNCE can be used in the inductive case involving new entities, without frequent retraining.

5.6 Runtime Efficiency

5.6.1 Prediction Time (Online).

We study the runtime of cardinality predictions for the different queries from the smallest and largest datasets, i.e., SWDF and YAGO. To measure the runtime we use the time library Foundation (2021) in Python. For the learned techniques, we measure 1.) upon receiving a query the loading time of embeddings from disk and creation of the query featurization and 2.) the propagation of the query representation through the network’s forward pass, which generates the final cardinality estimate.

Refer to caption
(a) SWDF - Star
Refer to caption
(b) SWDF - Path
Refer to caption
(c) YAGO - Star
Refer to caption
(d) YAGO - Path
Figure 8: Mean prediction time [ms] of the different approaches, grouped by the number of triple patterns in the query. Confidence intervals (95%) are shown in grey bars

For the methods provided by G-CARE Park et al. (2020), we use the provided timing functionality of the different methods, which measures the total time from receiving a query to generating a cardinality estimate. For the sampling-based approaches, as per G-CARE defaults, 30 samplings are performed, and their total runtime is used. For all methods, we exclude failed cardinality predictions, i.e. cases where the cardinality estimate was 0absent0\leq 0≤ 0.

Figure 8 presents the results grouped by the number of triple patterns in the query (query size). The gray bars indicate the 95% confidence interval. Comparing the results between SWDF and YAGO, we observe that the runtime of the non-learning-based approaches is impacted by the size of the dataset as well as the type of query. The sample-based methods, i.e. impr, jsub, and Wanderjoin, show higher efficiency only on the smaller SWDF dataset. The performance of CSET is similar to the sample-based approaches, as they require several accesses to (large) summaries to perform the predictions. We note that no runtimes for impr are reported for query sizes 5 and 8, since all cardinality estimates failed for those sizes. The model prediction times of the learned approaches exhibit stable behavior across the datasets, query sizes, and types. This is observed in both the mean values as well as the narrow confidence interval indicating almost no variations in runtime. The loading time for GNCE and LSS increases with the size of the query. This is expected since more embeddings need to be loaded from disk. We observe that the GNN runtimes of GNCE exhibit the best performance from the learned approaches and outperform the sampling/statistics-based methods. We further note that the majority of the total runtime for GNCE is due to embedding loading and query graph featurization, but overall still runs within 1-2 ms. Using efficient storage and compiled code can reduce GNCE runtime significantly. Overall, LMKG displays the best runtime of the methods. On the larger dataset, the runtimes of the learned methods are several orders of magnitude faster than the summary/statistics-based methods. Fast runtime within the single-digit millisecond range is a desired result since a very fast runtime is essential for a cardinality estimator to run, for example, within a query optimizer.

5.6.2 Building Model/Statistics Time (Offline).

All the studied methods need to build/train a model or build statistics/samples prior to being used as a cardinality estimator. Hence, we now measure the time it takes to do this preparation. We evaluate this using YAGO, but present building/training (refer in the remainder as training) times per atom, in order to easily interpolate between methods and to other graphs. The learned methods are evaluated with a Batch Size of 1, and a total of 20 training epochs.

Table 8: Embedding and training times per atom (in ms)
GNCE LMKG LSS Wanderjoin impr jsub CSET
Embedding 6.40 - 3.94 - - - -
Training 2.52 1.16 114.03 0.00019 0.00020 0.00019 0.00022

Table 8 shows the results of this study. Comparing the different values for the runtimes in the test case and the training times, we can see that the training times for sampling/summary-based methods are significantly faster than the learning-based methods, while the test times are faster for the learning methods (on larger graphs). The intuitive explanation for this is that during NN training, lots of statistics and patterns about the KG are compressed into the embeddings and NN parameters, which takes longer but enables a faster execution time. The sampling/summary-based methods compute less statistics up front but need to perform more computation during prediction, especially if the KG is large. Longer training times for the learned approaches prior to usage can be tolerated, but a very fast execution time (especially in the case of query optimization) is required. We also want to mention that the significantly largest time for GNCE is due to the embedding calculation, which was performed on a relatively small machine. During embedding learning, most of the time is spent on random walk generation, a process that can be parallelized if the hardware allows. To put this into perspective, with the used hardware, calculating embeddings for 7 million entities (approximate size of DBpedia Lehmann et al. (2015)) would take similar-to\sim12 hours. Then, training the GNCE GNN with 100,000 queries with an average of 5 triple patterns would take similar-to\sim 63 minutes.

5.7 Ablation Study

To determine the impact of the different components of GNCE, we performed an ablation study. For that, we compared the baseline model (GNCE) that was used in the previous sections to the same model where (1) the RDF2Vec embeddings are replaced by binary encodings, and (2) the directional TPN message passing is reduced to an undirected message passing where no case distinction between incoming and outgoing edges is performed. For (1), all entities in the KG are enumerated and given an id, and the corresponding embedding of the entity is replaced by a binary representation of the id, padded with 0s to length 100100100100. For the sake of space, the study is performed only on the largest KG YAGO.

Table 9 shows that removing RDF2Vec embeddings from the query graphs deteriorates (\downarrow) performance for both star and path queries. However, the effect is much stronger for path queries. This can be explained by the fact that the predicates of the star are already very predictive of the query cardinality (similar to characteristic sets). For the same reason, removing directed message passing also only deteriorates performance on path queries.

Table 9: Ablation study results of GNCE on YAGO
Method Star Queries Path Queries
Baseline 1.96 3.78
-RDF2Vec 2.40 (\downarrow 22%) 6.14 (\downarrow 62%)
-Directed Message Passing 1.93 (\uparrow 1%) 3.87 (\downarrow 3%)

5.8 Application of GNCE to Query Optimization

As mentioned, the overarching goal of cardinality estimation is to provide estimates to the query optimizer to minimize the query runtimes. Even though query optimization is an orthogonal problem to the one tackled in this work, we demonstrate preliminary results of GNCE assisting an actual query optimizer. For this, we have chosen nLDE Acosta and Vidal (2015), an open-source SPARQL engine also implemented in Python. nLDE uses cardinality estimates of subqueries to select physical join operators while traversing the space of plans with greedy heuristics. We compared the query plans devised with the nLDE cardinality estimator vs. GNCE.

Refer to caption
(a) Star Queries
Refer to caption
(b) Path Queries
Figure 9: Query optimization results on SWDF

For this study, we extended our SWDF query set to include queries with 3-8 triple patterns. We tested 243 star queries and 258 path queries. Therein, 15% of the star queries and 14% of the path queries had different plans when using GNCE as an estimator. Queries with the same plan under both cardinality estimators had similar runtimes, as expected. Figure 9 shows the runtimes of star and path queries with different resulting plans. In general, the larger the query size, the faster the query execution using GNCE. For small star queries, GNCE does not improve the execution times. For large queries (7 and 8 triple patterns), the execution time is improved by more than 100%. As can be read from Figure 8(a), for a query with 8 triple patterns (i.e., 7 joins), it takes GNCE 14ms to produce the cardinality predictions during optimization. This is approximately 0.1% of the total execution time.

6 Conclusion and Outlook

We presented GNCE, a method for accurate cardinality estimation of conjunctive queries over Knowledge Graphs (KGs) based on KG embeddings (KGE) and Graph Neural Networks (GNN). KGE provide semantically meaningful features to the GNN, which directly predicts the cardinality of the embedding-enhanced query graph.

Our experiments show that GNCE on average outperforms the state of the art in terms of q-Error over synthetic and real-world KGs and queries. GNCE is also robust in the sense that it does not produce severe prediction errors. GNCE also generalizes well to unseen entities, making it a suitable solution for actual deployment. In that context, we also showed that our model is lightweight leading also to smaller prediction runtime below 1 ms.

In future work, we plan to integrate queries with FILTERs. This requires extensions on the query sampling and KGE random walks for better capturing underlying range distributions, and the query featurization to represent the filter operation and the value. Further, we will investigate special treatments of numerical and text literals with more meaningful embeddings. Incorporating active learning, by means of an adaptive sampling strategy with respect to poorly performing queries, with the goal of accelerating the GNN training, is another promising direction. Lastly, a custom KGE approach tailored to cardinality estimation will be investigated.

7 Acknowledgements

The authors would like to thank Joscha Klimpel and Prof. Dr. Hanno Gottschalk for their valuable input in the preparation of the paper.

References

  • Cyganiak et al. [2014] Richard Cyganiak, David Wood, and Markus Lanthaler. RDF 1.1 concepts and abstract syntax. W3C recommendation, W3C, February 2014. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.
  • Noy et al. [2019] Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. Industry-scale knowledge graphs: Lessons and challenges. Commun. ACM, 62(8):36–43, jul 2019. ISSN 0001-0782. doi:10.1145/3331166. URL https://doi.org/10.1145/3331166.
  • Prud’hommeaux and Seaborne [2008] Eric Prud’hommeaux and Andy Seaborne. SPARQL query language for RDF. W3C recommendation, W3C, January 2008. https://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
  • Neumann and Moerkotte [2011] Thomas Neumann and Guido Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan, editors, Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 984–994. IEEE Computer Society, 2011. doi:10.1109/ICDE.2011.5767868. URL https://doi.org/10.1109/ICDE.2011.5767868.
  • Stefanoni et al. [2018] Giorgio Stefanoni, Boris Motik, and Egor V. Kostylev. Estimating the cardinality of conjunctive queries over rdf data using graph summarisation. pages 1043–1052. Association for Computing Machinery, Inc, 4 2018. ISBN 9781450356398. doi:10.1145/3178876.3186003.
  • Davitkova et al. [2022] Angjela Davitkova, Damjan Gjurovski, and Sebastian Michel. Lmkg: Learned models for cardinality estimation in knowledge graphs. 2022. doi:10.48786/edbt.2022.07. URL https://www.w3.org/TR/sparql11-overview/.
  • Zhao et al. [2021] Kangfei Zhao, Jeffrey Xu Yu, Hao Zhang, Qiyan Li, and Yu Rong. A learned sketch for subgraph counting. pages 2142–2155. Association for Computing Machinery, 2021. doi:10.1145/3448016.3457289.
  • Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Rossi et al. [2021] Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, and Paolo Merialdo. Knowledge graph embedding for link prediction: A comparative analysis. ACM Transactions on Knowledge Discovery from Data, 15, 1 2021. ISSN 1556472X. doi:10.1145/3424672.
  • Hogan et al. [2022] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan F. Sequeda, Steffen Staab, and Antoine Zimmermann. Knowledge graphs. ACM Comput. Surv., 54(4):71:1–71:37, 2022. doi:10.1145/3447772. URL https://doi.org/10.1145/3447772.
  • Ristoski and Paulheim [2016] Petar Ristoski and Heiko Paulheim. RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference, pages 498–514. Springer, 2016.
  • Weller and Acosta [2021] Tobias Weller and Maribel Acosta. Predicting instance type assertions in knowledge graphs using stochastic neural networks. In Gianluca Demartini, Guido Zuccon, J. Shane Culpepper, Zi Huang, and Hanghang Tong, editors, CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, pages 2111–2118. ACM, 2021. doi:10.1145/3459637.3482377. URL https://doi.org/10.1145/3459637.3482377.
  • Mikolov et al. [2013] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.longhoe.net/abs/1301.3781.
  • Bronstein et al. [2021] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. 4 2021. URL http://arxiv.longhoe.net/abs/2104.13478.
  • Team [2022] PyG Team. Creating message passing networks. https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html, 2022. Accessed: 2022-11-29.
  • Wu et al. [2019] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. 1 2019. doi:10.1109/TNNLS.2020.2978386. URL http://arxiv.longhoe.net/abs/1901.00596http://dx.doi.org/10.1109/TNNLS.2020.2978386.
  • Deeds et al. [2023] Kyle Deeds, Dan Suciu, Magda Balazinska, and Walter Cai. Degree sequence bound for join cardinality estimation. In Floris Geerts and Brecht Vandevoort, editors, 26th International Conference on Database Theory, ICDT 2023, March 28-31, 2023, Ioannina, Greece, volume 255 of LIPIcs, pages 8:1–8:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.ICDT.2023.8. URL https://doi.org/10.4230/LIPIcs.ICDT.2023.8.
  • Cormode et al. [2011] Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4:1–294, 2011. ISSN 19317883. doi:10.1561/1900000004.
  • Vengerov et al. [2150] David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P Chakkappen. Join size estimation subject to filter conditions, 2150.
  • Li et al. [2016] Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join: Online aggregation via random walks. volume 26-June-2016, pages 615–629. Association for Computing Machinery, 6 2016. ISBN 9781450335317. doi:10.1145/2882903.2915235.
  • Zhao et al. [2018] Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. Random sampling over joins revisited. In Proceedings of the 2018 International Conference on Management of Data, pages 1525–1539, 2018.
  • Park et al. [2020] Yeonsu Park, Seongyun Ko, Sourav S. Bhowmick, Kyoungmin Kim, Kijae Hong, and Wook Shin Han. G-care: A framework for performance benchmarking of cardinality estimation techniques for subgraph matching. pages 1099–1114. Association for Computing Machinery, 6 2020. ISBN 9781450367356. doi:10.1145/3318464.3389702.
  • Sun et al. [2021] Ji Sun, **tao Zhang, Zhaoyan Sun, Guoliang Li, and Nan Tang. Learned cardinality estimation: A design space exploration and a comparative evaluation. volume 15, pages 85–97. VLDB Endowment, 2021. doi:10.14778/3485450.3485459.
  • Getoor et al. [2001] Lise Getoor, Ben Taskar, and Daphne Koller. Selectivity estimation using probabilistic models, 2001.
  • Tzoumas et al. [2011] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. Proceedings of the VLDB Endowment, 4:852–863, 8 2011. ISSN 21508097. doi:10.14778/3402707.3402724.
  • Halford et al. [2019] Max Halford, Philippe Saint-Pierre, and Frank Morvan. An approach based on bayesian networks for query selectivity estimation. 7 2019. doi:10.1007/978-3-030-18579-4_1. URL http://arxiv.longhoe.net/abs/1907.06295http://dx.doi.org/10.1007/978-3-030-18579-4_1.
  • Hilprecht et al. [2020] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. Deepdb: Learn from data, not from queries! volume 13, pages 992–1005. VLDB Endowment, 2020. doi:10.14778/3384345.3384349.
  • Kipf et al. [2019] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In 9th Biennial Conference on Innovative Data Systems Research, CIDR 2019, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org, 2019. URL http://cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf.
  • Woltmann et al. [2019] Lucas Woltmann, Claudio Hartmann, Maik Thiele, Dirk Habich, and Wolfgang Lehner. Cardinality estimation with local deep learning models. Association for Computing Machinery, 7 2019. ISBN 9781450368025. doi:10.1145/3329859.3329875.
  • Zhao et al. [2022] Kangfei Zhao, Jeffrey Xu Yu, Zongyan He, Rui Li, and Hao Zhang. Lightweight and accurate cardinality estimation by neural network gaussian process. In Proceedings of the 2022 International Conference on Management of Data, pages 973–987, 2022.
  • Stocker et al. [2008] Markus Stocker, Andy Seaborne, Abraham Bernstein, Christoph Kiefer, and Dave Reynolds. Sparql basic graph pattern optimization using selectivity estimation. pages 595–604, 2008. ISBN 9781605580852. doi:10.1145/1367497.1367578.
  • Shironoshita et al. [2007] E Patrick Shironoshita, Michael T Ryan, and Mansur R Kabuka. Cardinality estimation for the optimization of queries on ontologies. ACM SIGMOD Record, 36(2):13–18, 2007.
  • Neumann and Weikum [2009] Thomas Neumann and Gerhard Weikum. The rdf-3x engine for scalable management of rdf data, 2009.
  • Chen and Lui [2017] Xiaowei Chen and John C.S. Lui. Mining graphlet counts in online social networks. pages 71–80. Institute of Electrical and Electronics Engineers Inc., 1 2017. ISBN 9781509054725. doi:10.1109/ICDM.2016.99.
  • Huang and Liu [2011] Hai Huang and Chengfei Liu. Estimating selectivity for joined rdf triple patterns. pages 1435–1444, 2011. ISBN 9781450307178. doi:10.1145/2063576.2063784.
  • Zhang et al. [2019] Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. Prone: Fast and scalable network representation learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4278–4284. International Joint Conferences on Artificial Intelligence Organization, 7 2019. doi:10.24963/ijcai.2019/594. URL https://doi.org/10.24963/ijcai.2019/594.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2022] Hanchen Wang, Rong Hu, Ying Zhang, Lu Qin, Wei Wang, and Wenjie Zhang. Neural subgraph counting with wasserstein estimator. pages 160–175. Association for Computing Machinery, 6 2022. ISBN 9781450392495. doi:10.1145/3514221.3526163.
  • Hogan et al. [2019] Aidan Hogan, Cristian Riveros, Carlos Rojas, and Adrián Soto. A worst-case optimal join algorithm for SPARQL. In Chiara Ghidini, Olaf Hartig, Maria Maleshkova, Vojtech Svátek, Isabel F. Cruz, Aidan Hogan, Jie Song, Maxime Lefrançois, and Fabien Gandon, editors, The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part I, volume 11778 of Lecture Notes in Computer Science, pages 258–275. Springer, 2019. doi:10.1007/978-3-030-30793-6_15. URL https://doi.org/10.1007/978-3-030-30793-6_15.
  • Zhang et al. [2018] Wei Emma Zhang, Quan Z. Sheng, Yongrui Qin, Kerry Taylor, and Lina Yao. Learning-based SPARQL query performance modeling and prediction. World Wide Web, 21(4):1015–1035, 2018. doi:10.1007/s11280-017-0498-1. URL https://doi.org/10.1007/s11280-017-0498-1.
  • Hasan and Gandon [2014] Rakebul Hasan and Fabien Gandon. Predicting SPARQL query performance. In Valentina Presutti, Eva Blomqvist, Raphaël Troncy, Harald Sack, Ioannis Papadakis, and Anna Tordai, editors, The Semantic Web: ESWC 2014 Satellite Events - ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers, volume 8798 of Lecture Notes in Computer Science, pages 222–225. Springer, 2014. doi:10.1007/978-3-319-11955-7_23. URL https://doi.org/10.1007/978-3-319-11955-7_23.
  • [42] Daniel Casals, Carlos Buil-Aranda, and Carlos Valle. Sparql query execution time prediction using deep learning.
  • Maiorov and Pinkus [1999] Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by mlp neural networks. Neurocomputing, 25(1-3):81–91, 1999.
  • [44] Petar Ristoski and Heiko Paulheim. Rdf2vec: Rdf graph embeddings for data mining. URL http://www.rapidminer.com/.
  • Kristiadi et al. [2019] Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens Lehmann, and Asja Fischer. Incorporating literals into knowledge graph embeddings. In Chiara Ghidini, Olaf Hartig, Maria Maleshkova, Vojtech Svátek, Isabel F. Cruz, Aidan Hogan, Jie Song, Maxime Lefrançois, and Fabien Gandon, editors, The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part I, volume 11778 of Lecture Notes in Computer Science, pages 347–363. Springer, 2019. doi:10.1007/978-3-030-30793-6_20. URL https://doi.org/10.1007/978-3-030-30793-6_20.
  • Preisner and Paulheim [2023] Patryk Preisner and Heiko Paulheim. Universal preprocessing operators for embedding knowledge graphs with literals, 2023.
  • Portisch et al. [2020] Jan Portisch, Michael Hladik, and Heiko Paulheim. Rdf2vec light - A lightweight approach for knowledge graph embeddings. CoRR, abs/2009.07659, 2020. URL https://arxiv.longhoe.net/abs/2009.07659.
  • Chen et al. [2020] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 3438–3445, 2020.
  • Hu et al. [2020] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay S. Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=HJlWWJSFDH.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.longhoe.net/abs/1412.6980.
  • Möller et al. [2007] Knud Möller, Tom Heath, Siegfried Handschuh, and John Domingue. Recipes for semantic web dog food - the ESWC and ISWC metadata projects. In Karl Aberer, Key-Sun Choi, Natasha Fridman Noy, Dean Allemang, Kyung-Il Lee, Lyndon J. B. Nixon, Jennifer Golbeck, Peter Mika, Diana Maynard, Riichiro Mizoguchi, Guus Schreiber, and Philippe Cudré-Mauroux, editors, The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, volume 4825 of Lecture Notes in Computer Science, pages 802–815. Springer, 2007. doi:10.1007/978-3-540-76298-0_58. URL https://doi.org/10.1007/978-3-540-76298-0_58.
  • Guo et al. [2005] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Semant., 3(2-3):158–182, 2005. doi:10.1016/j.websem.2005.06.005. URL https://doi.org/10.1016/j.websem.2005.06.005.
  • Suchanek et al. [2008] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A large ontology from wikipedia and wordnet. Web Semantics, 6:203–217, 9 2008. ISSN 15708268. doi:10.1016/j.websem.2008.06.001.
  • Wang et al. [2021] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. KEPLER: A unified model for knowledge embedding and pre-trained language representation. Trans. Assoc. Comput. Linguistics, 9:176–194, 2021. doi:10.1162/tacl_a_00360. URL https://doi.org/10.1162/tacl_a_00360.
  • Vandewiele et al. [2022] Gilles Vandewiele, Bram Steenwinckel, Terencio Agozzino, and Femke Ongenae. pyrdf2vec: A python implementation and extension of rdf2vec. 2022. doi:10.48550/ARXIV.2205.02283. URL https://arxiv.longhoe.net/abs/2205.02283.
  • Zloch et al. [2019] Matthäus Zloch, Maribel Acosta, Daniel Hienert, Stefan Dietze, and Stefan Conrad. A software framework and datasets for the analysis of graph measures on rdf graphs. In The Semantic Web: 16th International Conference, ESWC 2019, Portorož, Slovenia, June 2–6, 2019, Proceedings 16, pages 523–539. Springer, 2019.
  • Bonifati et al. [2020] Angela Bonifati, Wim Martens, and Thomas Timm. An analytical study of large sparql query logs. The VLDB Journal, 29(2-3):655–679, 2020.
  • Foundation [2021] Python Software Foundation. The Python Standard Library - time. https://docs.python.org/3/library/time.html, 2021. [Online; accessed 27-February-2023].
  • Lehmann et al. [2015] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. Dbpedia - A large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2):167–195, 2015. doi:10.3233/SW-140134. URL https://doi.org/10.3233/SW-140134.
  • Acosta and Vidal [2015] Maribel Acosta and Maria-Esther Vidal. Networks of linked data eddies: An adaptive web query processing engine for RDF data. In Marcelo Arenas, Óscar Corcho, Elena Simperl, Markus Strohmaier, Mathieu d’Aquin, Kavitha Srinivas, Paul Groth, Michel Dumontier, Jeff Heflin, Krishnaprasad Thirunarayan, and Steffen Staab, editors, The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, volume 9366 of Lecture Notes in Computer Science, pages 111–127. Springer, 2015. doi:10.1007/978-3-319-25007-6_7. URL https://doi.org/10.1007/978-3-319-25007-6_7.