Differentiable Reasoning about Knowledge Graphs
with Region-based Graph Neural Networks

Aleksandar Pavlović¹ Emanuel Sallinger¹ Steven Schockaert²
¹TU Wien, Vienna, Austria
²Cardiff University, Cardiff, United Kingdom
{aleksandar.pavlovic, emanuel.sallinger}@tuwien.ac.at, [email protected]

Abstract

Methods for knowledge graph (KG) completion need to capture semantic regularities and use these regularities to infer plausible knowledge that is not explicitly stated. Most embedding-based methods are opaque in the kinds of regularities they can capture, although region-based KG embedding models have emerged as a more transparent alternative. By modeling relations as geometric regions in high-dimensional vector spaces, such models can explicitly capture semantic regularities in terms of the spatial arrangement of these regions. Unfortunately, existing region-based approaches are severely limited in the kinds of rules they can capture. We argue that this limitation arises because the considered regions are defined as the Cartesian product of two-dimensional regions. As an alternative, in this paper, we propose ReshufflE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches. Moreover, the embeddings in our framework can be learned by a monotonic Graph Neural Network (GNN), which effectively acts as a differentiable rule base. This approach has the important advantage that embeddings can be easily updated as new knowledge is added to the KG. At the same time, since the resulting representations can be used similarly to standard KG embeddings, our approach is significantly more efficient than existing approaches to differentiable reasoning.

1 Introduction

Knowledge graph (KG) embedding models learn geometric representations of knowledge graphs, with the aim of capturing regularities in the available knowledge. These representations can then be used to infer plausible knowledge that is not explicitly stated in the KG. An important research question is concerned with the kinds of regularities that can be captured by different kinds of models. While standard approaches are often difficult to analyse from this perspective, region-based embedding models aim to make these regularities more explicit. Essentially, in such approaches, each entity $e$ is represented by an embedding $\mathbf{e}\in\mathbb{R}^{d}$ and each relation $r$ is represented by a geometric region $X_{r}\subseteq\mathbb{R}^{2d}$ . We say that the triple $(e,r,f)$ is captured by the embedding iff $\mathbf{e}\oplus\mathbf{f}\in X_{r}$ , where we write $\oplus$ for vector concatenation. In this way, we can naturally associate a KG with a given embedding. The advantage of region-based models is that we can similarly also associate a rule base with the embedding, where the rules reflect the spatial configuration of the regions $X_{r}$ . However, not all rule bases can be captured in this way. As a simple example, models based on TransE (?) cannot distinguish between the rules $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)$ and $r_{2}(X,Y)\wedge r_{1}(Y,Z)\rightarrow r_{3}(X,Z)$ .

This particular limitation can be avoided by using more sophisticated region-based models (?; ?), but even these models remain limited in terms of which rule bases they can capture. The underlying limitation seems to be related to the fact that these models use regions which are the Cartesian product of $d$ two-dimensional regions, i.e. $X_{r}=A^{r}_{1}\times...\times A^{r}_{d}$ , with $A_{i}^{r}\subseteq\mathbb{R}^{2}$ . To check whether $(e,r,f)$ is captured, we then check whether $(e_{i},f_{i})\in A_{i}^{r}$ for each $i\in\{1,...,d\}$ , with $\mathbf{e}=(e_{1},...,e_{d})$ and $\mathbf{f}=(f_{1},...,f_{d})$ . We will refer to such approaches as coordinate-wise models. Existing models thus primarily differ in how these two-dimensional regions are defined, e.g. ExpressivE (?) uses parallelograms for this purpose, while ? (?) used octagons. While it is, in principle, possible to use more flexible region-based representations, this typically leads to overfitting. In this paper, we go beyond coordinate-wise models but aim to avoid overfitting by otherwise keeping the model as simple as possible: we essentially learn regions $X_{r}$ , which are defined in terms of ordering constraints of the form $e_{i}\leq f_{j}$ .

Our main contributions are two-fold. First, we show that, despite its simplicity, the proposed model can capture a large class of rule bases, thus overcoming some of the limitations of existing region-based models. In fact, if we only consider consequences that can be inferred using a bounded number of inference steps, our model is capable of faithfully capturing arbitrary sets of closed path rules. Second, we show that knowledge graph embeddings in our framework can be learned using a monotonic Graph Neural Network (GNN) with randomly initialised node embeddings. This GNN effectively serves as a differentiable approximation of a rule base, acting on the initial representations of the entities to ensure that they capture the consequences that can be inferred from the KG. An important practical consequence is that our KG embeddings can be efficiently updated when new knowledge becomes available. Thus, our model is particularly well suited for KG completion in the inductive setting, where we need to predict links between entities that were not seen during training. Moreover, whereas existing inductive KG completion methods tend to be computationally expensive, e.g. by requiring one (?) or even many (?) forward passes of a GNN model for each query, our approach retains the advantage of KG embeddings, where the plausibility of a triple $(e,r,f)$ can be checked almost instantaneously.

2 Related Work

Region-based Models

Despite the vast amount of work on KG embedding models in the last decade, the reasoning abilities of most existing models are poorly understood. The main exception comes from a line of work that has focused on region-based representations (?; ?; ?; ?; ?; ?). Essentially, the region-based view makes explicit what triples and rules are captured by a given embedding. This allows us to study what kinds of semantic dependencies a given model is capable of capturing, which is important for ensuring that models have the right inductive bias, especially for settings where reasoning is important. Existing work has uncovered various limitations of existing models. For instance, ? (?) revealed that bilinear models such as RESCAL (?), DistMult (?), TuckER (?) and ComplEx (?) cannot capture relation hierarchies in a faithful way. They furthermore found that models that represent relations using convex regions have inherent limitations when it comes to modelling disjointness. However, such models were found to be capable of modelling arbitrary sets of closed path rules (and even more general classes of rule bases, involving existentials in the head and relations of different arity). In practice, learning arbitrary convex polytopes is not feasible in high-dimensional spaces. Practical region-based embedding models therefore focus on much simpler classes of regions, such as Cartesian products of boxes (?), cones (?; ?), parallelograms (?) and octagons (?). This makes the models easier to learn but limits the kinds of rules that they can capture. While the use of parallelograms and octagons makes it possible to capture arbitrary closed path rules, in practice we want to capture sets of such rules. This is only known to be possible under rather restrictive conditions (see Section 3).

Inductive KG Completion

Standard benchmarks for KG completion can only evaluate the reasoning abilities of models to a limited extent. For instance, BoxE (?) achieves strong results on these benchmarks, despite provably being incapable of modelling simple rules such as $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)$ . In this paper, we will therefore instead focus on the problem of inductive KG completion (?). In the inductive setting, we need to predict links between entities that are different from those that were seen during training. In particular, there is no overlap between the entities that occur in the KG that was used for training and the one that is used for testing (although the relations are the same in both KGs). To perform this task, models need to learn semantic dependencies between the relations, and then exploit this knowledge when making predictions. This can be achieved in different ways. A natural strategy is to learn rules from the training KG, either explicitly using a model such as AnyBURL (?) or implicitly using differentiable rule learners such as Neural-LP (?) or DRUM (?). The latter essentially approximate rule applications using tensor multiplications. In practice, better results have been obtained using GNNs. For instance, some approaches (?) reduce the problem of link prediction to a graph classification problem. They first construct a subgraph containing paths connecting the head entity with some candidate tail entity, and then use a GNN to predict a score from this subgraph. Such approaches suffer from limited scalability, as answering a link prediction query requires constructing and processing such a subgraph for each candidate tail entity. NBFNet (?) alleviates this limitation, by using a single GNN that processes the entire graph. The resulting node embeddings can then be used to score the different candidate tail entities. 
However, the node embeddings are query-specific, meaning that this model still requires a new forward pass of the GNN for each query, which is considerably less efficient than using KG embeddings.

While we use a GNN for computing entity embeddings, once these embeddings have been learned, we can use them to answer arbitrary link prediction queries. Our method is thus considerably more efficient than the aforementioned GNN-based models for inductive KG completion. ReFactor GNN (?) similarly uses a GNN to learn entity embeddings, by simulating the training dynamic of traditional KG embedding methods such as TransE (?). However, their method has the disadvantage that all embeddings have to be recomputed when new triples are added to the KG. Moreover, their model inherits the limitations of traditional embedding models when it comes to faithfully modelling rules. Conceptually, our method has more in common with differentiable rule learning methods than with subgraph classification strategies. Indeed, each layer of the GNN updates the entity embeddings by essentially simulating the application of rules. Moreover, our model can simulate the deductive chaining of rules, which makes it fundamentally different from Neural-LP and DRUM, which focus on one-off rule application.

3 Problem Setting

Let $\mathcal{R}$ be a set of relations, $\mathcal{E}$ a set of entities, and $\mathcal{G}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ a knowledge graph. Similar to standard KG embedding models, our aim is to learn a vector space representation $\mathbf{e}\in\mathbb{R}^{d}$ for every entity $e\in\mathcal{E}$ and a scoring function $s_{r}$ for every relation $r\in\mathcal{R}$ such that $s_{r}(\mathbf{e},\mathbf{f})$ reflects the plausibility of the triple $(e,r,f)$ . In the case of region-based models, the scoring function $s_{r}$ is defined in terms of a geometric region $X_{r}\subseteq\mathbb{R}^{2d}$ . Specifically, the triple $(e,r,f)$ is then considered to be captured by the embedding iff $\mathbf{e}\oplus\mathbf{f}\in X_{r}$ , where we write $\mathbf{e}\oplus\mathbf{f}$ to denote vector concatenation. Accordingly, the scoring function $s_{r}$ then reflects how close $\mathbf{e}\oplus\mathbf{f}$ is to the region $X_{r}$ (which is formalised in different ways by different models).

A key advantage of region-based models is that they offer a mechanism for modelling rules. Let us write $\eta$ to denote a given region-based embedding, i.e. $\eta(e)\in\mathbb{R}^{d}$ denotes the embedding of the entity $e\in\mathcal{E}$ and $\eta(r)\subseteq\mathbb{R}^{2d}$ denotes the region representing the relation $r\in\mathcal{R}$ . Let us consider a rule $\rho$ of the following form:

$\displaystyle r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$    (1)

We say that $\eta$ captures this rule if for all vectors $\mathbf{x_{1}},...,\mathbf{x_{p+1}}\in\mathbb{R}^{d}$ we have:

$\displaystyle(\mathbf{x_{1}}\oplus\mathbf{x_{2}}\in\eta(r_{1}))\wedge...\wedge(\mathbf{x_{p}}\oplus\mathbf{x_{p+1}}\in\eta(r_{p}))\Rightarrow(\mathbf{x_{1}}\oplus\mathbf{x_{p+1}}\in\eta(r))$    (2)

Rules of the form (1) are known as closed path rules. Region-based embeddings can similarly capture other kinds of rules, such as intersection rules of the form $r_{1}(X_{1},X_{2})\wedge r_{2}(X_{1},X_{2})\rightarrow r(X_{1},X_{2})$ . However, we will specifically focus on closed path rules in this paper, due to their importance for KG completion. For instance, most rule-based methods for KG completion focus on learning rules of this type (?). Moreover, existing region-based models have particular limitations when it comes to capturing rules of this kind. Some approaches, such as BoxE (?), are not capable of capturing such rules at all. More recent approaches (?; ?) are capable of capturing individual closed path rules, but they are limited when it comes to jointly capturing a set of such rules.

Specifically, given a set of closed path rules $\mathcal{P}$ , we ideally want an embedding $\eta$ that captures every rule in $\mathcal{P}$ while not capturing any rules that are not entailed by $\mathcal{P}$ . ? (?) showed this to be possible, provided that every rule entailed from $\mathcal{P}$ is either a trivial rule such as $r(X_{1},X_{2})\rightarrow r(X_{1},X_{2})$ or a rule of the form (1) in which $r_{1},...,r_{p},r$ are all distinct relations. For instance, rules of the form $r_{1}(X_{1},X_{2})\wedge r_{1}(X_{2},X_{3})\rightarrow r(X_{1},X_{3})$ were not allowed in their construction. They also provided a counterexample, which shows that without this restriction, it is not always possible to faithfully capture rule bases with octagon embeddings (without also capturing rules that are not entailed by the given rule base). ? (?) did not study the problem of capturing sets of rules, but their model is likely to suffer from similar limitations.

In the following, we write $\mathcal{P}\cup\mathcal{G}\models(e,r,f)$ to denote that the triple $(e,r,f)$ can be entailed from the rule base $\mathcal{P}$ and the knowledge graph $\mathcal{G}$ . More precisely, we have $\mathcal{P}\cup\mathcal{G}\models(e,r,f)$ iff either $(e,r,f)\in\mathcal{G}$ or $\mathcal{P}$ contains a rule of the form (1) such that $\mathcal{P}\cup\mathcal{G}\models(e,r_{1},e_{2})$ , $\mathcal{P}\cup\mathcal{G}\models(e_{2},r_{2},e_{3})$ , …, $\mathcal{P}\cup\mathcal{G}\models(e_{p},r_{p},f)$ for some entities $e_{2},...,e_{p}$ . We furthermore write $\mathcal{P}\models\rho$ for a rule $\rho$ of the form (1) to denote that $\mathcal{P}$ entails $\rho$ w.r.t. the standard notion of entailment from propositional logic (when interpreting rules in terms of material implication). Note that while we consider both a knowledge graph $\mathcal{G}$ and a rule base $\mathcal{P}$ in our analysis, in practice only the knowledge graph $\mathcal{G}$ is given. We study whether our model is capable of capturing the rule base because this is a necessary condition to allow it to learn semantic dependencies in the form of rules.
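The recursive definition of $\mathcal{P}\cup\mathcal{G}\models(e,r,f)$ can be read as a naive forward-chaining procedure. The following is a minimal sketch under that reading, not part of the proposed model; the function name `entailed_triples`, the encoding of a rule as a `(body, head)` pair, and the toy graph are assumptions made for illustration only.

```python
def entailed_triples(rules, graph):
    """Close `graph` under closed path rules by naive forward chaining.

    rules: list of (body, head), where body is a tuple of relation names
           (r_1, ..., r_p) and head is the relation r of a rule of form (1).
    graph: set of (entity, relation, entity) triples.
    """
    closure = set(graph)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # Pairs (x, y) linked by a path whose type is a prefix of the body.
            pairs = {(x, y) for (x, r, y) in closure if r == body[0]}
            for rel in body[1:]:
                pairs = {(x, z) for (x, y) in pairs
                         for (y2, r, z) in closure if r == rel and y2 == y}
            for x, y in pairs:
                if (x, head, y) not in closure:
                    closure.add((x, head, y))
                    changed = True
    return closure


# Toy rule base: r1;r2 -> r3 and r4;r5 -> r2 (entity names are hypothetical).
rules = [(("r1", "r2"), "r3"), (("r4", "r5"), "r2")]
graph = {("a", "r1", "b"), ("b", "r4", "c"), ("c", "r5", "d")}
closure = entailed_triples(rules, graph)
```

In this toy case the triple (a, r3, d) is only derivable after (b, r2, d) has been derived, which is the kind of deductive chaining that the analysis below is concerned with.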

4 Model Description

Our aim is to develop a model that can capture a larger class of rule bases than existing region-based models. Furthermore, we want the embeddings to be defined such that they can be efficiently updated whenever new knowledge becomes available.

Ordering Constraints

The central idea is to rely on ordering constraints. Specifically, we model each relation $r$ using a region $X_{r}$ of the following form: $\mathbf{e}\oplus\mathbf{f}\in X_{r}$ iff

$\displaystyle\forall i\in I_{r}\,.\,e_{\sigma_{r}(i)}\leq f_{i}$    (3)

where $I_{r}\subseteq\{1,...,d\}$ , $\sigma_{r}:I_{r}\rightarrow\{1,...,d\}$ and we assume $\mathbf{e}=(e_{1},...,e_{d})$ and $\mathbf{f}=(f_{1},...,f_{d})$ . The following example illustrates why the use of ordering constraints is well-suited for modelling rules.

Example 1.

Consider a rule of the form $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r(X,Z)$ . This rule is captured by an embedding of the form (3) if for each $i\in I_{r}$ we have that $i\in I_{r_{2}}$ , $\sigma_{r_{2}}(i)\in I_{r_{1}}$ and $\sigma_{r_{1}}(\sigma_{r_{2}}(i))=\sigma_{r}(i)$ . Indeed, if these conditions are satisfied and we have $(e,r_{1},f)$ and $(f,r_{2},g)$ in $\mathcal{G}$ , then for each $i\in I_{r}$ we have the following constraint:

$\displaystyle e_{\sigma_{r_{1}}(\sigma_{r_{2}}(i))}\leq f_{\sigma_{r_{2}}(i)}\leq g_{i}$

Since we assumed $\sigma_{r_{1}}(\sigma_{r_{2}}(i))=\sigma_{r}(i)$ it follows that $e_{\sigma_{r}(i)}\leq g_{i}$ for every $i\in I_{r}$ and thus that the embedding captures the triple $(e,r,g)$ .
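The argument of Example 1 can be checked mechanically on a small instance. The sketch below assumes 0-based coordinates and encodes each relation as a dictionary whose keys form $I_{r}$ and whose values give $\sigma_{r}$; the three relations and the helper names are hypothetical. The brute-force loop only tests the implication on 0/1 vectors, so it illustrates the argument rather than proving it.

```python
from itertools import product


def captures(e, sigma, f):
    """Constraint (3): e[sigma[i]] <= f[i] for every i in I_r."""
    return all(e[src] <= f[i] for i, src in sigma.items())


def composition_condition(sigma1, sigma2, sigma):
    """Sufficient condition of Example 1 for r1(X,Y), r2(Y,Z) -> r(X,Z)."""
    return all(
        i in sigma2 and sigma2[i] in sigma1 and sigma1[sigma2[i]] == sigma[i]
        for i in sigma
    )


# Hypothetical toy relations in d = 3 dimensions.
sigma_r1 = {1: 0}   # e_0 <= f_1
sigma_r2 = {2: 1}   # f_1 <= g_2
sigma_r = {2: 0}    # e_0 <= g_2
sigma_bad = {2: 1}  # e_1 <= g_2, which does not follow from the body

condition_holds = composition_condition(sigma_r1, sigma_r2, sigma_r)
condition_fails = composition_condition(sigma_r1, sigma_r2, sigma_bad)

# Brute-force check of the implication over all triples of 0/1 vectors.
vectors = list(product([0, 1], repeat=3))
rule_holds = all(
    captures(e, sigma_r, g)
    for e, f, g in product(vectors, repeat=3)
    if captures(e, sigma_r1, f) and captures(f, sigma_r2, g)
)
```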

We will come back to the analysis of how rules can be modelled using ordering constraints in the next section. We now turn our focus to how (a differentiable approximation of) the ordering constraints can be learned. Note that we can characterise (3) as follows:

$\displaystyle\max(\mathbf{A_{r}}\mathbf{e},\mathbf{f})=\mathbf{f}$    (4)

where the maximum is applied component-wise and the matrix $\mathbf{A_{r}}\in\mathbb{R}^{d\times d}$ is constrained such that (i) all components are either 0 or 1 and (ii) at most one component in each row is non-zero. This characterisation suggests how our embeddings can be learned using a GNN, as we explain next.
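To make the link between (3) and (4) concrete, the following sketch builds $\mathbf{A_{r}}$ from a hypothetical pair $(I_{r},\sigma_{r})$ and compares both conditions on a small grid of vectors. It assumes 0-based indices and non-negative coordinates; the latter matters because a row of $\mathbf{A_{r}}$ with no non-zero entry contributes the value 0 to $\mathbf{A_{r}}\mathbf{e}$, which is only harmless when $f_{i}\geq 0$.

```python
from itertools import product


def matrix_from_sigma(sigma, d):
    """Build the 0/1 matrix A_r with A_r[i][sigma[i]] = 1 for i in I_r."""
    A = [[0] * d for _ in range(d)]
    for i, src in sigma.items():
        A[i][src] = 1
    return A


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def satisfies_eq4(A, e, f):
    """Condition (4): the component-wise maximum of A e and f equals f."""
    return [max(a, b) for a, b in zip(matvec(A, e), f)] == list(f)


def satisfies_eq3(sigma, e, f):
    """Condition (3): e[sigma[i]] <= f[i] for every i in I_r."""
    return all(e[src] <= f[i] for i, src in sigma.items())


sigma = {0: 2, 2: 1}          # hypothetical: e_2 <= f_0 and e_1 <= f_2
A = matrix_from_sigma(sigma, 3)

# Compare (3) and (4) on every pair of vectors with entries in {0, 1, 2}.
grid = list(product([0, 1, 2], repeat=3))
agree = all(satisfies_eq3(sigma, e, f) == satisfies_eq4(A, e, f)
            for e in grid for f in grid)
```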

Learning Embeddings with GNNs

Let us write $\mathbf{e^{(l)}}\in\mathbb{R}^{d}$ for the representation of entity $e$ in layer $l$ of the GNN. The embeddings $\mathbf{e^{(0)}}$ are initialised randomly, ensuring that all coordinates are non-negative, the coordinates of different entity embeddings are sampled independently, and there are at least two distinct values that have a non-zero probability of being sampled for each coordinate. We use a simple message-passing GNN of the following form:

$\displaystyle\mathbf{f^{(l+1)}}=\max\big(\{\mathbf{f^{(l)}}\}\cup\{\mathbf{A_{r}}\mathbf{e^{(l)}}\,|\,(e,r,f)\in\mathcal{G}\}\big)$    (5)

where the matrices $\mathbf{A_{r}}$ are constrained as before. Due to this constraint, the GNN converges after a finite number of steps $m$ to embeddings $\mathbf{f^{(m)}}$ satisfying $\mathbf{f^{(m)}}=\max(\mathbf{f^{(m)}},\mathbf{A_{r}}\mathbf{e^{(m)}})$ for each $(e,r,f)\in\mathcal{G}$ .
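A minimal sketch of update (5), iterated until a fixpoint is reached, is given below. It uses plain Python lists and fixed initial embeddings so that the run is deterministic, whereas the model itself initialises the embeddings randomly; the matrices, entity names and helper names are illustrative assumptions.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def gnn_layer(emb, graph, mats):
    """One application of update (5) to every entity."""
    new = {ent: list(vec) for ent, vec in emb.items()}
    for e, r, f in graph:
        message = matvec(mats[r], emb[e])      # uses the layer-l embeddings
        new[f] = [max(a, b) for a, b in zip(new[f], message)]
    return new


def run_to_fixpoint(emb, graph, mats, max_layers=100):
    """Iterate (5) until nothing changes; returns (embeddings, #updates)."""
    for layer in range(max_layers):
        nxt = gnn_layer(emb, graph, mats)
        if nxt == emb:
            return emb, layer
        emb = nxt
    raise RuntimeError("no fixpoint reached within max_layers")


mats = {
    "r1": [[0, 0, 0], [1, 0, 0], [0, 0, 0]],   # encodes e_0 <= f_1
    "r2": [[0, 0, 0], [0, 0, 0], [0, 1, 0]],   # encodes e_1 <= f_2
}
graph = [("a", "r1", "b"), ("b", "r2", "c")]
# Fixed rather than random initial embeddings, to keep the run deterministic.
emb0 = {"a": [1, 0, 0], "b": [0, 0, 0], "c": [0, 0, 0]}
emb, steps = run_to_fixpoint(emb0, graph, mats)
```

Because the update only ever takes maxima over values that are already present in the initial embeddings, each coordinate can change only finitely often, which is why the loop terminates here.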

Representing Entities as Matrices

Since the model relies on randomly initialised entity embeddings, the dimensionality of the entity embeddings needs to be sufficiently high. At the same time, the number of parameters that have to be learned for each relation should be sufficiently low to prevent overfitting. For this reason, we learn matrices $\mathbf{A_{r}}$ of the following form:

$\displaystyle\mathbf{A_{r}}=\mathbf{B_{r}}\otimes\mathbf{I_{k}}$    (6)

where we write $\otimes$ for the Kronecker product, $\mathbf{I_{k}}$ is the $k$ -dimensional identity matrix and $\mathbf{B_{r}}$ is an $\ell\times\ell$ matrix, with $d=k\ell$ . To make the computation of the GNN updates more efficient, we can then represent each entity using a matrix $\mathbf{Z_{e}^{(l)}}\in\mathbb{R}^{\ell\times k}$ and compute updates as follows:

$\displaystyle\mathbf{Z_{f}^{(l+1)}}=\max\big(\{\mathbf{Z_{f}^{(l)}}\}\cup\{\mathbf{B_{r}}\mathbf{Z_{e}^{(l)}}\,|\,(e,r,f)\in\mathcal{G}\}\big)$    (7)

It is easy to verify that this model is equivalent to (5) when each matrix $\mathbf{A_{r}}$ is constrained to be of the form (6). Specifically, the matrix $\mathbf{Z_{e}^{(l)}}=(z_{ij})$ corresponding to the entity embedding $\mathbf{e^{(l)}}=(e_{1},...,e_{d})$ is defined as $z_{ij}=e_{(i-1)k+j}$ , with $i\in\{1,...,\ell\}$ and $j\in\{1,...,k\}$ . Note that a triple $(e,r,f)$ is then supported by the embeddings at layer $l$ if:

$\displaystyle\mathbf{B_{r}}\mathbf{Z_{e}^{(l)}}\preceq\mathbf{Z_{f}^{(l)}}$

where $\mathbf{X}\preceq\mathbf{Y}$ denotes that $\max(\mathbf{X},\mathbf{Y})=\mathbf{Y}$ .
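The equivalence between the vector form (5) and the matrix form (7) under constraint (6) can be checked numerically on a small case. The sketch below is only a sanity check with hypothetical values for $\mathbf{B_{r}}$ and $\mathbf{Z_{e}}$; it uses 0-based indices, so the entry $z_{ij}$ of the text sits at position `i * k + j` of the flattened vector.

```python
def kron_with_identity(B, k):
    """Return A = B (Kronecker product) I_k as a nested list, following (6)."""
    ell = len(B)
    A = [[0] * (ell * k) for _ in range(ell * k)]
    for i in range(ell):
        for i2 in range(ell):
            for j in range(k):
                A[i * k + j][i2 * k + j] = B[i][i2]
    return A


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul(B, Z):
    return [[sum(B[i][t] * Z[t][j] for t in range(len(Z)))
             for j in range(len(Z[0]))] for i in range(len(B))]


def flatten(Z):
    """Row-major flattening of an ell x k matrix into a vector of length d."""
    return [x for row in Z for x in row]


B = [[0, 0], [1, 0]]            # hypothetical 2 x 2 relation matrix
Z = [[1, 0, 1], [0, 1, 0]]      # entity matrix with ell = 2 and k = 3
A = kron_with_identity(B, 3)

vector_form = matvec(A, flatten(Z))      # A_r e, as used in (5)
matrix_form = flatten(matmul(B, Z))      # B_r Z_e, as used in (7)
```

The matrix form touches an $\ell\times\ell$ matrix instead of a $d\times d$ one, which is the efficiency gain referred to in the text.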

Model Details

To learn the matrix $\mathbf{B_{r}}$ , we choose each row $i$ as the first $\ell$ coordinates of the vector $\mathsf{softmax}(b^{r}_{i,1},...,b^{r}_{i,\ell+1})$ , where $b^{r}_{i,1},...,b^{r}_{i,\ell+1}$ are learnable parameters. Note that we need $\ell+1$ parameters for this softmax operation to allow for the possibility of some rows to be all 0s. Furthermore, note that while we conceptually think of $\mathbf{B_{r}}$ as binary matrices, in practice, we need to approximate such matrices to make learning possible. To initialise the entity embeddings, we set each coordinate to 0 or 1, with 50% probability. To train the model, we use the following scoring function for a given triple $(e,r,f)$ :

$\displaystyle s(e,r,f)=-\|\mathsf{ReLU}(\mathbf{B_{r}}\,\mathbf{Z_{e}^{(m)}}-\mathbf{Z_{f}^{(m)}})\|_{2}$

where $m$ denotes the number of GNN layers. Note that $s(e,r,f)$ reaches its maximal value of 0 iff $\mathbf{B_{r}}\mathbf{Z_{e}^{(m)}}\preceq\mathbf{Z_{f}^{(m)}}$ . For each $(e,r,f)\in\mathcal{G}$ we add an inverse triple $(f,r_{\textit{inv}},e)$ to $\mathcal{G}$ . For each entity $e$ , we also add the triple $(e,\textit{eq},e)$ to $\mathcal{G}$ , which corresponds to the common practice of adding self-loops to the GNN. Following the literature (?; ?), ReshufflE’s training process uses negative sampling under the partial completeness assumption (PCA) (?), i.e., for each training triple $(e,r,f)\in\mathcal{G}$ , $N$ triples (negative samples) are created by replacing $e$ or $f$ in $(e,r,f)$ by randomly sampled entities $e^{\prime},f^{\prime}\in\mathcal{E}$ . To train ReshufflE, we minimise the margin ranking loss, defined as follows:

$\displaystyle L(e,r,f)=\sum^{N}_{i=1}\max(0,s(e_{i}^{\prime},r,f_{i}^{\prime})-s(e,r,f)+\lambda)$    (8)

where $(e_{i}^{\prime},r,f_{i}^{\prime})$ is the $i$-th negative sample and $\lambda>0$ is a hyper-parameter, called the margin. At an intuitive level, the margin ranking loss pushes scores of true triples (i.e., those within the training graph) to be larger by at least $\lambda$ than the scores of triples that are likely false (i.e., negative samples).
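The pieces described in this subsection can be put together in a short sketch: the softmax parameterisation of $\mathbf{B_{r}}$ with $\ell+1$ logits per row, the scoring function, and the margin ranking loss (8). The sketch assumes that the norm in the scoring function is the entrywise 2-norm of the matrix, and all logits, entity matrices and function names are hypothetical. No gradients are computed; an actual implementation would rely on an automatic differentiation library.

```python
import math


def relation_matrix(logits):
    """Rows of B_r: first ell coordinates of a softmax over ell + 1 logits."""
    B = []
    for row in logits:
        top = max(row)
        exps = [math.exp(x - top) for x in row]   # shifted for stability
        total = sum(exps)
        B.append([x / total for x in exps[:-1]])  # drop the extra coordinate
    return B


def matmul(B, Z):
    return [[sum(B[i][t] * Z[t][j] for t in range(len(Z)))
             for j in range(len(Z[0]))] for i in range(len(B))]


def score(B, Z_e, Z_f):
    """s(e, r, f) = -|| ReLU(B_r Z_e - Z_f) ||, with the entrywise 2-norm."""
    BZ = matmul(B, Z_e)
    squared = sum(max(0.0, a - b) ** 2
                  for row_a, row_b in zip(BZ, Z_f)
                  for a, b in zip(row_a, row_b))
    return -math.sqrt(squared)


def margin_ranking_loss(pos_score, neg_scores, margin):
    """Loss (8) for one training triple and its N negative samples."""
    return sum(max(0.0, s_neg - pos_score + margin) for s_neg in neg_scores)


logits = [[-10.0, -10.0, 10.0],    # row 0 is close to all zeros
          [10.0, -10.0, -10.0]]    # row 1 is close to (1, 0)
B = relation_matrix(logits)
Z_e = [[1.0, 0.0], [0.0, 0.0]]
Z_f = [[0.0, 0.0], [1.0, 1.0]]     # nearly satisfies B Z_e <= Z_f entrywise
Z_neg = [[0.0, 0.0], [0.0, 0.0]]   # violates the constraint in one entry

s_pos = score(B, Z_e, Z_f)
s_neg = score(B, Z_e, Z_neg)
loss = margin_ranking_loss(s_pos, [s_neg], margin=2.0)
```

Using $\ell+1$ logits lets the softmax mass of a row move to the discarded coordinate, which is how a row of $\mathbf{B_{r}}$ can approach the all-zero row.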

5 Constructing GNNs from Rule Graphs

Consider a finite set $\mathcal{P}$ of closed path rules of the form (1). We now study the following question: Can parameters be found for the proposed GNN model (i.e. the matrices $\mathbf{B_{r}}$ ) such that the rules in $\mathcal{P}$ are captured, and no rules which are not entailed by $\mathcal{P}$ ? Rather than constructing the matrices $\mathbf{B_{r}}$ directly, we first introduce the notion of a rule graph, which will serve as a convenient abstraction of the considered GNNs. We then explain how we can construct the matrices $\mathbf{B_{r}}$ from a given rule graph. Throughout this paper, we will assume that $\mathcal{G}$ contains the triple $(e,\textit{eq},e)$ for every $e\in\mathcal{E}$ . We will also assume that the relation eq does not appear in the rule base $\mathcal{P}$ .

Rule Graphs

We will encode the rule base $\mathcal{P}$ as a labelled multi-graph $\mathcal{H}$ , i.e. a set of triples $(n_{1},r,n_{2})$ . Note that this graph is formally equivalent to a knowledge graph, but the nodes in this case do not correspond to entities. A path in $\mathcal{H}$ from $n_{1}$ to $n_{p+1}$ is a sequence of triples of the form $(n_{1},r_{1},n_{2}),(n_{2},r_{2},n_{3}),...,(n_{p},r_{p},n_{p+1})$ . The type of this path is given by the sequence of relations $r_{1};r_{2};...;r_{p}$ . The eq-reduced type of the path is obtained by removing all occurrences of the relation eq in $r_{1};r_{2};...;r_{p}$ . For instance, for a path of type $r_{1};\textit{eq};\textit{eq};r_{2};\textit{eq}$ , the eq-reduced type is $r_{1};r_{2}$ .
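As a small illustration of these notions, the sketch below represents a path as a list of triples and computes its type and eq-reduced type. The node names are hypothetical; the path reproduces the example of type $r_{1};\textit{eq};\textit{eq};r_{2};\textit{eq}$ given above.

```python
def is_path(triples):
    """True if consecutive triples share the intermediate node."""
    return all(a[2] == b[0] for a, b in zip(triples, triples[1:]))


def path_type(triples):
    """Sequence of relation labels along the path."""
    return [r for (_, r, _) in triples]


def eq_reduced_type(triples):
    """Type of the path with every occurrence of eq removed."""
    return [r for r in path_type(triples) if r != "eq"]


# Hypothetical path of type r1; eq; eq; r2; eq (node names are illustrative).
path = [("n1", "r1", "n2"), ("n2", "eq", "n3"), ("n3", "eq", "n4"),
        ("n4", "r2", "n5"), ("n5", "eq", "n6")]
reduced = eq_reduced_type(path)
```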

Definition 1.

A rule graph $\mathcal{H}$ for a given rule base $\mathcal{P}$ is a labelled multi-graph, where the labels are taken from $\mathcal{R}$ , such that the following properties are satisfied:

(R1): For every relation $r\in\mathcal{R}$ , there is some edge in $\mathcal{H}$ labelled with $r$ .
(R2): For every node $n$ in $\mathcal{H}$ and every $r\in\mathcal{R}$ , it holds that $n$ has at most one incoming edge labelled with $r$ .
(R3): Suppose there is an edge in $\mathcal{H}$ with label $r$ from node $n_{1}$ to node $n_{2}$ . Suppose furthermore that $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$ . Then there is a path in $\mathcal{H}$ from $n_{1}$ to $n_{2}$ whose eq-reduced type is $r_{1};...;r_{p}$ .
(R4): Suppose for every two nodes connected by an edge with label $r$ , there is a path connecting these two nodes whose eq-reduced type belongs to $\{(r_{11};...;r_{1p_{1}}),...,(r_{q1};...;r_{qp_{q}})\}$ . Then there is some $i\in\{1,...,q\}$ such that $\mathcal{P}\models r_{i1}(X_{1},X_{2})\wedge...\wedge r_{ip_{i}}(X_{p_{i}},X_{p_{i}+1})\rightarrow r(X_{1},X_{p_{i}+1})$ .

This definition reflects the fact that a rule is captured when the ordering constraints associated with its body entail the ordering constraints associated with its head, as was illustrated in Example 1. Specifically, this requirement is captured by condition (R3). Condition (R4) is needed to ensure that only the rules in $\mathcal{P}$ are captured. Conditions (R1) and (R2) are needed because, in the construction we consider below, the nodes of the rule graph will correspond to the rows of the matrices $\mathbf{B_{r}}$ . Condition (R1) will then ensure that $\mathbf{B_{r}}$ contains at least one non-zero component for each relation $r$ , while (R2) will ensure that each row of $\mathbf{B_{r}}$ has at most one non-zero component.

Example 2.

Let $\mathcal{P}$ contain the following rules:

$\displaystyle r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)$
$\displaystyle r_{4}(X,Y)\wedge r_{5}(Y,Z)\rightarrow r_{2}(X,Z)$

A corresponding rule graph is shown in Figure 1.

Figure 1: Rule graph for Example 2.

Constructing GNNs

Given a rule graph $\mathcal{H}$ , we define the corresponding parameters of the GNN as follows. Specifically, we need to define the matrix $\mathbf{B_{r}}$ for every $r$ . Each node from the rule graph is associated with one row of $\mathbf{B_{r}}$ . Let $n_{1},...,n_{\ell}$ be an enumeration of the nodes in the rule graph. The corresponding matrix $\mathbf{B_{r}}=(b_{ij})$ is defined as:

$\displaystyle b_{ij}=\begin{cases}1&\text{if $\mathcal{H}$ has an $r$-edge from $n_{j}$ to $n_{i}$}\\ 0&\text{otherwise}\end{cases}$

Note that because of condition (R2), there will be at most one non-zero element in each row of $\mathbf{B_{r}}$ , in accordance with the assumptions that we made in Section 4.
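The definition of $\mathbf{B_{r}}$ translates directly into code. The sketch below uses a hypothetical three-node graph intended as a rule graph for the single rule $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)$ ; it is not the graph of Figure 1, and the function names are illustrative.

```python
def matrices_from_rule_graph(nodes, edges):
    """B_r[i][j] = 1 iff there is an r-edge from nodes[j] to nodes[i]."""
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    mats = {}
    for src, r, dst in edges:
        B = mats.setdefault(r, [[0] * size for _ in range(size)])
        B[index[dst]][index[src]] = 1
    return mats


def rows_have_at_most_one_entry(B):
    """Consequence of (R2): every row has at most one non-zero component."""
    return all(sum(row) <= 1 for row in B)


# Hypothetical toy rule graph; node names are illustrative.
nodes = ["n0", "n1", "n2"]
edges = [("n0", "r1", "n1"), ("n1", "r2", "n2"), ("n0", "r3", "n2")]
mats = matrices_from_rule_graph(nodes, edges)
```

In this toy graph the $r_{3}$ -edge from $n_{0}$ to $n_{2}$ is matched by the path of type $r_{1};r_{2}$ between the same nodes, which mirrors the composition condition of Example 1.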

The following result shows that the constructed GNN indeed captures all the rules from $\mathcal{P}$ . Specifically, we show that the embeddings which are learned by the GNN (upon convergence) capture all triples that are entailed by $\mathcal{P}\cup\mathcal{G}$ .

Proposition 1.

Let $\mathcal{P}$ be a rule base and $\mathcal{G}$ a knowledge graph. Suppose $\mathcal{P}\cup\mathcal{G}\models(a,r,b)$ . Let $\mathcal{H}$ be a rule graph for $\mathcal{P}$ and let $\mathbf{Z_{e}^{(l)}}$ be the entity representations that are learned by the corresponding GNN. Assume $\mathbf{Z_{e}^{(m)}}=\mathbf{Z_{e}^{(m+1)}}$ for every entity $e$ ( $m\in\mathbb{N}$ ). It holds that $\mathbf{B_{r}}\mathbf{Z_{a}^{(m)}}\preceq\mathbf{Z_{b}^{(m)}}$ .

We also need to show that the GNN does not capture rules which are not entailed by $\mathcal{P}$ . Because the entity embeddings are initialised randomly, there is always a chance that a given triple $(e,r,f)$ is captured by the learned embeddings, even if $\mathcal{P}\cup\mathcal{G}\not\models(e,r,f)$ . However, by choosing $k$ to be sufficiently large, we can make the probability of this happening arbitrarily small.

Proposition 2.

Let $\mathcal{P}$ be a rule base and $\mathcal{G}$ a knowledge graph. Let $\mathcal{H}$ be a rule graph for $\mathcal{P}$ and let $\mathbf{Z_{e}^{(l)}}$ be the entity representations that are learned by the corresponding GNN. For any $\varepsilon>0$ , there exists some $k_{0}\in\mathbb{N}$ such that, when $k\geq k_{0}$ , for any $m\in\mathbb{N}$ and $(a,r,b)\in\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ such that $\mathcal{P}\cup\mathcal{G}\not\models(a,r,b)$ , we have

$\displaystyle\textit{Pr}[\mathbf{B_{r}}\mathbf{Z_{a}^{(m)}}\preceq\mathbf{Z_{b}^{(m)}}]\leq\varepsilon$

6 Constructing Rule Graphs

An important question is whether it is always possible, given a set of closed path rules $\mathcal{P}$ , to construct a corresponding rule graph satisfying conditions (R1)–(R4). For rule bases where a relation appearing in the head of a rule never appears in the body of any rule, this is clearly the case. The following example illustrates how rule graphs can sometimes be constructed for rule bases which encode cyclic dependencies between the relations.

Example 3.

Let $\mathcal{P}$ contain the following rules:

$\displaystyle r_{2}(X,Y)\wedge r_{3}(Y,Z)\rightarrow r_{1}(X,Z)$
$\displaystyle r_{1}(X,Y)\wedge r_{4}(Y,Z)\rightarrow r_{2}(X,Z)$

A corresponding rule graph is shown in Figure 2.

Figure 2: Rule graph for Example 3.

However, there exist rule bases for which no valid rule graph can be found. This is illustrated in the next example.

Example 4.

Let $\mathcal{P}$ contain the following rule:

$\displaystyle r_{1}(X,Y)\wedge r_{2}(Y,Z)\wedge r_{1}(Z,U)\rightarrow r_{2}(X,U)$

To see why this rule base cannot be modelled using a rule graph, consider the following knowledge graph $\mathcal{G}$ :

$\displaystyle\mathcal{G}=\{(x_{1},r_{1},x_{2}),(x_{2},r_{1},x_{3}),...,(x_{l-1},r_{1},x_{l}),(x_{l},r_{2},x_{l+1}),(x_{l+1},r_{1},x_{l+2}),...,(x_{k},r_{1},x_{k+1})\}$

We have that $\mathcal{P}\cup\mathcal{G}\models(x_{1},r_{2},x_{k+1})$ only if the number of repetitions of $r_{1}$ at the start of the sequence matches the number of repetitions at the end. However, this requirement cannot be encoded using a rule graph.

The argument from the previous example can be formalised as follows. Let $\mathcal{P}$ be a set of closed path rules. Let $\mathcal{R}_{1}$ be the set of relations from $\mathcal{R}$ that appear in the head of some rule in $\mathcal{P}$ . For any $r\in\mathcal{R}_{1}$ , we can consider a context-free grammar with two types of production rules:

• For each rule of the form (1), there is a production rule $r\Rightarrow r_{1}r_{2}...r_{p}$ .
• For each $r\in\mathcal{R}_{1}$ , there is a production rule $r\Rightarrow\overline{r}$ .

The elements of $(\mathcal{R}\setminus\mathcal{R}_{1})\cup\{\overline{r}\,|\,r\in\mathcal{R}_{1}\}$ are viewed as terminal symbols, those in $\mathcal{R}_{1}$ are seen as non-terminal symbols, and $r$ is used as the starting symbol. Let us write $L_{r}$ for the corresponding language.

Proposition 3.

Let $\mathcal{P}$ be a set of closed path rules and suppose that there exists a rule graph $\mathcal{H}$ for $\mathcal{P}$ . Let $\mathcal{R}_{1}$ be the set of relations that appear in the head of some rule in $\mathcal{P}$ . It holds that the language $L_{r}$ is regular for every $r\in\mathcal{R}_{1}$ .

This result shows that we cannot capture arbitrary rule bases using rule graphs. For instance, for the rule base from Example 4, we have $L_{r_{2}}=\{r_{1}^{(l)}\overline{r}_{2}r_{1}^{(l)}\,|\,l\in\mathbb{N}\setminus\{0\}\}$ , where we write $x^{(l)}$ for the string that consists of $l$ repetitions of $x$ . It is well-known that the language $L_{r_{2}}$ is not regular, hence it follows from Proposition 3 that no rule graph exists for this rule base. We address this issue in two different ways. First, in Section 6.1, we introduce a construction for a special class of rule bases, inspired by regular grammars. Second, in Section 6.2, we focus on the practically important setting of bounded inference: since GNNs use a fixed number of layers in practice, what mostly matters is what can be derived in a bounded number of steps. It turns out that if we only care about such inferences, we can capture arbitrary sets of closed path rules.

6.1 Left-Regular Rule Bases

We now introduce the notion of a left-regular rule base, which closely corresponds to the notion of left-regular grammar. As we will see, for left-regular rule bases we can always construct a valid rule graph. This, in turn, means that our model is capable of faithfully capturing such rule bases.

Definition 2.

Let $\mathcal{P}$ be a rule base. Let $\mathcal{R}_{1}$ be the set of relations that appear in the head of a rule from $\mathcal{P}$ . We call $\mathcal{P}$ left-regular if every rule is of the following form:

r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)\qquad(9)

such that $r_{2}\notin\mathcal{R}_{1}$ .

Note that even though we only consider rules of the form (9) for the purpose of the construction below, rules with more than two atoms can straightforwardly be simulated by introducing fresh relations. Given a left-regular rule base $\mathcal{P}$ , we construct the corresponding rule graph $\mathcal{H}$ as follows.

1.

We add the node $n_{0}$ .
2.

For each relation $r\in\mathcal{R}$ , we add a node $n_{r}$ , and we connect $n_{0}$ to $n_{r}$ with an $r$ -edge.
3.

For each rule of the form (9), we add an $r_{2}$ -edge from $n_{r_{1}}$ to $n_{r_{3}}$ .
4.

For each node $n$ with multiple incoming $r$ -edges for some $r\in\mathcal{R}$ , we do the following. Let $\sharp_{r}$ be the number of incoming $r$ -edges for node $n$ . Let $p=\max_{r\in\mathcal{R}}\sharp_{r}$ . We create fresh nodes $n_{1},...,n_{p-1}$ and add eq-edges from $n_{i}$ to $n_{i-1}$ ( $i\in\{1,...,p-1\}$ ), where we define $n_{0}=n$ . Let $r\in\mathcal{R}$ be such that $\sharp_{r}>1$ . Let $n^{\prime}_{0},...,n^{\prime}_{q}$ be the nodes with an $r$ -link to $n$ ; then we have $q\leq p-1$ . For each $i\in\{1,...,q\}$ we replace the edge from $n^{\prime}_{i}$ to $n$ by an edge from $n^{\prime}_{i}$ to $n_{i}$ .

We now illustrate the construction process with an example.

Example 5.

Let $\mathcal{P}$ contain the following rules:

	$\displaystyle r_{1}(X,Y)\wedge r_{2}(Y,Z)$	$\displaystyle\rightarrow r_{3}(X,Z)$
	$\displaystyle r_{4}(X,Y)\wedge r_{2}(Y,Z)$	$\displaystyle\rightarrow r_{3}(X,Z)$
	$\displaystyle r_{5}(X,Y)\wedge r_{2}(Y,Z)$	$\displaystyle\rightarrow r_{3}(X,Z)$

The corresponding rule graph is depicted in Figure 3. The nodes $n_{1}$ and $n_{2}$ were introduced in step 4 of the construction process. Before this step, there were $r_{2}$ -edges from $n_{r_{1}}$ , $n_{r_{4}}$ and $n_{r_{5}}$ to $n_{r_{3}}$ . The node $n_{r_{3}}$ thus had three incoming $r_{2}$ -edges, which violates condition (R2). This is addressed through the use of eq-edges in step 4.

Figure 3: Rule graph for Example 5.
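The four construction steps for left-regular rule bases can be sketched in Python as follows, applied here to the rule base of Example 5. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_left_regular_rule_graph`, the node names (such as `n_r3#1` for a fresh node) and the encoding of edges as `(source, label, target)` triples are all hypothetical choices made for this sketch.

```python
from collections import defaultdict


def build_left_regular_rule_graph(relations, rules):
    """Sketch of steps 1-4 for a left-regular rule base.

    `rules` holds triples (r1, r2, r3) encoding r1(X,Y) & r2(Y,Z) -> r3(X,Z).
    Edges are (source, label, target) triples; all names are illustrative.
    """
    heads = {r3 for _, _, r3 in rules}
    if any(r2 in heads for _, r2, _ in rules):
        raise ValueError("rule base is not left-regular")

    # Steps 1 and 2: the root node n0 and one node per relation.
    nodes = ["n0"] + [f"n_{r}" for r in relations]
    edges = {("n0", r, f"n_{r}") for r in relations}

    # Step 3: an r2-edge from n_{r1} to n_{r3} for every rule.
    edges |= {(f"n_{r1}", r2, f"n_{r3}") for r1, r2, r3 in rules}

    # Step 4: split nodes that have several incoming edges with the same label.
    for n in list(nodes):
        incoming = defaultdict(list)
        for src, label, dst in sorted(edges):
            if dst == n:
                incoming[label].append(src)
        p = max((len(srcs) for srcs in incoming.values()), default=0)
        if p <= 1:
            continue
        chain = [n] + [f"{n}#{i}" for i in range(1, p)]  # n plus fresh nodes
        nodes.extend(chain[1:])
        for i in range(1, p):
            edges.add((chain[i], "eq", chain[i - 1]))
        for label, srcs in incoming.items():
            # The first source keeps its edge to n; the others are redirected.
            for i, src in enumerate(srcs[1:], start=1):
                edges.remove((src, label, n))
                edges.add((src, label, chain[i]))
    return nodes, edges


rules = [("r1", "r2", "r3"), ("r4", "r2", "r3"), ("r5", "r2", "r3")]
nodes, edges = build_left_regular_rule_graph(["r1", "r2", "r3", "r4", "r5"], rules)
```

In this run, the two fresh nodes `n_r3#1` and `n_r3#2` play the role of $n_{1}$ and $n_{2}$ : the $r_{2}$ -edges from $n_{r_{4}}$ and $n_{r_{5}}$ are redirected to them, and eq-edges chain them back to $n_{r_{3}}$ .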

Note that the rule graph may have loops, as illustrated next.

Example 6.

Let $\mathcal{P}$ contain the following rule:

r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{1}(X,Z)

The corresponding rule graph is shown in Figure 4.

Figure 4: Rule graph for Example 6.

The proposed construction process clearly terminates after a finite number of steps. The following proposition shows that it constructs a valid rule graph for $\mathcal{P}$ .

Proposition 4.

Let $\mathcal{P}$ be a left-regular set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method. It holds that $\mathcal{H}$ satisfies (R1)–(R4).

6.2 Bounded Inference

In practice, the GNN can only carry out a finite number of inference steps. Rather than requiring that the resulting embeddings capture all triples that can be inferred from $\mathcal{P}\cup\mathcal{G}$ , it is natural to merely require that the result captures all triples that can be inferred using a bounded number of inference steps. As before, we assume that $\mathcal{P}$ contains rules of the form (9), but we no longer require that $r_{2}\notin\mathcal{R}_{1}$ . We know from Proposition 3 that it is then not always possible to construct a valid rule graph. To address this, we will weaken the notion of a rule graph, aiming to capture reasoning up to a fixed number of inference steps.

Let us write $\mathcal{P}\cup\mathcal{G}\models_{m}(e,r,f)$ to denote that $(e,r,f)$ can be derived from $\mathcal{P}\cup\mathcal{G}$ in $m$ steps. More precisely:

•

$\mathcal{P}\cup\mathcal{G}\models_{0}(e,r,f)$ iff $(e,r,f)\in\mathcal{G}$ .
•

$\mathcal{P}\cup\mathcal{G}\models_{m}(e,r,f)$ , for $m>0$ , iff $\mathcal{P}\cup\mathcal{G}\models_{m-1}(e,r,f)$ or there is a rule $r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\rightarrow r(X_{1},X_{3})$ in $\mathcal{P}$ and an entity $g\in\mathcal{E}$ such that $\mathcal{P}\cup\mathcal{G}\models_{m_{1}}(e,r_{1},g)$ and $\mathcal{P}\cup\mathcal{G}\models_{m_{2}}(g,r_{2},f)$ , with $m=m_{1}+m_{2}+1$ .
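The recursive definition of $\models_{m}$ can be made concrete with a small fixpoint computation. The following Python sketch assigns to every derivable triple the smallest number of steps with which it can be derived, and discards anything that needs more than $m$ steps. The function name `derivation_costs`, the toy knowledge graph and the encoding of rules as `(r1, r2, r)` triples are assumptions made for illustration; the rules are those of Example 7.

```python
def derivation_costs(graph, rules, m):
    """Map each triple derivable in at most m steps to its minimal step count.

    `graph` is a set of (e, r, f) triples; `rules` holds (r1, r2, r) triples
    encoding r1(X1,X2) & r2(X2,X3) -> r(X1,X3). All names are illustrative.
    """
    cost = {t: 0 for t in graph}  # base case: triples of G need 0 steps
    changed = True
    while changed:
        changed = False
        known = list(cost.items())  # snapshot for this round
        for r1, r2, r in rules:
            for (e, ra, g), m1 in known:
                if ra != r1:
                    continue
                for (g2, rb, f), m2 in known:
                    if rb != r2 or g2 != g:
                        continue
                    steps = m1 + m2 + 1  # m = m1 + m2 + 1 in the definition
                    if steps <= m and steps < cost.get((e, r, f), m + 1):
                        cost[(e, r, f)] = steps
                        changed = True
    return cost


graph = {("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d")}
rules = [("r1", "r2", "r3"), ("r3", "r1", "r2")]  # rule base of Example 7
one_step = derivation_costs(graph, rules, 1)
two_steps = derivation_costs(graph, rules, 2)
```

For this toy graph, $(a,r_{3},c)$ is derivable in one step, whereas $(a,r_{2},d)$ requires two steps because it relies on the derived triple $(a,r_{3},c)$ ; it is therefore only returned for $m=2$ .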

Definition 3.

Let $m\in\mathbb{N}$ . We call $\mathcal{H}$ an $m$ -bounded rule graph for $\mathcal{P}$ if $\mathcal{H}$ satisfies conditions (R1)–(R3) as well as the following weakening of (R4):

(R4m): Suppose for every two nodes connected by an edge with label $r$ , there is a path connecting these two nodes whose eq-reduced type belongs to $\{(r_{11};...;r_{1p_{1}}),...,(r_{q1};...;r_{qp_{q}})\}$ , with $p_{1},...,p_{q}\leq m+1$ . Then there is some $i\in\{1,...,q\}$ such that $\mathcal{P}\models_{m}r_{i1}(X_{1},X_{2})\wedge...\wedge r_{ip_{i}}(X_{p_{i}},X_{p_{i}+1})\rightarrow r(X_{1},X_{p_{i}+1})$ .

Given an $m$ -bounded rule graph, we can construct a corresponding GNN in the same way as in Section 5. Moreover, Proposition 1 remains valid for $m$ -bounded rule graphs, as its proof does not depend on (R4). Proposition 2 can be weakened as follows.

Proposition 5.

Let $\mathcal{P}$ be a rule base and $\mathcal{G}$ a knowledge graph. Let $\mathcal{H}$ be an $m$ -bounded rule graph for $\mathcal{P}$ and let $\mathbf{Z_{e}^{(l)}}$ be the entity representations that are learned by the corresponding GNN. For any $\varepsilon>0$ , there exists some $k_{0}\in\mathbb{N}$ such that, when $k\geq k_{0}$ , for any $i\leq m+1$ and $(a,r,b)\in\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ such that $\mathcal{P}\cup\mathcal{G}\not\models_{m}(a,r,b)$ , we have

\textit{Pr}[\mathbf{B_{r}}\mathbf{Z_{a}^{(i)}}\preceq\mathbf{Z_{b}^{(i)}}]\leq\varepsilon

Given a set of closed path rules $\mathcal{P}$ we can construct an $m$ -bounded rule graph as follows.

1.

We add the node $n_{0}$ .
2.

For each relation $r\in\mathcal{R}$ , we add a node $n_{r}$ , and we connect $n_{0}$ to $n_{r}$ with an $r$ -edge.
3.

We repeat the following until convergence. Let $r\in\mathcal{R}$ and assume there is an $r$ -edge from $n$ to $n^{\prime}$ . Let $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r(X,Z)$ be a rule from $\mathcal{P}$ and suppose that there is no $r_{1};r_{2}$ path connecting $n$ and $n^{\prime}$ . Suppose furthermore that the edge $(n,n^{\prime})$ is on some path from $n_{0}$ to a node $n_{r^{\prime}}$ , with $r^{\prime}\in\mathcal{R}$ whose length is at most $m$ . We add a fresh node $n^{\prime\prime}$ to the rule graph, an $r_{1}$ -edge from $n$ to $n^{\prime\prime}$ , and an $r_{2}$ -edge from $n^{\prime\prime}$ to $n^{\prime}$ .
4.

For each $r\in\mathcal{R}$ and $r$ -edge $(n,n^{\prime})$ such that for some rule $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r(X,Z)$ from $\mathcal{P}$ there is no $r_{1};r_{2}$ path connecting $n$ and $n^{\prime}$ , we do the following:

(a)

We add a fresh node $n^{\prime\prime}$ , an $r_{1}$ -edge from $n$ to $n^{\prime\prime}$ and an $r_{2}$ -edge from $n^{\prime\prime}$ to $n^{\prime}$ .

(b)

We repeat the following until convergence. For each $r^{\prime}$ -edge from $n$ to $n^{\prime\prime}$ and each rule $r_{1}^{\prime}(X,Y)\wedge r_{2}^{\prime}(Y,Z)\rightarrow r^{\prime}(X,Z)$ from $\mathcal{P}$ , we add an $r_{1}^{\prime}$ -edge from $n$ to $n^{\prime\prime}$ and an $r^{\prime}_{2}$ -loop to $n^{\prime\prime}$ (if no such edges/loops exist yet).

(c)

We repeat the following until convergence. For each $r^{\prime}$ -edge from $n^{\prime\prime}$ to $n^{\prime}$ and each rule $r_{1}^{\prime}(X,Y)\wedge r_{2}^{\prime}(Y,Z)\rightarrow r^{\prime}(X,Z)$ from $\mathcal{P}$ , we add an $r_{1}^{\prime}$ -loop to $n^{\prime\prime}$ and an $r_{2}^{\prime}$ -edge from $n^{\prime\prime}$ to $n^{\prime}$ (if no such edges/loops exist yet).

(d)

We repeat the following until convergence. For each $r^{\prime}$ -loop at $n^{\prime\prime}$ , and each rule $r_{1}^{\prime}(X,Y)\wedge r_{2}^{\prime}(Y,Z)\rightarrow r^{\prime}(X,Z)$ from $\mathcal{P}$ , we add an $r_{1}^{\prime}$ -loop and an $r_{2}^{\prime}$ -loop to $n^{\prime\prime}$ (if no such loops exist yet).
5.

For each node $n$ with multiple incoming $r$ -edges for one or more relations from $\mathcal{R}$ , we do the following. Let $\sharp_{r}$ be the number of incoming $r$ -edges for node $n$ . Let $p=\max_{r\in\mathcal{R}}\sharp_{r}$ . We create fresh nodes $n_{1},...,n_{p-1}$ and add eq-edges from $n_{i}$ to $n_{i-1}$ ( $i\in\{1,...,p-1\}$ ), where we define $n_{0}=n$ . Let $r\in\mathcal{R}$ be such that $\sharp_{r}>1$ . Let $n^{\prime}_{0},...,n^{\prime}_{q}$ be the nodes with an $r$ -link to $n$ ; then we have $q\leq p-1$ . For each $i\in\{1,...,q\}$ we replace the edge from $n^{\prime}_{i}$ to $n$ by an edge from $n^{\prime}_{i}$ to $n_{i}$ .

We illustrate the construction process with two examples.

Example 7.

Let us consider the following set of rules:

	$\displaystyle r_{1}(X,Y)\wedge r_{2}(Y,Z)$	$\displaystyle\rightarrow r_{3}(X,Z)$
	$\displaystyle r_{3}(X,Y)\wedge r_{1}(Y,Z)$	$\displaystyle\rightarrow r_{2}(X,Z)$

The corresponding $1$ -bounded rule graph is shown in Fig. 5.

Figure 5: Rule graph for Example 7.

Example 8.

Let us consider the following set of rules:

	$\displaystyle r_{1}(X,Y)\wedge r_{2}(Y,Z)$	$\displaystyle\rightarrow r_{3}(X,Z)$
	$\displaystyle r_{4}(X,Y)\wedge r_{5}(Y,Z)$	$\displaystyle\rightarrow r_{1}(X,Z)$
	$\displaystyle r_{4}(X,Y)\wedge r_{5}(Y,Z)$	$\displaystyle\rightarrow r_{2}(X,Z)$

The corresponding $2$ -bounded rule graph is shown in Fig. 6. Note that this graph is in fact also a rule graph: since there are no cyclic dependencies in the rule base, $\mathcal{P}\cup\mathcal{G}\models_{2}(e,r,g)$ is equivalent to $\mathcal{P}\cup\mathcal{G}\models(e,r,g)$ .

Figure 6: Rule graph for Example 8.

The construction process clearly terminates after a finite number of steps. Indeed, only edges that are on a path of length at most $m$ are expanded in step 3, and given that there are only finitely many such paths, step 3 must terminate. It is also straightforward to see that the other steps must terminate. As the following proposition shows, the proposed process indeed constructs an $m$ -bounded rule graph.

Proposition 6.

Let $\mathcal{P}$ be a set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method for $m$ -bounded rule graphs. It holds that $\mathcal{H}$ satisfies (R1)–(R3) and (R4m).

		$\mathcal{R}_{\textit{Train}}$	$\mathcal{E}_{\textit{Train}}$	$\mathcal{G}_{\textit{Train}}$	$\mathcal{R}_{\textit{Test}}$	$\mathcal{E}_{\textit{Test}}$	$\mathcal{G}_{\textit{Test}}$
FB15k-237	v1	180	1594	5226	142	1093	2404
	v2	200	2608	12085	172	1660	5092
	v3	215	3668	22394	183	2501	9137
	v4	219	4707	33916	200	3051	14554
WN18RR	v1	9	2746	6678	8	922	1991
	v2	10	6954	18968	10	2757	4863
	v3	11	12078	32150	11	5084	7470
	v4	9	3861	9842	9	7084	15157
NELL-995	v1	14	3103	5540	14	225	1034
	v2	88	2564	10109	79	2086	5521
	v3	142	4647	20117	122	3566	9668
	v4	76	2092	9289	61	2795	8520

Table 1: Number of relations, entities, and triples of the train, validation, and test splits of the training and testing graphs of the inductive benchmarks, split by the corresponding benchmark versions v1-4.

		FB15k-237				WN18RR				NELL-995
		v1	v2	v3	v4	v1	v2	v3	v4	v1	v2	v3	v4
GNN	CoMPILE	0.676	0.829	0.846	0.874	0.836	0.798	0.606	0.754	0.583	0.938	0.927	0.751
	GraIL	0.642	0.818	0.828	0.893	0.825	0.787	0.584	0.734	0.595	0.933	0.914	0.732
	NBFNet	0.845	0.949	0.946	0.947	0.946	0.897	0.904	0.889	0.644	0.953	0.967	0.928
Rule	RuleN	0.498	0.778	0.877	0.856	0.809	0.782	0.534	0.716	0.535	0.818	0.773	0.614
	AnyBURL	0.604	0.823	0.847	0.849	0.867	0.828	0.656	0.796	0.683	0.835	0.798	0.652
Diff-R	DRUM	0.529	0.587	0.529	0.559	0.744	0.689	0.462	0.671	0.194	0.786	0.827	0.806
	Neural-LP	0.529	0.589	0.529	0.559	0.744	0.689	0.462	0.671	0.408	0.787	0.827	0.806
	ReshufflE	0.747	0.885	0.903	0.918	0.710	0.729	0.602	0.694	0.638	0.861	0.882	0.812

Table 2: Hits@10 for 50 negative samples on inductive KGC split by method type (GNN-based vs. rule-based vs. differentiable rule-based).

7 Experimental Results

We now empirically evaluate the effectiveness of the proposed model. We focus on inductive KG completion, as the need to capture reasoning patterns is intuitively more important for this setting compared to the traditional (i.e. transductive) setting. Our model has significant practical advantages compared to the state-of-the-art models. For instance, by only comparing the learned embeddings at query time, it is significantly more efficient than approaches that use GNNs for evaluating queries. Moreover, by using a monotonic GNN, our embeddings can straightforwardly be updated when new knowledge becomes available. As such, our main interest is to see whether our model can be competitive in terms of link prediction performance rather than expecting it to improve the state-of-the-art in this respect.

Datasets

We evaluate ReshufflE on the three standard benchmarks for inductive knowledge graph completion (KGC) that were derived by ? (?) from the datasets: FB15k-237, WN18RR, and NELL-995. Each of these inductive benchmarks contains four different dataset variants, named v1 to v4, and each of these variants consists of two graphs (the training and testing graph) that are sampled from the original dataset as follows. The training graph $\mathcal{G}_{\textit{Train}}$ was obtained by randomly sampling different numbers of entities and selecting their $k$ -hop neighbourhoods. Next, to construct a disjoint testing graph $\mathcal{G}_{\textit{Test}}$ , the entities of $\mathcal{G}_{\textit{Train}}$ were removed from the initial graph, and the same sampling procedure was repeated. Each of these graphs was split into a train set ( $80\%$ ), validation set ( $10\%$ ), and test set ( $10\%$ ). Thus, the three inductive benchmarks consist in total of twelve datasets: FB15k-237 v1-4, WN18RR v1-4, and NELL-995 v1-4. Furthermore, each of these datasets consists of six graphs: the train, validation, and test splits of $\mathcal{G}_{\textit{Train}}$ and $\mathcal{G}_{\textit{Test}}$ . Table 1 states the entity, relation, and triple counts of each graph. The supplementary materials provide additional information about these benchmarks, such as their origins and licenses.

Experimental Setup

Following ? (?), we train ReshufflE on the train split of $\mathcal{G}_{\textit{Train}}$ , tune our model’s hyper-parameters on the validation split of $\mathcal{G}_{\textit{Train}}$ , and finally evaluate the performance of the best model on the test split of $\mathcal{G}_{\textit{Test}}$ . As discussed by ? (?), some approaches in the literature have been evaluated in different ways, e.g. by tuning hyper-parameters on the validation split of $\mathcal{G}_{\textit{Test}}$ , and their reported results are thus not directly comparable. ReshufflE is trained on an NVIDIA Tesla V100 PCIe 32 GB GPU. We train ReshufflE for up to $1000$ epochs, minimizing the margin ranking loss (see Equation 8) with the Adam optimiser (?). If the Hits@10 score on the validation split of $\mathcal{G}_{\textit{Train}}$ does not increase by at least $1\%$ within $100$ epochs, we stop the training early. To account for small performance fluctuations, we repeat our experiments three times and report ReshufflE’s average performance (results for all seeds and the resulting standard deviations are provided in the supplementary materials). For the final evaluation, we select the hyper-parameter configuration with the highest Hits@10 score on the validation split of $\mathcal{G}_{\textit{Train}}$ . In accordance with ? (?), we evaluate ReshufflE’s test performance on $50$ negatively sampled entities per triple of the test split of $\mathcal{G}_{\textit{Test}}$ and report the Hits@10 scores. We list further details about the experimental setup in the supplementary materials. To facilitate ReshufflE’s reuse by our community, we will provide its source code in a public GitHub repository upon acceptance of our paper.
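To make the evaluation metric concrete, the following Python sketch computes Hits@10 when each test triple is ranked against a fixed set of sampled negatives. It is an illustration only: the function name `hits_at_k`, the convention that a higher score means a more plausible triple, and the optimistic handling of ties (only strictly higher-scoring negatives push the rank down) are assumptions of this sketch and are not specified in the text above.

```python
def hits_at_k(pos_scores, neg_scores, k=10):
    """Fraction of test triples ranked within the top k among their negatives.

    pos_scores[i] is the score of the i-th test triple and neg_scores[i] the
    scores of its sampled negatives (50 per triple in the setup above).
    Assumption: higher scores are better; ties are resolved optimistically.
    """
    hits = 0
    for pos, negs in zip(pos_scores, neg_scores):
        rank = 1 + sum(1 for s in negs if s > pos)
        hits += rank <= k
    return hits / len(pos_scores)


# Toy data: the first triple beats all 50 negatives, the second beats none.
pos_scores = [0.9, 0.1]
neg_scores = [[0.5] * 50, [0.2] * 50]
score = hits_at_k(pos_scores, neg_scores, k=10)
```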

Baselines

As the analysis in Sections 5 and 6 reveals, our GNN model acts as a kind of differentiable rule base. We therefore compare ReshufflE to existing approaches for differentiable rule learning: Neural-LP (?) and DRUM (?). We also compare our method to two classical rule learning methods: RuleN (?) and AnyBURL (?). Finally, we include a comparison with GNN-based approaches: CoMPILE (?), GraIL (?), and NBFNet (?).

	FB15k-237				WN18RR				NELL-995
	v1	v2	v3	v4	v1	v2	v3	v4	v1	v2	v3	v4
ReshufflE²	0.304	0.569	0.385	0.916	0.293	0.309	0.155	0.270	0.488	0.558	0.334	0.370
ReshufflE_nL	0.744	0.890	0.903	0.917	0.698	0.685	0.618	0.682	0.627	0.738	0.886	0.815
ReshufflE	0.747	0.885	0.903	0.918	0.710	0.729	0.602	0.694	0.638	0.861	0.882	0.812

Table 3: Hits@10 for 50 negative samples on inductive KGC for each ablation of ReshufflE.

Inductive KGC Results

Table 2 reports the performance of ReshufflE on the inductive benchmarks. The results of ReshufflE were obtained by us; AnyBURL and NBFNet results are from ? (?); Neural-LP, DRUM, RuleN, and GraIL results are from ? (?); and CoMPILE results are from ? (?). Table 2 reveals that ReshufflE consistently outperforms the differentiable rule learners DRUM and Neural-LP, often by a significant margin (with WN18RR-v1 the only exception). Compared to the traditional rule learners, ReshufflE performs clearly better on FB15k-237 and NELL-995 (apart from v1) but underperforms on WN18RR. ? (?) found that the kind of rules which are needed for WN18RR are much noisier than those which are needed for FB15k-237 and NELL-995. Our use of ordering constraints may be less suitable in such cases. Finally, compared to the GNN-based methods, ReshufflE outperforms CoMPILE and GraIL on FB15k-237 and NELL-995 v1 and v4 while again (mostly) underperforming on WN18RR. ReshufflE furthermore consistently underperforms the state-of-the-art method NBFNet. Recall, however, that ReshufflE is significantly more efficient than such GNN-based approaches, as ReshufflE can score the plausibility of a given triple almost instantaneously.

Ablation Study

Finally, we empirically investigate ReshufflE’s components. We consider two variants for this study, namely: $(i)$ ReshufflE_nL, which does not add a self-loop relation to the KG (i.e. triples of the form $(e,\textit{eq},e)$ ); and $(ii)$ ReshufflE², which allows for more general $\mathbf{B_{r}}$ matrices. In particular, different from ReshufflE, which applies the softmax function on the rows of $\mathbf{B_{r}}$ (see Section 4), ReshufflE² squares the $\mathbf{B_{r}}$ matrices component-wise, thereby allowing them to contain arbitrary positive values. For a fair comparison, we train each of ReshufflE’s versions with the same hyper-parameter values, experimental setup, and evaluation protocol (see supplementary materials). Table 3 depicts the outcome of this study. It reveals that ReshufflE performs comparably to or better than ReshufflE_nL and outperforms ReshufflE² on all benchmarks, in most cases dramatically (FB15k-237 v4, where the two are nearly tied, is the exception). The similar performance of ReshufflE and ReshufflE_nL on most datasets suggests that the self-loop relation only matters in specific cases, which may not occur frequently in some datasets. The poor performance of ReshufflE² is as expected since allowing arbitrary positive parameters makes overfitting the training data more likely.
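The difference between the two parameterisations of $\mathbf{B_{r}}$ can be illustrated with a small Python sketch. It only contrasts the two transformations named above (row-wise softmax versus component-wise squaring) on an arbitrary toy matrix; the helper names `row_softmax` and `componentwise_square` and the numbers are assumptions for illustration and do not reproduce the actual model code.

```python
import math


def row_softmax(matrix):
    """Apply softmax to each row: entries lie in (0, 1) and each row sums to 1."""
    out = []
    for row in matrix:
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def componentwise_square(matrix):
    """Square every entry: entries are non-negative but otherwise unbounded."""
    return [[v * v for v in row] for row in matrix]


raw = [[2.0, -1.0, 0.5], [0.0, 3.0, -2.0]]  # unconstrained toy parameters
soft = row_softmax(raw)
squared = componentwise_square(raw)
```

With softmax, every row is normalised, so each row can at most approximate a selection of a single coordinate block. With squaring, entries such as `squared[1][1] == 9.0` are unconstrained in magnitude, which is consistent with the observation that this variant is more prone to overfitting.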

8 Conclusions

We have proposed a region-based knowledge graph embedding model that can faithfully capture rule bases. Specifically, we have shown that embeddings can be constructed that exactly capture the deductive closure of a rule base, provided that the rules are left-regular, a condition which is inspired by left-regular grammars. Furthermore, we have shown that for arbitrary sets of closed path rules, we can learn embeddings which faithfully capture consequences that can be inferred using a bounded number of steps. In this way, our approach goes significantly beyond existing region-based embedding models. An important design choice is that our entity embeddings are constructed using a monotonic GNN, which essentially acts as a differentiable representation of a rule base. We introduced the notion of the rule graph to make this connection between the GNN model and rule bases explicit. The monotonic nature of the GNN also has practical advantages, in particular, the fact that entity embeddings can easily and efficiently be updated when new knowledge becomes available. However, this approach is perhaps less suitable for cases where we need to weigh different pieces of weak evidence (as illustrated by the disappointing results on WN18RR). In such cases, when further evidence becomes available, we may want to revise earlier assumptions, which is not possible with the proposed model. Developing effective models that can provably simulate non-monotonic (or probabilistic) reasoning thus remains an important challenge for future work.

Appendix A Constructing GNNs from Rule Graphs

Let $\mathcal{P}$ be a set of closed path rules and let $\mathcal{H}$ be a corresponding rule graph, satisfying the conditions (R1)–(R4). We also assume that a knowledge graph $\mathcal{G}$ is given. We show that the GNN, which is constructed based on $\mathcal{H}$ , correctly simulates the rules from $\mathcal{P}$ . For the proofs, it will be more convenient to characterise the GNN in terms of operations on the coordinates of entity embeddings. Specifically, let $Z_{i}=\{(i-1)k+1,...,(i-1)k+k\}$ and let $N_{r}\subseteq\{n_{1},...,n_{\ell}\}$ be the set of nodes from the rule graph $\mathcal{H}$ which have an incoming edge labelled with $r$ . We define:

I_{r}=\bigcup_{n_{i}\in N_{r}}Z_{i}

Let $n_{i}\in N_{r}$ and let $(n_{j},n_{i})$ be the unique incoming edge with label $r$ . Then we define ( $t\in\{1,...,k\}$ ):

\sigma_{r}((i-1)k+t)=(j-1)k+t

Now let us define:

\mu_{r}(e_{1},...,e_{d})=(e_{1}^{\prime},...,e_{d}^{\prime})

where $e_{i}^{\prime}=e_{\sigma_{r}(i)}$ if $i\in I_{r}$ and $e_{i}^{\prime}=0$ otherwise. Let $\mathbf{e}^{(l)}$ be the entity embedding corresponding to the matrix $\mathbf{Z_{e}^{(l)}}$ . In other words, if we write $z_{ij}$ for the components of $\mathbf{Z_{e}^{(l)}}$ and $e_{i}$ for the components of $\mathbf{e}^{(l)}$ , then we have $z_{ij}=e_{(i-1)k+j}$ . For a matrix $\mathbf{X}=(x_{ij})$ , let us write $\textit{flatten}(\mathbf{X})$ for the vector that is obtained by concatenating the rows of $\mathbf{X}$ . In particular, $\textit{flatten}(\mathbf{Z_{e}^{(l)}})=\mathbf{e}^{(l)}$ . The following lemma reveals how the GNN constructed from the rule graph $\mathcal{H}$ can be characterised in terms of entity embeddings.

Lemma 1.

It holds that $\textit{flatten}(\mathbf{B_{r}}\mathbf{Z_{e}^{(l)}})=\mu_{r}(\mathbf{e}^{(l)})$ .

Proof.

Let us write $\textit{flatten}(\mathbf{B_{r}}\mathbf{Z_{e}^{(l)}})=(x_{1},...,x_{d})$ , $\mu_{r}(\mathbf{e}^{(l)})=(y_{1},...,y_{d})$ and $\mathbf{e}^{(l)}=(e_{1},...,e_{d})$ . Let $i\in\{1,...,\ell\}$ . Let us first assume that $n_{i}$ does not have any incoming edges in $\mathcal{H}$ which are labelled with $r$ . In that case, row $i$ of $\mathbf{B_{r}}$ consists only of 0s and we have $x_{(i-1)k+1}=...=x_{(i-1)k+k}=0$ . Similarly, we then also have $(i-1)k+j\notin I_{r}$ for $j\in\{1,...,k\}$ and thus $y_{(i-1)k+1}=...=y_{(i-1)k+k}=0$ . Now assume that there is an edge from $n_{j}$ to $n_{i}$ which is labelled with $r$ . Then we have that row $i$ of $\mathbf{B_{r}}$ is a one-hot vector with 1 at position $j$ . Accordingly, we have $x_{(i-1)k+t}=e_{(j-1)k+t}$ for $t\in\{1,...,k\}$ . Moreover, we then have $\sigma_{r}((i-1)k+t)=(j-1)k+t$ and thus $y_{(i-1)k+t}=e_{(j-1)k+t}$ . ∎
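The statement of Lemma 1 can be checked numerically on a small instance. The Python sketch below builds $\mathbf{B_{r}}$ from a toy rule graph, multiplies it with a matrix $\mathbf{Z}$ and compares the flattened result with the coordinate-wise map $\mu_{r}$ . The graph, the values and the helper names (`relation_matrix`, `matmul`, `flatten`, `mu`) are assumptions of this sketch, and node indices are 0-based here rather than 1-based as in the text.

```python
def relation_matrix(edges, r, num_nodes):
    """B_r: row i is one-hot at position j if there is an r-edge from n_j to n_i."""
    B = [[0.0] * num_nodes for _ in range(num_nodes)]
    for j, label, i in edges:
        if label == r:
            B[i][j] = 1.0
    return B


def matmul(A, Z):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * z for a, z in zip(row, col)) for col in zip(*Z)] for row in A]


def flatten(M):
    """Concatenate the rows of a matrix into one vector."""
    return [v for row in M for v in row]


def mu(edges, r, e, k):
    """mu_r: copy coordinate block j into block i for each r-edge (n_j, n_i)."""
    out = [0.0] * len(e)
    for j, label, i in edges:
        if label == r:
            out[i * k:(i + 1) * k] = e[j * k:(j + 1) * k]
    return out


# Toy rule graph with 4 nodes; edges are (source, label, target), 0-based.
edges = [(0, "r1", 1), (0, "r2", 2), (1, "r2", 3)]
Z = [[0.1, 0.7], [0.4, 0.2], [0.9, 0.3], [0.5, 0.6]]  # one row of size k per node
k = 2
lhs = flatten(matmul(relation_matrix(edges, "r2", 4), Z))
rhs = mu(edges, "r2", flatten(Z), k)
```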

For a sequence of relations $r_{1},...,r_{p}$ , we define $\mu_{r_{1};...;r_{p}}$ as follows. We define $\mu_{r_{1};...;r_{p}}(x_{1},...,x_{d})=(y_{1},...,y_{d})$ , where ( $i\in\{1,...,\ell\}$ , $t\in\{1,...,k\}$ ):

y_{(i-1)k+t}=\begin{cases}x_{(j-1)k+t}&\text{if there is an $r_{1};...;r_{p}$ path}\\ &\text{from $n_{j}$ to $n_{i}$}\\ 0&\text{otherwise}\end{cases}

Note that if there is an $r_{1};...;r_{p}$ path arriving at node $n_{i}$ in the rule graph, it has to be unique, given that each node has at most one incoming edge of a given type. In the following, we will also use $I_{r_{1};...;r_{p}}$ , defined as follows:

I_{r_{1};...;r_{p}}=\{(i-1)k+t\,|\,\text{there is an $r_{1};...;r_{p}$ path ending in $n_{i}$}\}

We have the following result.

Lemma 2.

For $r_{1},...,r_{p}\in\mathcal{R}$ we have

\mu_{r_{1};...;r_{p}}(x_{1},...,x_{d})=\mu_{r_{p}}(...\mu_{r_{1}}(x_{1},...,x_{d})...)

Proof.

It is sufficient to show

\mu_{r_{1};...;r_{p}}(x_{1},...,x_{d})=\mu_{r_{p}}(\mu_{r_{1};...;r_{p-1}}(x_{1},...,x_{d}))

We have $\mu_{r_{1};...;r_{p-1}}(x_{1},...,x_{d})=(y_{1},...,y_{d})$ , with

y_{(i-1)k+t}=\begin{cases}x_{(j-1)k+t}&\text{if there is an $r_{1};...;r_{p-1}$ path}\\ &\text{from $n_{j}$ to $n_{i}$}\\ 0&\text{otherwise}\end{cases}

We furthermore have $\mu_{r_{p}}(y_{1},...,y_{d})=(z_{1},...,z_{d})$ with

z_{(i-1)k+t}=\begin{cases}y_{(j-1)k+t}&\text{if there is an $r_{p}$-edge}\\ &\text{from $n_{j}$ to $n_{i}$}\\ 0&\text{otherwise}\end{cases}

Taking into account the definition of $(y_{1},...,y_{d})$ , we have $y_{(j-1)k+t}\neq 0$ only if there is an $r_{1};...;r_{p-1}$ path from some node $n_{l}$ to the node $n_{j}$ , in which case we have $y_{(j-1)k+t}=x_{(l-1)k+t}$ . In other words, we have:

z_{(i-1)k+t}=\begin{cases}x_{(l-1)k+t}&\text{if there is an $r_{1};...;r_{p-1}$ path}\\ &\text{from $n_{l}$ to some $n_{j}$ and an}\\ &\text{$r_{p}$-edge from $n_{j}$ to $n_{i}$}\\ 0&\text{otherwise}\end{cases}

In other words, we have

z_{(i-1)k+t}=\begin{cases}x_{(l-1)k+t}&\text{if there is an $r_{1};...;r_{p}$ path}\\ &\text{from $n_{l}$ to $n_{i}$}\\ 0&\text{otherwise}\end{cases}

We thus have $(z_{1},...,z_{d})=\mu_{r_{1};...;r_{p}}(x_{1},...,x_{d})$ . ∎
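Lemma 2 can likewise be checked on a toy rule graph. The Python sketch below implements the path-based definition of $\mu_{r_{1};...;r_{p}}$ by walking backwards along the (unique) incoming edges, and compares it with the composition $\mu_{r_{p}}(...\mu_{r_{1}}(\cdot)...)$ . The graph, the input vector and the helper names `mu_edge` and `mu_path` are assumptions made for illustration; node indices are 0-based, and the graph is chosen so that each node has at most one incoming edge per label.

```python
def mu_edge(edges, r, x, k):
    """mu_r: copy coordinate block j into block i for each r-edge (n_j, n_i)."""
    out = [0.0] * len(x)
    for j, label, i in edges:
        if label == r:
            out[i * k:(i + 1) * k] = x[j * k:(j + 1) * k]
    return out


def mu_path(edges, labels, x, k, num_nodes):
    """mu_{r1;...;rp}: block i receives block j iff an r1;...;rp path runs n_j -> n_i."""
    incoming = {(label, i): j for j, label, i in edges}  # unique by assumption
    out = [0.0] * len(x)
    for i in range(num_nodes):
        cur = i
        for label in reversed(labels):  # follow the path backwards from n_i
            cur = incoming.get((label, cur))
            if cur is None:
                break
        if cur is not None:
            out[i * k:(i + 1) * k] = x[cur * k:(cur + 1) * k]
    return out


edges = [(0, "r1", 1), (0, "r2", 2), (1, "r2", 3), (3, "r1", 2)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # 4 nodes, k = 2 coordinates each
labels = ["r1", "r2", "r1"]

direct = mu_path(edges, labels, x, 2, 4)
composed = x
for r in labels:  # innermost map first: mu_{r1}, then mu_{r2}, then mu_{r1}
    composed = mu_edge(edges, r, composed, 2)
```

Here the only $r_{1};r_{2};r_{1}$ path runs from node 0 via nodes 1 and 3 to node 2, so both computations copy the first coordinate block into the third and set all other coordinates to 0.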

We also have the following result.

Lemma 3.

Suppose $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$ . There exist paths of type $r^{1}_{1};...;r^{1}_{q_{1}}$ and $r^{2}_{1};...;r^{2}_{q_{2}}$ and … and $r^{l}_{1};...;r^{l}_{q_{l}}$ , all of whose eq-reduced type is $r_{1};...;r_{p}$ , such that for every embedding $(x_{1},...,x_{d})$ we have:

\mu_{r}(x_{1},...,x_{d})\preccurlyeq\max_{i=1}^{l}\mu_{r^{i}_{1};...;r^{i}_{q_{i}}}(x_{1},...,x_{d})

Proof.

This follows immediately from the fact that whenever there is an $r$ -edge between two nodes $n$ and $n^{\prime}$ , there must also be a path between these nodes whose eq-reduced type is $r_{1};...;r_{p}$ , because of condition (R3). ∎

The following result shows that the GNN will correctly predict all triples that can be inferred from $\mathcal{G}\cup\mathcal{P}$ .

Proposition 7.

Proof.

Because of Lemma 1, it is sufficient to show that $\mu_{r}(\mathbf{a^{(m)}})\preceq\mathbf{b^{(m)}}$ . If $\mathcal{G}$ contains the triple $(a,r,b)$ then the result is trivially satisfied. Otherwise, $\mathcal{P}\cup\mathcal{G}\models r(a,b)$ implies that $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$ , for some $r_{1},...,r_{p},r\in\mathcal{R}$ such that $\mathcal{G}$ contains triples $(a,r_{1},a_{2}),(a_{2},r_{2},a_{3}),...,(a_{p},r_{p},b)$ , for some $a_{2},...,a_{p}\in\mathcal{E}$ . Because $(a,r_{1},a_{2})\in\mathcal{G}$ , by construction, it holds for each $i\in\mathbb{N}$ that:

\mu_{r_{1}}(\mathbf{a^{(i)}})\preccurlyeq\mathbf{a_{2}^{(i+1)}}

Similarly, because $(a_{2},r_{2},a_{3})\in\mathcal{G}$ , we have $\mu_{r_{2}}(\mathbf{a_{2}^{(i+1)}})\preccurlyeq\mathbf{a_{3}^{(i+2)}}$ and thus

\mu_{r_{2}}(\mu_{r_{1}}(\mathbf{a^{(i)}}))\preccurlyeq\mu_{r_{2}}(\mathbf{a_{2}^{(i+1)}})\preccurlyeq\mathbf{a_{3}^{(i+2)}}

In other words, we have

\mu_{r_{1};r_{2}}(\mathbf{a^{(i)}})\preccurlyeq\mathbf{a_{3}^{(i+2)}}

Continuing in the same way, we find that

\mu_{r_{1};...;r_{p-1};r_{p}}(\mathbf{a^{(i)}})\preccurlyeq\mathbf{b^{(i+p)}}

Now consider a path of type $r^{\prime}_{1};...;r_{q}^{\prime}$ whose eq-reduced type is $r_{1};...;r_{p}$ . Then we have that $\mathcal{G}$ contains triples of the form $(a,r^{\prime}_{1},b_{2}),(b_{2},r^{\prime}_{2},b_{3}),...,(b_{q},r^{\prime}_{q},b)$ . Indeed, the only triples that need to be considered in addition to the triples $(a,r_{1},a_{2}),(a_{2},r_{2},a_{3}),...,(a_{p},r_{p},b)$ are of the form $(a_{i},\textit{eq},a_{i})$ , which we have assumed to belong to $\mathcal{G}$ for every $a_{i}\in\mathcal{E}$ . For every path of type $r^{\prime}_{1};...;r_{q}^{\prime}$ whose eq-reduced type is $r_{1};...;r_{p}$ , we thus find entirely similarly to before that

\mu_{r^{\prime}_{1};...;r^{\prime}_{q}}(\mathbf{a^{(i)}})\preccurlyeq\mathbf{b^{(i+p)}}

Because of Lemma 3, this implies

\mu_{r}(\mathbf{a^{(i)}})\preccurlyeq\mathbf{b^{(i+p)}}

In particular, we have

\mu_{r}(\mathbf{a^{(m)}})\preccurlyeq\mathbf{b^{(m+p)}}

and because of the assumption that the GNN has converged after $m$ steps, we also have $\mu_{r}(\mathbf{a^{(m)}})\preccurlyeq\mathbf{b^{(m)}}$ . ∎

For $e\in\mathcal{E}$ , let $\textit{paths}_{\mathcal{G}}(e)$ be the set of all paths in the knowledge graph $\mathcal{G}$ which end in $e$ . For a path $\pi$ in $\textit{paths}_{\mathcal{G}}(e)$ , we write $\mathit{head}(\pi)$ for the entity where the path starts and $\mathit{rels}(\pi)$ for the corresponding sequence of relations. For an entity $e$ , we write $\textit{emb}_{m}(e)$ for its embedding in layer $m$ , i.e. $\textit{emb}_{m}(e)=\mathbf{e^{(m)}}$ . The following observation follows immediately from the construction of the GNN, together with Lemma 2.

Lemma 4.

For any entity $e\in\mathcal{E}$ it holds that

\mathbf{e^{(m)}}\preceq\max\Big(\mathbf{e^{(0)}},\max_{\pi\in\textit{paths}_{\mathcal{G}}(e)}\mu_{\mathit{rels}(\pi)}\big(\textit{emb}_{0}(\textit{head}(\pi))\big)\Big)

We will also need the following technical lemma.

Lemma 5.

Suppose $\mathcal{P}\cup\mathcal{G}\not\models(a,r,b)$ . Then there is some $i\in\{1,...,\ell\}$ such that:

•

$Z_{i}\subseteq I_{r}$ ; and
•

whenever $\pi\in\textit{paths}_{\mathcal{G}}(b)$ with $\textit{head}(\pi)=a$ , it holds that $I_{\textit{rels}(\pi)}\cap Z_{i}=\emptyset$ .

Proof.

Let us write $\mathcal{Z}_{r}=\{i\in\{1,...,\ell\}\,|\,Z_{i}\subseteq I_{r}\}$ . Note that $i\in\mathcal{Z}_{r}$ iff node $n_{i}$ in $\mathcal{H}$ has an incoming $r$ -edge. It thus follows from condition (R1) that $\mathcal{Z}_{r}\neq\emptyset$ . Suppose that for every $i\in\mathcal{Z}_{r}$ , there was some $\pi\in\textit{paths}_{\mathcal{G}}(b)$ with $\textit{head}(\pi)=a$ such that $I_{\textit{rels}(\pi)}\cap Z_{i}\neq\emptyset$ . Let us write $X=\{\textit{rels}(\pi)\,|\,\pi\in\textit{paths}_{\mathcal{G}}(b),\textit{head}(\pi)=a,I_{\textit{rels}(\pi)}\cap Z_{i}\neq\emptyset\}$ . We then have that for every $r$ -edge in $\mathcal{H}$ , there is a path $\tau$ connecting the same nodes, with $\textit{rels}(\tau)\in X$ . From Condition (R4), it then follows that $\mathcal{P}\cup\mathcal{G}\models(a,r,b)$ , a contradiction. ∎

The following result shows that the GNN is unlikely to predict triples that cannot be inferred from $\mathcal{G}\cup\mathcal{P}$ , as long as the embeddings are sufficiently high-dimensional.

Proposition 8.

$$\textit{Pr}[\mathbf{B_{r}}\mathbf{Z_{a}^{(m)}}\preceq\mathbf{Z_{b}^{(m)}}]\leq\varepsilon$$

Proof.

First, note that because of Lemma 1, what we need to show is equivalent to:

$$\textit{Pr}[\mu_{r}(\mathbf{a^{(m)}})\preceq\mathbf{b^{(m)}}]\leq\varepsilon$$

Let $(a,b)\in\mathcal{E}\times\mathcal{E}$ be such that $\mathcal{P}\cup\mathcal{G}\not\models(a,r,b)$ . From Lemma 5, we know that there is some $i\in\{1,...,\ell\}$ such that $Z_{i}\subseteq I^{1}_{r}$ and whenever $\pi\in\textit{paths}_{\mathcal{G}}(b)$ with $\textit{head}(\pi)=a$ , it holds that $I_{\textit{rels}(\pi)}\cap Z_{i}=\emptyset$ . The following condition is clearly a necessary requirement for $\mu_{r}(\mathbf{a^{(m)}})\preceq\mathbf{b^{(m)}}$ :

$$\forall j\in Z_{i}\,.\,\mu_{r}(\mathbf{a^{(m)}})\preccurlyeq_{j}\mathbf{b^{(m)}}$$

where we write $(x_{1},...,x_{d})\preccurlyeq_{j}(y_{1},...,y_{d})$ for $x_{j}\leq y_{j}$. In particular, we also need that:

$$\forall j\in Z_{i}\,.\,\mu_{r}(\mathbf{a^{(0)}})\preccurlyeq_{j}\mathbf{b^{(m)}}$$

Due to Lemma 4 this is equivalent to requiring that for every $j\in Z_{i}$ we have:

$$\mu_{r}(\mathbf{a^{(0)}})\preccurlyeq_{j}\max\Big(\mathbf{b^{(0)}},\max_{\pi\in\textit{paths}_{\mathcal{G}}(b)}\mu_{\mathit{rels}(\pi)}\big(\textit{emb}_{0}(\textit{head}(\pi))\big)\Big)$$

We can view the coordinates of the input embeddings as random variables. The latter condition is thus equivalent to a condition of the following form:

$$\forall j\in Z_{i}\,.\,A^{r}_{j}\leq\max(B_{j},X^{1}_{j},...,X^{p}_{j})$$

where $A^{r}_{j}$ is the random variable corresponding to the $j$-th coordinate of $\mu_{r}(\mathbf{a^{(0)}})$, $B_{j}$ is the random variable corresponding to the $j$-th coordinate of $\mathbf{b^{(0)}}$, and $X^{1}_{j},...,X^{p}_{j}$ are the random variables corresponding to the $j$-th coordinates of the vectors $\mu_{\mathit{rels}(\pi)}\big(\textit{emb}_{0}(\textit{head}(\pi))\big)$. By construction, the coordinates of different entity embeddings are sampled independently, and for each coordinate there are at least two distinct values that have a non-zero probability of being sampled. This means that there exists some value $\lambda>0$ such that $\textit{Pr}[A^{r}_{j}>B_{j}]\geq\lambda$ and $\textit{Pr}[A^{r}_{j}>X_{j}^{t}]\geq\lambda$ for each $t\in\{1,...,p\}$. Moreover, since we have that whenever $\pi\in\textit{paths}_{\mathcal{G}}(b)$ with $\textit{head}(\pi)=a$ it holds that $I_{\textit{rels}(\pi)}\cap Z_{i}=\emptyset$, it follows that the random variable $A^{r}_{j}$ is not among $B_{j},X^{1}_{j},...,X^{p}_{j}$. We thus have:

$$\begin{aligned}
\textit{Pr}[\forall j\in Z_{i}\,.\,A^{r}_{j}\leq\max(B_{j},X^{1}_{j},...,X^{p}_{j})]&\leq\left(1-\lambda^{p+1}\right)^{|Z_{i}|}\\
&=\left(1-\lambda^{p+1}\right)^{k}\\
&\leq e^{-k\lambda^{p+1}}
\end{aligned}$$

The value of $p$ is upper bounded by $\ell\cdot|\mathcal{E}|$ , with $\ell$ the number of nodes in the rule graph. By choosing $k$ sufficiently large, we can thus make this probability arbitrarily small. In particular:

$$e^{-k\lambda^{p+1}}\leq\varepsilon\quad\Leftrightarrow\quad k\geq\frac{1}{\lambda^{p+1}}\log\frac{1}{\varepsilon}$$

∎
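The closing inequality of this proof can be evaluated numerically. The following is a minimal sketch, assuming that $\lambda$, $p$ and $\varepsilon$ are given as plain numbers; the function names `min_block_size` and `failure_bounds` are hypothetical and are not part of ReshufflE's code.

```python
import math
from typing import Tuple


def min_block_size(lam: float, p: int, eps: float) -> int:
    """Smallest integer k satisfying k >= (1 / lam**(p+1)) * log(1 / eps)."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if p < 0:
        raise ValueError("p must be non-negative")
    return math.ceil(math.log(1.0 / eps) / lam ** (p + 1))


def failure_bounds(lam: float, p: int, k: int) -> Tuple[float, float]:
    """Return ((1 - lam^(p+1))^k, exp(-k * lam^(p+1))), the two bounds used in the proof."""
    q = lam ** (p + 1)
    return (1.0 - q) ** k, math.exp(-k * q)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the experiments.
    k = min_block_size(lam=0.5, p=2, eps=0.01)
    print(k, failure_bounds(0.5, 2, k))
```

Because $\lambda^{p+1}$ appears in the denominator, the value of $k$ required by this bound grows exponentially with $p$.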

Appendix B Constructing Rule Graphs

We write $\mathcal{R}_{1}$ for the set of relations that appear in the head of some rule from the considered rule base, and $\mathcal{R}_{2}=\mathcal{R}\setminus\mathcal{R}_{1}$ for the remaining relations.

Proposition 9.

Proof.

Let $\alpha(r_{i})=r_{i}$ if $r_{i}\in\mathcal{R}_{2}$ and $\alpha(r_{i})=\overline{r}_{i}$ otherwise. We clearly have that $\alpha(r_{1})...\alpha(r_{k})\in L_{r}$ iff $\mathcal{P}$ entails the following rule:

$$r_{1}(X_{1},X_{2})\wedge...\wedge r_{k}(X_{k},X_{k+1})\rightarrow r(X_{1},X_{k+1})$$

Since we have assumed that $\mathcal{P}$ has a rule graph, thanks to conditions (R3) and (R4), we can check whether this rule is valid by checking whether for each edge labelled with $r$ there is a path connecting the same nodes whose eq-reduced type is $r_{1};...;r_{k}$. Let $(n_{i},n_{j})$ be an edge labelled with $r$. Then, we can construct a finite state machine (FSM) from $\mathcal{H}$ by treating $n_{i}$ as the start node and $n_{j}$ as the unique final node and interpreting eq-edges as $\varepsilon$-transitions (i.e. corresponding to the empty string). Clearly, this FSM will accept the string $r_{1}...r_{k}$ if there is a path labelled with $r_{1};...;r_{k}$ connecting $n_{i}$ to $n_{j}$. For each edge labelled with $r$, we can construct such an FSM. Let $F_{1},...,F_{m}$ be the languages associated with these FSMs. By construction, $L_{r}$ is the intersection of $F_{1},...,F_{m}$. Since $F_{1},...,F_{m}$ are regular, it follows that $L_{r}$ is regular as well. ∎
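The FSM argument above can be illustrated with a small simulation. The sketch below assumes that a rule graph is given as a list of labelled edges `(source, label, target)`, treats eq-edges as $\varepsilon$-transitions, and tests membership in the intersection of the per-edge languages. The toy graph and all function names are hypothetical, and the renaming $\alpha$ is ignored for simplicity.

```python
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)


def eps_closure(nodes: Set[str], edges: List[Edge]) -> Set[str]:
    """All nodes reachable from `nodes` by following eq-edges only."""
    closure = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for src, label, dst in edges:
            if src == node and label == "eq" and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return closure


def accepts(edges: List[Edge], start: str, final: str, word: Sequence[str]) -> bool:
    """Simulate the FSM with start node `start` and unique final node `final`."""
    current = eps_closure({start}, edges)
    for symbol in word:
        step = {dst for src, label, dst in edges if src in current and label == symbol}
        current = eps_closure(step, edges)
    return final in current


def in_intersection(edges: List[Edge], r: str, word: Sequence[str]) -> bool:
    """True iff the FSM of every r-edge accepts `word` (assumes at least one r-edge)."""
    r_edges = [(src, dst) for src, label, dst in edges if label == r]
    return bool(r_edges) and all(accepts(edges, s, d, word) for s, d in r_edges)


# Hypothetical toy graph: the r3-edge from n0 to n2 is matched by a path r1; eq; r2.
TOY_GRAPH: List[Edge] = [
    ("n0", "r1", "n1"),
    ("n1", "eq", "n1b"),
    ("n1b", "r2", "n2"),
    ("n0", "r3", "n2"),
]

if __name__ == "__main__":
    print(in_intersection(TOY_GRAPH, "r3", ["r1", "r2"]))
    print(in_intersection(TOY_GRAPH, "r3", ["r2", "r1"]))
```

Simulating the sets of reachable nodes directly avoids building an explicit product automaton for the intersection, which is sufficient for testing individual strings.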

B.1 Left Regular Rule Bases

We show that the graph resulting from the construction process satisfies the conditions (R1)-(R4). The fact that (R1) is satisfied follows from the following lemma.

Lemma 6.

Let $\mathcal{P}$ be a left-regular set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method. For every $r\in\mathcal{R}$ , it holds that $\mathcal{H}$ contains an outgoing $r$ -edge from $n_{0}$ .

Proof.

Let $r\in\mathcal{R}$ . The edge from $n_{0}$ to $n_{r}$ is added in step 2 of the construction process. This edge may be removed in step 4, but in that case, a new $r$ -edge is added from $n_{0}$ to a fresh node. ∎

The fact that (R2) is satisfied follows immediately from the construction in step 4. We now move to condition (R3).

Lemma 7.

Let $\mathcal{P}$ be a left-regular set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method. If $\mathcal{P}$ contains the rule $r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\rightarrow r_{3}(X_{1},X_{3})$ , then whenever two nodes $n$ and $n^{\prime}$ are connected in $\mathcal{H}$ by a path whose eq-reduced type is $r_{3}$ , there is some node $n^{\prime\prime}$ such that $n$ and $n^{\prime\prime}$ are connected by a path whose eq-reduced type is $r_{1}$ and $n^{\prime\prime}$ and $n^{\prime}$ are connected by a path whose eq-reduced type is $r_{2}$ .

Proof.

The stated assertion clearly holds after step 3 of the construction method. Indeed, the only $r_{3}$ -edge in $\mathcal{H}$ is from $n_{0}$ to $n_{r_{3}}$ . Note in particular that no $r_{3}$ edges can be added in step 3, given our assumption that $\mathcal{P}$ is left-regular. Finally, it is also easy to see that this property remains satisfied after step 4. ∎

The next lemma shows that (R3) is satisfied.

Lemma 8.

Let $\mathcal{P}$ be a left-regular set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method. Suppose nodes $n$ and $n^{\prime}$ are connected with an edge of type $r$ and suppose $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$. Then there is a path whose eq-reduced type is $r_{1};...;r_{p}$ from $n$ to $n^{\prime}$.

Proof.

Assume $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$. Let $n$ and $n^{\prime}$ be nodes connected by an edge of type $r$. We show the result by structural induction. First, suppose $p=2$. In this case, the considered rule is of the form $r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\rightarrow r(X_{1},X_{3})$. It then follows from Lemma 7 that there is a path whose eq-reduced type is $r_{1};r_{2}$ connecting $n$ and $n^{\prime}$. Let us now consider the inductive case. If $p>2$ then $r_{1}(X_{1},X_{2})\wedge r_{2}(X_{2},X_{3})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$ is derived from at least two rules in $\mathcal{P}$ (given that the rules in $\mathcal{P}$ were restricted to have only two atoms in the body). The last step of the derivation of this rule is done by selecting some rule $s_{1}(X,Y)\wedge s_{2}(Y,Z)\rightarrow r(X,Z)$ from $\mathcal{P}$ such that

$$\begin{aligned}
\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge...\wedge r_{i-1}(X_{i-1},X_{i})&\rightarrow s_{1}(X_{1},X_{i})\\
\mathcal{P}\models r_{i}(X_{i},X_{i+1})\wedge...\wedge r_{p}(X_{p},X_{p+1})&\rightarrow s_{2}(X_{i},X_{p+1})
\end{aligned}$$

If there is a path from $n$ to $n^{\prime}$ whose eq-reduced type is $r$, we know from Lemma 7 that there must be a path from $n$ to $n^{\prime\prime}$ with eq-reduced type $s_{1}$ and a path from $n^{\prime\prime}$ to $n^{\prime}$ with eq-reduced type $s_{2}$, for some node $n^{\prime\prime}$ in $\mathcal{H}$. By induction, we furthermore know that there must then be a path with eq-reduced type $r_{1};...;r_{i-1}$ from $n$ to $n^{\prime\prime}$ and a path with eq-reduced type $r_{i};...;r_{p}$ from $n^{\prime\prime}$ to $n^{\prime}$. Thus, we find that there must be a path with eq-reduced type $r_{1};...;r_{p}$ from $n$ to $n^{\prime}$. ∎

The fact that (R4) is satisfied follows from the next lemma.

Lemma 9.

Let $\mathcal{P}$ be a left-regular set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method. Suppose there is a path in $\mathcal{H}$ from $n_{0}$ to $n_{r}$ whose eq-reduced type is $r_{1};...;r_{p}$. Then it holds that $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$.

Proof.

The result clearly holds after step 2. We show that the result remains valid after each iteration of step 3. Suppose in step 3 we add an $r_{2}$ -edge between $n_{r_{1}}$ and $n_{r_{3}}$ . This means that:

$$\mathcal{P}\models r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)$$

Let $\tau$ be a path from $n_{0}$ to $n_{r}$ . If $\tau$ does not contain the new $r_{2}$ -edge, then the fact that the result is valid for $\tau$ follows by induction. Now, suppose that $\tau$ contains the new $r_{2}$ edge. Then $\tau$ is of the form $r_{i_{1}};...;r_{i_{s}};r_{2};r_{j_{1}};...;r_{j_{t}}$ . By induction we have:

$$\mathcal{P}\models r_{i_{1}}(X_{1},X_{2})\wedge...\wedge r_{i_{s}}(X_{s},X_{s+1})\rightarrow r_{1}(X_{1},X_{s+1})$$

Clearly there is a path from $n_{0}$ to $n_{r_{3}}$ with eq-reduced type $r_{3}$. In particular, there is a path from $n_{0}$ to $n_{r}$ with eq-reduced type $r_{3};r_{j_{1}};...;r_{j_{t}}$. By induction, we thus have:

$$\mathcal{P}\models r_{3}(X_{0},X_{1})\wedge r_{j_{1}}(X_{1},X_{2})\wedge...\wedge r_{j_{t}}(X_{t},X_{t+1})\rightarrow r(X_{0},X_{t+1})$$

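Together with the rule $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r_{3}(X,Z)$, these two entailments can be chained: the relations $r_{i_{1}};...;r_{i_{s}}$ entail $r_{1}$, which combined with $r_{2}$ entails $r_{3}$, which combined with $r_{j_{1}};...;r_{j_{t}}$ entails $r$. We thus find that the stated result is satisfied for $\tau$.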

Finally, we need to show that the result remains satisfied after step 4. This is clearly the case, as this step replaces edges of type $r$ with paths of type $r;\textit{eq};...;\textit{eq}$ . The eq-reduced types of the paths from $n_{0}$ to $n_{r}$ thus remain unchanged after this step. ∎
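The invariance used in the last step of this proof can be checked on a small example. The sketch below assumes that the eq-reduced type of a path is its sequence of edge labels with all eq labels removed; the helper names `eq_reduced_type` and `expand_edge` are hypothetical.

```python
from typing import List, Sequence, Tuple


def eq_reduced_type(labels: Sequence[str]) -> Tuple[str, ...]:
    """Drop every eq label from the sequence of edge labels of a path."""
    return tuple(label for label in labels if label != "eq")


def expand_edge(labels: Sequence[str], index: int, num_eq: int) -> List[str]:
    """Replace the edge at `index` by the same edge followed by `num_eq` eq-edges."""
    return list(labels[: index + 1]) + ["eq"] * num_eq + list(labels[index + 1:])


if __name__ == "__main__":
    before = ["r1", "r2", "r3"]
    after = expand_edge(before, index=1, num_eq=2)  # r2 becomes r2; eq; eq
    print(after, eq_reduced_type(after) == eq_reduced_type(before))
```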

Proposition 10.

Let $\mathcal{P}$ be a left-regular set of closed path rules and let $\mathcal{H}$ be the graph obtained using the proposed construction method. It holds that $\mathcal{H}$ satisfies (R1)–(R4).

Proof.

The fact that (R1), (R3) and (R4) are satisfied follows immediately from Lemmas 6, 8 and 9. The fact that (R2) is satisfied follows trivially from the construction. ∎

B.2 Bounded Inference

Let $\textit{paths}^{m}_{\mathcal{G}}(b)$ be the set of all paths in $\mathcal{G}$ of length at most $m$ that end in $b$.

Lemma 10.

For any entity $e\in\mathcal{E}$ it holds that

$$\mathbf{e^{(m)}}\preceq\max\Big(\mathbf{e^{(0)}},\max_{\pi\in\textit{paths}^{m}_{\mathcal{G}}(e)}\mu_{\mathit{rels}(\pi)}\big(\textit{emb}_{0}(\textit{head}(\pi))\big)\Big)$$

Proof.

This follows immediately from the construction of the GNN. ∎
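To make the set $\textit{paths}^{m}_{\mathcal{G}}(b)$ concrete, the following sketch enumerates it for a knowledge graph given as a list of `(head, relation, tail)` triples. It assumes that the length of a path is its number of edges and that the empty path is excluded; the function names and the toy graph are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)
Path = List[Triple]


def paths_ending_in(triples: List[Triple], b: str, m: int) -> List[Path]:
    """All paths with between 1 and m edges whose last edge ends in entity b."""
    paths: List[Path] = []
    frontier: List[Path] = [[t] for t in triples if t[2] == b]  # paths of length 1
    length = 1
    while frontier and length <= m:
        paths.extend(frontier)
        # Extend each path backwards by one edge whose tail is the current start entity.
        frontier = [[t] + path for path in frontier for t in triples if t[2] == path[0][0]]
        length += 1
    return paths


def head(path: Path) -> str:
    """Entity where the path starts."""
    return path[0][0]


def rels(path: Path) -> List[str]:
    """Sequence of relations along the path."""
    return [r for _, r, _ in path]


TOY_KG: List[Triple] = [("a", "r1", "c"), ("c", "r2", "b"), ("d", "r3", "b")]

if __name__ == "__main__":
    for path in paths_ending_in(TOY_KG, "b", m=2):
        print(head(path), rels(path))
```

Since edges may be revisited in cyclic graphs, the bound $m$ is what keeps this enumeration finite.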

Lemma 11.

Let $\ell$ be the number of nodes in the given $m$ -bounded rule graph. Suppose $\mathcal{P}\cup\mathcal{G}\not\models_{m}(a,r,b)$ . Then there is some $i\in\{1,...,\ell\}$ such that:

• $Z_{i}\subseteq I_{r}$; and

• whenever $\pi\in\textit{paths}^{m+1}_{\mathcal{G}}(b)$ with $\textit{head}(\pi)=a$, it holds that $I_{\textit{rels}(\pi)}\cap Z_{i}=\emptyset$.

Proof.

This lemma is shown in exactly the same way as Lemma 5, simply replacing $\textit{paths}_{\mathcal{G}}(b)$ by $\textit{paths}^{m+1}_{\mathcal{G}}(b)$ and replacing Condition (R4) by Condition (R4m). ∎

Proposition 11.

$$\textit{Pr}[\mathbf{B_{r}}\mathbf{Z_{a}^{(i)}}\preceq\mathbf{Z_{b}^{(i)}}]\leq\varepsilon$$

Proof.

This result is shown in the same way as Proposition 2, by relying on Lemma 11 instead of Lemma 5. ∎

Let us now show the correctness of the proposed process for constructing $m$ -bounded rule graphs. Conditions (R1) and (R2) are clearly satisfied. Next, we show that condition (R3) is satisfied.

Lemma 12.

Let $\mathcal{P}$ be a set of closed path rules and let $\mathcal{H}$ be the resulting $m$-bounded rule graph, constructed using the proposed process. Suppose nodes $n$ and $n^{\prime}$ are connected with an edge of type $r$ and suppose $\mathcal{P}\models r_{i_{1}}(X_{1},X_{2})\wedge r_{i_{2}}(X_{2},X_{3})\wedge...\wedge r_{i_{p}}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$. Then there is a path connecting $n$ to $n^{\prime}$, whose eq-reduced type is $r_{i_{1}};...;r_{i_{p}}$.

Proof.

First, we show that at the end of step 4, there must be a path of type $r_{i_{1}};...;r_{i_{p}}$ connecting $n$ and $n^{\prime}$. By construction, we immediately have that whenever two nodes $n$ and $n^{\prime}$ are connected with an $r_{i}$-edge and $\mathcal{P}$ contains the rule $r_{j}(X,Y)\wedge r_{l}(Y,Z)\rightarrow r_{i}(X,Z)$, there exists some node $n^{\prime\prime}$ such that there is an $r_{j}$-edge from $n$ to $n^{\prime\prime}$ and an $r_{l}$-edge from $n^{\prime\prime}$ to $n^{\prime}$. The existence of a path of type $r_{i_{1}};...;r_{i_{p}}$ then follows in the same way as in the proof of Lemma 8. It remains to be shown that the result still holds after step 5. However, the paths in the final graph are those that can be found in the graph after step 4, with the possible addition of some eq-edges. This means in particular that after step 5, there must still be a path from $n$ to $n^{\prime}$ whose eq-reduced type is $r_{i_{1}};...;r_{i_{p}}$. ∎

Finally, the fact that (R4m) is satisfied follows from the following lemma.

Lemma 13.

Let $\mathcal{P}$ be a set of closed path rules, and let $\mathcal{H}$ be the resulting $m$-bounded rule graph, constructed using the process outlined above. Suppose there is a path from $n_{0}$ to $n_{r}$ whose eq-reduced type is $r_{1};...;r_{p}$, with $p\leq m+1$. Then it holds that $\mathcal{P}\models r_{1}(X_{1},X_{2})\wedge...\wedge r_{p}(X_{p},X_{p+1})\rightarrow r(X_{1},X_{p+1})$.

Proof.

We clearly have that the proposition holds after step 3 of the construction method. After step 3, if there is an $r$ -link between nodes $n$ and $n^{\prime}$ and a rule $r_{1}(X,Y)\wedge r_{2}(Y,Z)\rightarrow r(X,Z)$ such that $n$ and $n^{\prime}$ are not connected by an $r_{1};r_{2}$ path, it must be the case that any path from $n_{0}$ to some node $n_{r}$ which contains the edge $(n,n^{\prime})$ must have a length of at least $m+1$ . It follows that any path from $n_{0}$ to some node $n_{r}$ which contains an edge that was added during step 4 must have length at least $m+2$ . We thus have in particular that the proposition still holds after step 4. The paths in the final graph are those that can be found in the graph after step 4, with the possible addition of some eq-edges. Since the proposition only depends on the eq-reduced types of the paths, the result still holds after step 5. ∎

Together, we have shown the following result.

Proposition 12.

Appendix C Experimental Details

This section lists additional details about our experimental setup, benchmark datasets, and evaluation protocol. Section C.1 lists ReshufflE’s implementation details. The origins and licenses of the standard benchmarks for inductive KGC are discussed in Section C.2. Details on ReshufflE’s hyper-parameter optimisation are discussed in Section C.3. Finally, details about the evaluation protocol, together with the complete evaluation results, are provided in Section C.4.

C.1 Implementation Details

ReshufflE was implemented using the Python library PyKEEN 1.10.1 (?). PyKEEN is released under the MIT license and offers numerous benchmarks for KGC, which makes it convenient to reuse ReshufflE’s code for future applications and comparisons. Upon acceptance of our paper, we will provide ReshufflE’s source code in a public GitHub repository to further facilitate the reuse of ReshufflE by the community.

C.2 Benchmarks: Origins and Licenses

We did not find a license for any of the three inductive benchmarks or for their corresponding transductive supersets. WN18RR is a subset of the WordNet database (?), which states lexical relations of English words; we did not find a license for this dataset either. FB15k-237 is a subset of FB15k (?), which is itself a subset of Freebase (?), a collaborative database that contains general knowledge in English, such as about celebrities and awards. We did not find a license for FB15k-237 but found that FB15k (?) uses the CC BY 2.5 license. Finally, NELL-995 (?) is a subset of NELL (?), a dataset that was extracted from semi-structured and natural-language data on the web and that includes information about, e.g., cities, companies, and sports teams. We did not find any license information for NELL either.

C.3 Hyper-Parameter Optimisation

Dataset	Version	#Layers	$l$	$k$	$\lambda$	lr
FB15k-237	v1	4	25	80	2.0	0.005
FB15k-237	v2	3	30	60	1.0	0.005
FB15k-237	v3	5	25	40	0.5	0.005
FB15k-237	v4	3	30	80	1.0	0.01
WN18RR	v1	3	20	40	1.0	0.01
WN18RR	v2	3	20	60	0.5	0.01
WN18RR	v3	3	20	40	1.0	0.01
WN18RR	v4	3	30	80	1.0	0.01
NELL-995	v1	3	20	80	2.0	0.005
NELL-995	v2	4	30	60	2.0	0.01
NELL-995	v3	4	25	40	0.5	0.01
NELL-995	v4	4	30	60	1.0	0.01

Table 4: ReshufflE’s best-performing hyper-parameters on FB15k-237 v1-4, WN18RR v1-4, and NELL-995 v1-4.

	FB15k-237				WN18RR				NELL-995
	v1	v2	v3	v4	v1	v2	v3	v4	v1	v2	v3	v4
Seed 1	0.751	0.879	0.905	0.918	0.713	0.727	0.614	0.693	0.630	0.874	0.871	0.816
Seed 2	0.744	0.892	0.908	0.916	0.707	0.726	0.574	0.690	0.650	0.860	0.893	0.808
Seed 3	0.746	0.883	0.897	0.918	0.710	0.736	0.617	0.698	0.635	0.848	0.881	0.812
mean	0.747	0.885	0.903	0.918	0.710	0.729	0.602	0.694	0.638	0.861	0.882	0.812
stdv	0.004	0.007	0.005	0.001	0.003	0.006	0.024	0.004	0.010	0.013	0.011	0.004

Table 5: ReshufflE’s benchmark Hits@10 scores on all seeds together with the mean (mean) and standard deviation (stdv) of Hits@10.

Following ? (?), we manually tune ReshufflE’s hyper-parameters on the validation split of $\mathcal{G}_{\textit{Train}}$. We use the following ranges for the hyper-parameters: the number of ReshufflE’s layers $\textit{\#Layers}\in\{3,4,5\}$, the embedding dimensionality parameters $l\in\{20,25,30\}$ and $k\in\{40,60,80\}$, the loss margin $\lambda\in\{0.5,1.0,2.0\}$, and the learning rate $\textit{lr}\in\{0.005,0.01\}$. We use the same batch size and negative sampling size for all runs, namely a batch size of $1024$ and a negative sampling size of $100$. Table 4 reports the best hyper-parameters for ReshufflE on each inductive benchmark. Finally, we reuse the same hyper-parameters for each of ReshufflE’s ablations, namely ReshufflE_nL and ReshufflE².
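The search space described above can be written down explicitly. The sketch below enumerates it with `itertools.product`; the dictionary keys are hypothetical names chosen for illustration and do not correspond to a documented configuration format. Since the hyper-parameters were tuned manually, this only illustrates the size of the space and not the tuning procedure itself.

```python
from itertools import product
from typing import Dict, Iterator, Union

Value = Union[int, float]

# Ranges as stated in the text; the key names are assumptions.
SEARCH_SPACE = {
    "num_layers": [3, 4, 5],
    "l": [20, 25, 30],
    "k": [40, 60, 80],
    "margin": [0.5, 1.0, 2.0],
    "lr": [0.005, 0.01],
}

# Settings shared by all runs.
FIXED = {"batch_size": 1024, "num_negatives": 100}


def configurations() -> Iterator[Dict[str, Value]]:
    """Yield every combination of the tunable values together with the fixed settings."""
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[key] for key in keys)):
        yield {**dict(zip(keys, values)), **FIXED}


if __name__ == "__main__":
    all_configs = list(configurations())
    print(len(all_configs))
```

For instance, the setting reported in Table 4 for FB15k-237 v1 (4 layers, $l=25$, $k=80$, $\lambda=2.0$, lr $=0.005$) is one element of this space.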

C.4 Evaluation Protocol and Complete Results

Following the standard evaluation protocol for inductive KGC, introduced by ? (?), we evaluate ReshufflE’s final performance on the test split of the testing graph. For each test triple $r(e,f)$, we measure how it ranks against the corrupted triples obtained from $50$ randomly sampled entities $e^{\prime}_{i}\in\mathcal{E}$ and $f^{\prime}_{i}\in\mathcal{E}$, i.e. against $r(e^{\prime}_{i},f)$ and $r(e,f^{\prime}_{i})$ for all $1\leq i\leq 50$. Following ? (?), we report the Hits@10 metric, i.e., the proportion of true triples (those within the test split of the testing graph) among the predicted triples whose rank is at most $10$.
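The protocol above can be sketched as follows. This is an illustration under several assumptions rather than the evaluation code used for the experiments: `score` is a hypothetical scoring function where higher values mean more plausible triples, ties are counted against the true triple, and the true entity is excluded from the sampled corruptions.

```python
import random
from typing import Callable, Iterable, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def rank_of_true(true_score: float, corrupted_scores: Iterable[float]) -> int:
    """Rank of the true triple among its corruptions (1 is best, ties count against it)."""
    return 1 + sum(1 for s in corrupted_scores if s >= true_score)


def hits_at_k(
    test_triples: Sequence[Triple],
    entities: Sequence[str],
    score: Callable[[str, str, str], float],
    k: int = 10,
    num_negatives: int = 50,
    seed: int = 0,
) -> float:
    """Fraction of head and tail ranking queries in which the true triple ranks within the top k."""
    rng = random.Random(seed)
    hits = 0
    queries = 0
    for e, r, f in test_triples:
        true_score = score(e, r, f)
        # Sample corrupting entities; requires at least num_negatives + 1 entities.
        heads = rng.sample([x for x in entities if x != e], num_negatives)
        tails = rng.sample([x for x in entities if x != f], num_negatives)
        head_scores = [score(h, r, f) for h in heads]
        tail_scores = [score(e, r, t) for t in tails]
        for corrupted in (head_scores, tail_scores):
            hits += int(rank_of_true(true_score, corrupted) <= k)
            queries += 1
    return hits / queries if queries else 0.0
```

With $50$ sampled corruptions per query, a ranking that places the true triple uniformly at random among the $51$ candidates reaches the top $10$ in roughly one fifth of the queries, which gives a reference point for interpreting the reported scores.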

Table 5 states ReshufflE’s benchmark results over all inductive datasets, as well as their means and standard deviations.
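The summary rows of Table 5 can be recomputed from the per-seed scores. The sketch below copies the Hits@10 values from Table 5 and uses the sample standard deviation from Python's `statistics` module, which appears consistent with the reported values; the function name `summarise` is hypothetical. Because the per-seed scores are rounded to three decimals, recomputed values may differ from the table in the last digit.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

# Per-seed Hits@10 scores (seeds 1 to 3) as listed in Table 5.
HITS_AT_10: Dict[Tuple[str, str], List[float]] = {
    ("FB15k-237", "v1"): [0.751, 0.744, 0.746],
    ("FB15k-237", "v2"): [0.879, 0.892, 0.883],
    ("FB15k-237", "v3"): [0.905, 0.908, 0.897],
    ("FB15k-237", "v4"): [0.918, 0.916, 0.918],
    ("WN18RR", "v1"): [0.713, 0.707, 0.710],
    ("WN18RR", "v2"): [0.727, 0.726, 0.736],
    ("WN18RR", "v3"): [0.614, 0.574, 0.617],
    ("WN18RR", "v4"): [0.693, 0.690, 0.698],
    ("NELL-995", "v1"): [0.630, 0.650, 0.635],
    ("NELL-995", "v2"): [0.874, 0.860, 0.848],
    ("NELL-995", "v3"): [0.871, 0.893, 0.881],
    ("NELL-995", "v4"): [0.816, 0.808, 0.812],
}


def summarise(scores: Dict[Tuple[str, str], List[float]]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map each (dataset, version) pair to (mean, sample standard deviation), rounded to 3 decimals."""
    return {key: (round(mean(vals), 3), round(stdev(vals), 3)) for key, vals in scores.items()}


if __name__ == "__main__":
    for (dataset, version), (avg, sd) in summarise(HITS_AT_10).items():
        print(dataset, version, avg, sd)
```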

References

Abboud, R.; Ceylan, İ. İ.; Lukasiewicz, T.; and Salvatori, T. 2020. BoxE: A box embedding model for knowledge base completion. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Ali, M.; Berrendorf, M.; Hoyt, C. T.; Vermue, L.; Sharifzadeh, S.; Tresp, V.; and Lehmann, J. 2021. PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings. Journal of Machine Learning Research 22(82):1–6.
Anil, A.; Gutiérrez-Basulto, V.; Ibáñez-García, Y.; and Schockaert, S. 2023. Inductive knowledge graph completion with gnns and rules: An analysis. CoRR abs/2308.07942.
Balazevic, I.; Allen, C.; and Hospedales, T. M. 2019. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 5184–5193. Association for Computational Linguistics.
Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2787–2795.
Carlson, A.; Betteridge, J.; Kisiel, B.; Settles, B.; Jr., E. R. H.; and Mitchell, T. M. 2010. Toward an architecture for never-ending language learning. In Fox, M., and Poole, D., eds., Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 1306–1313. AAAI Press.
Charpenay, V., and Schockaert, S. 2024. Capturing knowledge graphs and rules with octagon embeddings. CoRR abs/2401.16270.
Chen, Y.; Mishra, P.; Franceschi, L.; Minervini, P.; Stenetorp, P.; and Riedel, S. 2022. Refactor gnns: Revisiting factorisation-based models from a message-passing perspective. In Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; and Oh, A., eds., Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Galárraga, L. A.; Teflioudi, C.; Hose, K.; and Suchanek, F. M. 2013. AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In Schwabe, D.; Almeida, V. A. F.; Glaser, H.; Baeza-Yates, R.; and Moon, S. B., eds., 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, 413–422. International World Wide Web Conferences Steering Committee / ACM.
Gutiérrez-Basulto, V., and Schockaert, S. 2018. From knowledge graph embedding to ontology embedding? an analysis of the compatibility between vector space representations and rules. In Thielscher, M.; Toni, F.; and Wolter, F., eds., Principles of Knowledge Representation and Reasoning: Proceedings of the Sixteenth International Conference, KR 2018, Tempe, Arizona, 30 October - 2 November 2018, 379–388. AAAI Press.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Leemhuis, M.; Özçep, Ö. L.; and Wolter, D. 2022. Learning with cone-based geometric models and orthologics. Ann. Math. Artif. Intell. 90(11-12):1159–1195.
Mai, S.; Zheng, S.; Yang, Y.; and Hu, H. 2021. Communicative message passing for inductive relation reasoning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, 4294–4302. AAAI Press.
Meilicke, C.; Fink, M.; Wang, Y.; Ruffinelli, D.; Gemulla, R.; and Stuckenschmidt, H. 2018. Fine-grained evaluation of rule- and embedding-based systems for knowledge graph completion. In Vrandecic, D.; Bontcheva, K.; Suárez-Figueroa, M. C.; Presutti, V.; Celino, I.; Sabou, M.; Kaffee, L.; and Simperl, E., eds., The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part I, volume 11136 of Lecture Notes in Computer Science, 3–20. Springer.
Meilicke, C.; Chekol, M. W.; Ruffinelli, D.; and Stuckenschmidt, H. 2019. Anytime bottom-up rule learning for knowledge graph completion. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, 3137–3143. ijcai.org.
Miller, G. A. 1995. Wordnet: A lexical database for english. Commun. ACM 38(11):39–41.
Nickel, M.; Tresp, V.; and Kriegel, H. 2011. A three-way model for collective learning on multi-relational data. In Getoor, L., and Scheffer, T., eds., Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 809–816. Omnipress.
Pavlovic, A., and Sallinger, E. 2023. ExpressivE: A spatio-functional embedding for knowledge graph completion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Sadeghian, A.; Armandpour, M.; Ding, P.; and Wang, D. Z. 2019. DRUM: end-to-end differentiable rule mining on knowledge graphs. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 15321–15331.
Teru, K. K.; Denis, E. G.; and Hamilton, W. L. 2020. Inductive relation prediction by subgraph reasoning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, 9448–9457. PMLR.
Toutanova, K., and Chen, D. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, CVSC 2015, Beijing, China, July 26-31, 2015, 57–66. Association for Computational Linguistics.
Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016. Complex embeddings for simple link prediction. In Balcan, M., and Weinberger, K. Q., eds., Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, 2071–2080. JMLR.org.
Xiong, W.; Hoang, T.; and Wang, W. Y. 2017. Deeppath: A reinforcement learning method for knowledge graph reasoning. In Palmer, M.; Hwa, R.; and Riedel, S., eds., Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, 564–573. Association for Computational Linguistics.
Yang, B.; Yih, W.; He, X.; Gao, J.; and Deng, L. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Bengio, Y., and LeCun, Y., eds., 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Yang, F.; Yang, Z.; and Cohen, W. W. 2017. Differentiable learning of logical rules for knowledge base reasoning. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2319–2328.
Zhang, Z.; Wang, J.; Chen, J.; Ji, S.; and Wu, F. 2021. Cone: Cone embeddings for multi-hop reasoning over knowledge graphs. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 19172–19183.
Zhu, Z.; Zhang, Z.; Xhonneux, L. A. C.; and Tang, J. 2021. Neural bellman-ford networks: A general graph neural network framework for link prediction. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 29476–29490.
