LPFormer: An Adaptive Graph Transformer for Link Prediction

Harry Shomer (Michigan State University, East Lansing, USA), Yao Ma (Rensselaer Polytechnic Institute, Troy, USA), Haitao Mao (Michigan State University, East Lansing, USA), Juanhui Li (Michigan State University, East Lansing, USA), Bo Wu (Colorado School of Mines, Golden, USA), and Jiliang Tang (Michigan State University, East Lansing, USA)
(2024)
Abstract.

Link prediction is a common task on graph-structured data that has seen applications in a variety of domains. Classically, hand-crafted heuristics were used for this task. Heuristic measures are chosen such that they correlate well with the underlying factors related to link formation. In recent years, a new class of methods has emerged that combines the advantages of message-passing neural networks (MPNNs) and heuristic methods. These methods perform predictions by using the output of an MPNN in conjunction with a “pairwise encoding” that captures the relationship between nodes in the candidate link. They have been shown to achieve strong performance on numerous datasets. However, current pairwise encodings often contain a strong inductive bias, using the same underlying factors to classify all links. This limits the ability of existing methods to learn how to properly classify a variety of different links that may form from different factors. To address this limitation, we propose a new method, LPFormer, which attempts to adaptively learn the pairwise encodings for each link. LPFormer models the link factors via an attention module that learns the pairwise encoding that exists between nodes by modeling multiple factors integral to link prediction. Extensive experiments demonstrate that LPFormer can achieve SOTA performance on numerous datasets while maintaining efficiency. The code is available at https://github.com/HarryShomer/LPFormer.

link prediction, graph transformer
Published in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. Copyright: rights retained. DOI: 10.1145/3637528.3672025. ISBN: 979-8-4007-0490-1/24/08. CCS Concepts: Computing methodologies → Machine learning.

1. Introduction

Figure 1. Example of multiple heuristic scores for the candidate links (source, 5), (source, 6), and (source, 7). Each heuristic corresponds to a different LP factor – local (CNs), global (Katz), and feature proximity (Feat-Sim).

Link prediction (LP) attempts to predict unseen edges in a graph. It has been adopted in many applications including recommender systems (Huang et al., 2005), social networks (Daud et al., 2020), and drug discovery (Abbas et al., 2021). Traditionally, hand-crafted heuristics were used to identify new links in the graph (Newman, 2001; Zhou et al., 2009; Adamic and Adar, 2003). Heuristics are often chosen based on factors that typically correlate well with the formation of new links. For example, a popular heuristic is common neighbors (CNs), which assumes that links are more likely to exist between node pairs with more shared neighbors. It has been found that these factors, which we refer to as “LP factors”, often stem from the local and global structural information and feature proximity (Mao et al., 2023). We give an example in Figure 1 that demonstrates different heuristic scores for multiple candidate links. Each heuristic score corresponds to one of the LP factors: CNs for local information, Katz for global, and Feat-Sim for feature proximity. We can observe that the pair (source, 5) has the highest CN and Katz scores of the candidate links, indicating an abundance of local and global structural information between the pair. On the other hand, the feature similarity for (source, 5) is the lowest among the candidate links. This indicates that different LP factors and heuristics make distinct assumptions about why links are formed.

More recently, message passing neural networks (MPNNs) (Gilmer et al., 2017), which are able to learn effective node representations via message passing, have been widely adopted for LP tasks. They predict the existence of a link by combining the node representations of both nodes in the link. However, such a node-centric view is unable to incorporate the pairwise information between the nodes in the link. Because of this, conventional MPNNs have been demonstrated to be poor link predictors due to their limited capability to learn effective and expressive link representations (Zhang et al., 2021a; Srinivasan and Ribeiro, 2019). To address this issue, recent efforts (Zhang and Chen, 2018; Zhu et al., 2021) have attempted to move beyond the node-centric view of traditional MPNNs by equipping them with pairwise information specific to the link being predicted (i.e., the “target link”). This is done by customizing the message passing process to each target link. However, a concern with this approach is that it can be prohibitively expensive (Chamberlain et al., 2022), as message passing needs to be run for each individual target link. This is in contrast to traditional MPNNs, which only run message passing once for all target links.

To overcome these inefficiencies, recent methods (Yun et al., 2021; Chamberlain et al., 2022; Wang et al., 2023) have instead explored ways to inject pairwise information into the model, without individualizing the message passing to each target link. This is done by decoupling the message passing and link-specific pairwise information. By doing so, the message passing only needs to be done once for all target links. To include the pairwise information, these methods, which we refer to as “Decoupled Pairwise MPNNs” (DP-MPNNs), instead learn a “pairwise encoding” to encode the pairwise relationship of the target link. The choice of pairwise encoding is often based on heuristics that correspond to common LP factors (e.g., common neighbors). DP-MPNNs have gained attention as they can achieve promising performance while being much more efficient than methods that customize the message passing mechanism.

However, DP-MPNNs are often limited in the choice of pairwise encoding, using a one-size-fits-all solution for all target links. This has two limitations. (1) The pairwise encoding may fail to consider some integral LP factors. For example, NCNC (Wang et al., 2023) only considers the 1-hop neighborhood when computing the pairwise encoding, thereby ignoring the global structural information. This suggests the need for a pairwise encoding that considers multiple types of LP factors. (2) The pairwise encoding uses the same LP factors for all target links. This assumes that all target links need the same factors, which is not necessarily true. Recently, Mao et al. (2023) have shown that different LP factors are necessary to classify different target links. Even within the same dataset, multiple LP factors are needed to properly predict all target links. This further applies across datasets, where certain factors are more prominent than others. As such, a single fixed pairwise encoding faces considerable challenges when multiple types of LP factors are at play. While one factor may effectively model some target links, it will fail for other target links where those patterns are not present. It is therefore desirable to consider different LP factors for different target links.

These observations motivate us to ask: can we design an efficient method that can adaptively determine which LP factors to incorporate for each individual target link? Essentially, it requires a pairwise encoding that (a) models multiple LP factors, (b) can be tailored to fit each individual target link, and (c) is efficient to calculate. By doing so, we can flexibly adapt the pairwise information based on the existing needs of each target link. To achieve this, we propose LPFormer (Link Prediction TransFormer). LPFormer is a type of graph Transformer (Müller et al., 2023) designed specifically for link prediction. Given a target link $(a,b)$, LPFormer models the pairwise encoding via an attention module that learns how $a$ and $b$ relate in the context of various LP factors. This allows for a more customizable set of pairwise encodings that are specific to each target link. Extensive experiments validate that LPFormer can achieve SOTA on a variety of benchmark datasets. We further demonstrate that LPFormer is better at modeling several types of LP factors, highlighting its adaptability, while also maintaining efficiency on denser graphs.

2. Background

2.1. Related Work

Link prediction (LP) aims to model how links are formed in a graph. The process by which links are formed, i.e., link formation, is often governed by a set of underlying factors (Barabási et al., 2002; Liben-Nowell and Kleinberg, 2003). We refer to these as “LP factors”. Two categories of methods are used for modeling these factors: heuristics and MPNNs. We describe each class of methods below and then discuss existing graph transformers.

Heuristics for Link Prediction. Heuristic methods (Newman, 2001; Zhou et al., 2009) attempt to explicitly model the LP factors via hand-crafted measures. Recently, Mao et al. (2023) have shown that there are three main factors that correlate with the existence of a link: (1) local structural information, (2) global structural information, and (3) feature proximity. Local structural information only considers the immediate neighborhood of the target link. Representative methods include Common Neighbors (CN) (Newman, 2001), Adamic Adar (AA) (Adamic and Adar, 2003), and Resource Allocation (RA) (Zhou et al., 2009). They are predicated on the assumption that nodes that share a greater number of neighbors exhibit a higher probability of forming connections. Global structural information further considers the global structure of the graph. Such methods include Katz (Katz, 1953) and Personalized PageRank (PPR) (Brin and Page, 1998). These methods posit that nodes interconnected by a higher number of paths are deemed to have larger similarity and, therefore, are more likely to form connections. Feature proximity assumes that nodes with more similar features connect (Murase et al., 2019). Previous work (Nickel et al., 2014; Zhao et al., 2017) has shown that leveraging the node features is helpful in predicting links. Lastly, we note that Mao et al. (2023) have recently shown that to properly predict a wide variety of links, it is integral to incorporate all three of these factors.
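As a concrete illustration of the local heuristics named above, the following minimal sketch computes CN, AA, and RA for one node pair using their standard textbook definitions. The toy graph `adj` and the function name `local_heuristics` are hypothetical and chosen only for this example.

```python
import math

def local_heuristics(adj, a, b):
    """Return (CN, AA, RA) scores for the pair (a, b).

    adj maps each node to the set of its neighbors (undirected graph).
    """
    common = adj[a] & adj[b]
    cn = len(common)                                        # Common Neighbors: count
    aa = sum(1.0 / math.log(len(adj[u])) for u in common)   # Adamic-Adar: 1 / log(degree)
    ra = sum(1.0 / len(adj[u]) for u in common)             # Resource Allocation: 1 / degree
    return cn, aa, ra

# Hypothetical toy graph with edges (0,2), (0,3), (1,2), (1,3), (3,4).
adj = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1, 4}, 4: {3}}
cn, aa, ra = local_heuristics(adj, 0, 1)
print(cn, aa, ra)
```

All three scores look only at the shared 1-hop neighborhood; AA and RA additionally down-weight common neighbors with high degree.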

MPNNs for Link Prediction. Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017) aim to learn node representations via the message passing mechanism. Traditional MPNNs have been used for LP including GCN (Kipf and Welling, 2016a), SAGE (Hamilton et al., 2017), and GAE (Kipf and Welling, 2016b). However, they have been shown to be suboptimal for LP as they aren’t expressive enough to capture important pairwise patterns (Zhang et al., 2021b; Srinivasan and Ribeiro, 2019). SEAL (Zhang and Chen, 2018) and NBFNet (Zhu et al., 2021) try to address this by customizing the message passing process to each target link. This allows for the message passing to learn pairwise information specific to the target link. However, these methods have been shown to be unduly expensive as they require a separate round of message passing for each target link. As such, recent methods have been proposed to instead decouple the message passing and pairwise information (Yun et al., 2021; Chamberlain et al., 2022; Wang et al., 2023), reducing the time needed to do message passing. Such methods include NCN/NCNC (Wang et al., 2023) which exploit the common neighbor information and BUDDY (Chamberlain et al., 2022) and Neo-GNN (Yun et al., 2021) which consider the global structural information.

Graph Transformers. Recent work has attempted to extend the original Transformer (Vaswani et al., 2017) architecture to graph-structured data. Graphormer (Ying et al., 2021) learns node representations by attending all nodes to each other. To properly model the structural information, they propose to use multiple types of structural encodings (i.e., structural, centrality, and edge). SAN (Kreuzer et al., 2021) further considers the use of the Laplacian positional encodings (LPEs) to enhance the learnt structural information. Alternatively, TokenGT (Kim et al., 2022) considers all nodes and edges as tokens in the sequence when performing attention. Due to the large complexity of these models, they are unable to scale to larger graphs. To address this, several graph transformers (Chen et al., 2022; Wu et al., 2022) have been proposed for node classification that attempt to efficiently attend to the graph. However, while some work (Chen et al., 2021; Pahuja et al., 2023) has formulated transformers for knowledge graph completion, to our knowledge, there are no graph transformers designed specifically for LP on uni-relational graphs.

2.2. Preliminaries

We denote a graph as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges in $\mathcal{G}$, respectively. The adjacency matrix is represented as $A \in \mathbb{R}^{\lvert V \rvert \times \lvert V \rvert}$. The $d$-dimensional node features are represented by the matrix $X \in \mathbb{R}^{\lvert V \rvert \times d}$. The set of neighbors for a node $v$ is given by $\mathcal{N}(v)$. The set of overlapping neighbors between two nodes $a$ and $b$, i.e., the common neighbors (CNs), is expressed by $\mathcal{N}^{\text{CN}}_{(a,b)}$. We further denote the set of nodes that are 1-hop neighbors of only one of $a$ or $b$ as $\mathcal{N}^{1}_{(a,b)}$ and the nodes that are ${>}1$-hop from both nodes as $\mathcal{N}^{>1}_{(a,b)}$. Lastly, the personalized PageRank (PPR) score for a root node $v$ and an arbitrary node $u$ is given by $\text{ppr}(v,u)$.
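To make the three node sets concrete, the sketch below partitions the nodes of a small graph relative to a pair. The adjacency dictionary, the function name `partition_nodes`, and the choice to leave the two endpoints out of every set are illustrative assumptions rather than details taken from the paper.

```python
def partition_nodes(adj, a, b):
    """Split all nodes other than a and b by their relation to the pair (a, b)."""
    others = set(adj) - {a, b}
    na = adj[a] - {a, b}          # neighbors of a, endpoints excluded
    nb = adj[b] - {a, b}          # neighbors of b, endpoints excluded
    cn = na & nb                  # common neighbors of a and b
    one_hop = (na | nb) - cn      # 1-hop neighbors of exactly one of a, b
    far = others - na - nb        # more than one hop away from both a and b
    return cn, one_hop, far

# Hypothetical toy graph with edges (0,2), (0,3), (1,2), (1,4), (3,5).
adj = {0: {2, 3}, 1: {2, 4}, 2: {0, 1}, 3: {0, 5}, 4: {1}, 5: {3}}
cn, one_hop, far = partition_nodes(adj, 0, 1)
print(cn, one_hop, far)
```

Here node 2 is the only common neighbor of the pair (0, 1), nodes 3 and 4 are each adjacent to exactly one endpoint, and node 5 is more than one hop from both.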

3. The Proposed Framework

Figure 2. An overview of LPFormer. (1) Encode the nodes via an MPNN. (2) For a given target link, we determine which nodes to attend to ($\hat{\mathcal{N}}(a,b)$) via the PPR-based thresholding technique in Eq. (10). (3) The pairwise encoding is computed by attending to each node, $u \in \hat{\mathcal{N}}(a,b)$, using the feature and relative positional encoding $\mathbf{rpe}_{(a,b,u)}$. (4) The pairwise encoding, node representations, and counts of different node types are concatenated and used to compute the final probability of the target link existing.

In Section 1, we highlighted the importance of adaptively modeling multiple types of LP factors. However, current methods that use pairwise encodings, i.e., DP-MPNNs, struggle to appropriately achieve this goal. This is due to two issues: (1) They only attempt to model a subset of the potential LP factors (e.g., only local structural information), limiting their ability to model multiple factors. (2) They use a one-size-fits-all approach in regard to pairwise encoding, using the same combination of LP factors for each target link. These issues strongly limit the potential of such methods to properly model a variety of different target links. To overcome these problems, we propose LPFormer, a new transformer-based method that can adaptively customize the pairwise information for each target link by considering a variety of different LP factors in an efficient manner.

3.1. A General View of Pairwise Encodings

Recent MPNNs for LP use a decoupled strategy to include the pairwise information (Chamberlain et al., 2022; Wang et al., 2023; Yun et al., 2021). These methods, DP-MPNNs, predict the existence of a link $(a,b)$ via both the node representations and a pairwise encoding $s(a,b)$. They follow the formulation below:

$$
\begin{aligned}
H &= \text{MPNN}(A, X), \\
p(a,b) &= \sigma\left(\text{MLP}\left(\mathbf{h}_a \odot \mathbf{h}_b \,\Vert\, s(a,b)\right)\right),
\end{aligned}
\tag{1}
$$

where $\mathbf{h}_i$ is the representation of node $i$ encoded by the MPNN and $\Vert$ denotes concatenation. Various DP-MPNNs adopt different ways to model the pairwise encoding. For example, NCN (Wang et al., 2023) models the pairwise encoding $s(a,b)$ as the summation of the node representations of the CNs. The definitions of $s(a,b)$ for other prominent DP-MPNNs can be found in Appendix A. The pairwise encodings in these existing methods are typically manually selected or extracted from the graph, which limits the LP factors they can cover. For example, $s(a,b)$ in NCN and NCNC only captures the local structural information. BUDDY (Chamberlain et al., 2022) ignores the node features when computing the pairwise encoding. To flexibly model multiple types of LP factors, we propose a general formulation for pairwise encodings as follows,

$$
s(a,b) = \sum_{u \in \mathcal{V}} w(a,b,u) \odot h(a,b,u),
\tag{2}
$$

where $w(a,b,u)$ measures the importance of node $u$ to $(a,b)$, and $h(a,b,u)$ is the encoding of node $u$ relative to $(a,b)$. By considering which nodes should be considered for $(a,b)$ and how they are related to the node pair, Eq. (2) can model different LP factors by manually defining $w(a,b,u)$ and $h(a,b,u)$. In particular, we demonstrate how the heuristic methods corresponding to different LP factors can fit into this framework.

Common Neighbors (CNs) (Newman, 2001): CNs considers the local structural information and is defined for a pair of nodes $(a,b)$ as $\mathcal{N}^{\text{CN}}_{(a,b)} = \mathcal{N}(a) \cap \mathcal{N}(b)$. Eq. (2) recovers the number of CNs when $h(a,b,u) = 1$ and:

$$
w(a,b,u) =
\begin{cases}
1, & \text{when } u \in \mathcal{N}(a) \cap \mathcal{N}(b) \\
0, & \text{else}
\end{cases}
\tag{3}
$$
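The way the common-neighbor heuristic falls out of the general formulation in Eq. (2) can be sketched in a few lines. This is a minimal illustration restricted to scalar-valued $w$ and $h$; the toy graph and the names `pairwise_encoding`, `w_cn`, and `h_one` are hypothetical.

```python
def pairwise_encoding(nodes, w, h, a, b):
    """Eq. (2) for scalar w and h: sum over all nodes u of w(a,b,u) * h(a,b,u)."""
    return sum(w(a, b, u) * h(a, b, u) for u in nodes)

# Hypothetical toy graph with edges (0,2), (0,3), (0,4), (1,2), (1,3), (1,5).
adj = {0: {2, 3, 4}, 1: {2, 3, 5}, 2: {0, 1}, 3: {0, 1}, 4: {0}, 5: {1}}

def w_cn(a, b, u):
    # Eq. (3): weight 1 for common neighbors of a and b, 0 for every other node.
    return 1 if u in adj[a] and u in adj[b] else 0

def h_one(a, b, u):
    # Constant node encoding h(a,b,u) = 1.
    return 1

score = pairwise_encoding(adj.keys(), w_cn, h_one, 0, 1)
print(score)
```

Swapping in different `w` and `h` functions changes which LP factor the same sum expresses, which is the point of the general formulation.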

Katz Index (Katz, 1953): The Katz index models the global structural information. It is defined as a weighted summation of the number of paths of different lengths connecting $a$ and $b$, where paths of length $l$ are weighted by $\beta^{l}$ for a decay weight $\beta \in [0,1]$,

$$
\text{Katz}(a,b) = \sum_{l=1}^{\infty} \beta^{l} A_{a,b}^{l}.
$$

This is equivalent to Eq. (2) where $w(a,b,u) = \sum_{l=1}^{\infty} \beta^{l} e_a^{T} A^{l}$ and

$$
h(a,b,u) =
\begin{cases}
e_b^{T}, & \text{when } u = b \\
\mathbf{0}, & \text{else}
\end{cases},
$$

where $e_i \in \mathbb{B}^{\lvert \mathcal{V} \rvert}$ is a one-hot vector for a node $i$.
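The Katz decomposition above can be checked numerically on a tiny graph. The sketch below is an illustration under stated assumptions: the infinite sum is truncated at a hypothetical `max_len`, the function name `katz_row` is made up for this example, and entries of $A^l$ are interpreted as counts of length-$l$ walks (which may revisit nodes).

```python
def katz_row(A, a, beta, max_len):
    """Truncated row vector sum_{l=1..max_len} beta^l * e_a^T A^l, one entry per node."""
    n = len(A)
    walk = [1.0 if i == a else 0.0 for i in range(n)]   # e_a^T
    total = [0.0] * n
    for l in range(1, max_len + 1):
        # walk <- walk @ A: number of length-l walks from a to every node
        walk = [sum(walk[i] * A[i][j] for i in range(n)) for j in range(n)]
        total = [t + (beta ** l) * x for t, x in zip(total, walk)]
    return total

# Path graph 0 - 1 - 2 as a dense adjacency matrix.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
row = katz_row(A, a=0, beta=0.1, max_len=4)
katz_0_2 = row[2]   # taking the product with the one-hot e_b selects entry b
print(katz_0_2)
```

For this graph there is one walk of length 2 and two walks of length 4 from node 0 to node 2, so the truncated score is $\beta^2 + 2\beta^4$.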

Feature Similarity: The feature similarity of the pair of nodes $(a,b)$ is expressed by $\text{dis}(\mathbf{x}_a, \mathbf{x}_b)$ where $\mathbf{x}_a$ are the node features of node $a$ and $\text{dis}(\cdot)$ is a distance function (e.g., Euclidean distance). This can be rewritten as Eq. (2) by substituting $w(a,b,u) = \text{dis}(\mathbf{x}_a, \mathbf{x}_u)$ and $h(a,b,u) = e_b^{T}$.

These examples demonstrate that the general formulation can indeed model many different LP factors including local and global structural information and feature proximity. We further show in Appendix B that Eq. (2) can model a variety of additional LP factors, including RA (Zhou et al., 2009) and the pairwise encodings used in NCN/NCNC (Wang et al., 2023) and Neo-GNN (Yun et al., 2021). However, fitting these methods into the formulation in Eq. (2) requires manually defining both $w(a,b,u)$ and $h(a,b,u)$. This constrains the information represented by $s(a,b)$ based on the choice of design. Motivated by this, in the next section we introduce our method, which does not rely on handcrafting $w(a,b,u)$ and $h(a,b,u)$.

3.2. Modeling Pairwise Encodings via Attention

In Section 3.1, we introduced a general formulation for pairwise encodings in Eq. (2), which is able to capture a variety of different LP factors. However, it requires manually defining both terms in the equation. This limits our ability to customize the pairwise information to each target link. As such, we aim to move beyond a one-size-fits-all pairwise encoding and enable the model to produce a customized pairwise encoding for each target link. This allows the model to handle more realistic graphs, which often contain multiple prominent LP factors for different target links, as shown by Mao et al. (2023).

In particular, we consider the following question: How can we model Eq. (2) such that it can customize the used LP factors to each target link? We consider parameterizing both $w(a,b,u)$ and $h(a,b,u)$. This allows us to learn how to personalize them to each target link. To achieve this, we leverage softmax attention (Bahdanau et al., 2015). This is due to its ability to dynamically learn the relevance of different nodes to the target link. As such, for multiple target links, it can emphasize the contributions of different nodes, thereby flexibly modeling different LP factors. We note that since the attention is between different sequences (i.e., a target link and nodes), it can be considered a form of cross attention (Vaswani et al., 2017).

To enhance the adaptability of the pairwise encoding for various links, it is essential to incorporate various types of information. This allows the attention mechanism to discern and prioritize relevant information for each target link, facilitating the effective modeling of diverse LP factors. In particular, we consider two types of information. The first is the feature information. This includes the feature representation of both nodes in the target link and the node being attended to. The node features are included due to their role in link formation and relationship to structural information (Murase et al., 2019). Second, we consider the relative positional information. The relative positional information reflects the relative position in the graph of a node $u$ to the target link $(a,b)$ in the local and global structural context. Due to the importance of local and global structural information (Dong et al., 2017; Huang et al., 2015), it is vital to properly encode both. By including both the structural and feature information, we are able to cover the space of potential LP factors (see Section 2.1).

We denote the feature representation of a node $u$ as $\mathbf{h}_u$ and the relative positional encoding (RPE) as $\mathbf{rpe}_{(a,b,u)}$. The node importance $w(a,b,u)$ is modeled via attention as follows:

$$
\begin{aligned}
\tilde{w}(a,b,u) &= \phi\left(\mathbf{h}_a, \mathbf{h}_b, \mathbf{h}_u, \mathbf{rpe}_{(a,b,u)}\right), \\
w(a,b,u) &= \frac{\exp(\tilde{w}(a,b,u))}{\sum_{v \in \bar{\mathcal{V}}(a,b)} \exp(\tilde{w}(a,b,v))},
\end{aligned}
\tag{4}
$$

where $\bar{\mathcal{V}}(a,b) = \mathcal{V} \setminus \{a,b\}$. The attention weight $w(a,b,u)$ can be considered as the impact of a node $u$ on $(a,b)$ relative to all nodes in $\mathcal{G}$. This allows the model to emphasize different LP factors for each target link. The node encoding $h(a,b,u)$ includes the features of node $u$ in conjunction with the RPE and is defined as:

$$
h(a,b,u) = \mathbf{W}\left[\mathbf{h}_u \,\Vert\, \mathbf{rpe}_{(a,b,u)}\right].
\tag{5}
$$

By substituting Eq. (4) and Eq. (5) into Eq. (2) we can compute the pairwise information $s(a,b)$. We further define $\phi(\cdot)$ in Eq. (4) as the GATv2 (Brody et al., 2022) attention mechanism. The detailed formulation is given in Appendix D. The feature representations $\mathbf{h}_i$ are computed via an MPNN. We use GCN (Kipf and Welling, 2017) in this work. However, it is unclear how to properly encode the RPE of a node $u$ relative to $(a,b)$, $\mathbf{rpe}_{(a,b,u)}$. We aim to design the RPE to capture both the local and global structural relationship between the node and target link while also being efficient to calculate. In the next section, we discuss our solution for modeling $\mathbf{rpe}_{(a,b,u)}$.
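The flow from attention scores to the pairwise encoding can be sketched as follows. This is a rough illustration, not the authors' implementation: the scoring function is a generic GATv2-style form (a vector applied to a LeakyReLU of a linear map of the concatenated inputs), which is an assumption since the exact formulation is in Appendix D and not reproduced here. All names, dimensions, and the random inputs standing in for MPNN outputs and RPEs are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return [x_i if x_i > 0 else slope * x_i for x_i in x]

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pairwise_encoding(h, rpe, a, b, candidates, W_att, v_att, W_val):
    scores, values = [], []
    for u in candidates:
        att_in = h[a] + h[b] + h[u] + rpe[u]     # list concatenation of the four inputs
        hidden = leaky_relu(matvec(W_att, att_in))
        scores.append(sum(v_i * z_i for v_i, z_i in zip(v_att, hidden)))  # unnormalized score
        values.append(matvec(W_val, h[u] + rpe[u]))                       # node encoding, Eq. (5)
    weights = softmax(scores)                    # normalized attention weights, Eq. (4)
    out_dim = len(values[0])
    # Weighted sum of node encodings over the candidate nodes, Eq. (2).
    s_ab = [sum(w * val[k] for w, val in zip(weights, values)) for k in range(out_dim)]
    return s_ab, weights

random.seed(0)
num_nodes, d, d_rpe, d_hid, d_out = 6, 4, 2, 5, 3

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

h = [rand_vec(d) for _ in range(num_nodes)]      # stand-in for MPNN node representations
a, b = 0, 1
candidates = [u for u in range(num_nodes) if u not in (a, b)]
rpe = {u: rand_vec(d_rpe) for u in candidates}   # stand-in for rpe_(a,b,u)
W_att = [rand_vec(3 * d + d_rpe) for _ in range(d_hid)]
v_att = rand_vec(d_hid)
W_val = [rand_vec(d + d_rpe) for _ in range(d_out)]

s_ab, weights = attention_pairwise_encoding(h, rpe, a, b, candidates, W_att, v_att, W_val)
print(weights, s_ab)
```

Because the weights are recomputed for every target link from its own inputs, two different links can place their attention mass on very different nodes even though the parameters are shared.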

3.3. PPR-Based Relative Positional Encodings

In this section, we introduce our strategy for computing the RPE of a node $u$ relative to a target link $(a,b)$. Intuitively, we want the RPE to reflect the positional relationship between $u$ and $(a,b)$ such that different types of information (i.e., local vs. global) are encoded differently. Using Figure 1 as an example, since node 3 is a CN of (source, 5) we expect it to have a much different relationship to the target link than node 6, which is a 2-hop neighbor of both nodes. An enticing option is to use the double radius node labeling (DRNL) trick introduced by Zhang and Chen (2018). However, Chamberlain et al. (2022) have shown it to be prohibitively expensive to calculate for larger graphs. Furthermore, existing RPEs are typically infeasible to calculate on larger graphs as they often rely on pairwise distances or the eigenvectors of the Laplacian (Rampášek et al., 2022).

As such, we seek an RPE that can both distinguish the relationship of different nodes to the target link while also being efficient to calculate. To motivate our RPE design, we draw inspiration from the following Proposition.

Proposition 1.

Consider a target link $(a,b)$ and a node $u \in \mathcal{V} \setminus \{a,b\}$. The PPR (Brin and Page, 1998) score of a root node $i$ and target node $j$ with teleportation probability $\alpha$ is denoted by $\text{ppr}(i,j)$. Let $r_a^k(u)$ be the probability of a walk of length $k$ beginning at node $a$ and terminating at $u$. We define $r_{a,b}^k(u) := r_a^k(u) + r_b^k(u)$. We also define a weight $\gamma^k := \alpha(1-\alpha)^k$ for all walks of length $k$. The PPR scores, $\text{ppr}(a,u)$ and $\text{ppr}(b,u)$, along with the random walk probabilities of disparate lengths, are interconnected through the following relationship.

(6) $\Gamma(a,b,u) = \text{ppr}(a,u) + \text{ppr}(b,u) = \sum_{k=0}^{\infty} \gamma^{k} r_{a,b}^{k}(u).$

The detailed proof is given in Appendix C. From Proposition 1, we can make the following observations: (1) The PPR scores encode the weighted sum of the probabilities of different length random walks connecting two nodes. (2) Walks of shorter length are given higher importance, as evidenced by the dampening factor $\gamma^k = \alpha(1-\alpha)^k$, which decays as $k$ increases. These observations imply that a larger value of $\Gamma(a,b,u)$ correlates with the existence of many shorter walks connecting node $u$ to both nodes in the target link $(a,b)$.
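The identity in Eq. (6) can be checked numerically on a small graph. The following is a minimal sketch, not the authors' implementation: the toy graph, the helper names, and the truncation at 200 terms are all assumptions made for illustration. It compares the truncated walk-probability series with the PPR vector obtained as the fixed point of $\pi = \alpha e_{root} + (1-\alpha)\pi P$.

```python
def transition_matrix(adj):
    """Row-normalise an adjacency list into random-walk transition probabilities."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            P[i][j] = 1.0 / len(nbrs)
    return P


def step(dist, P):
    """One random-walk step, i.e. the vector-matrix product dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def ppr_by_walk_sum(P, root, alpha, num_terms=200):
    """Truncated series sum_k alpha * (1 - alpha)^k * r_root^k for every node."""
    n = len(P)
    walk = [1.0 if i == root else 0.0 for i in range(n)]  # r^0: the walk starts at root
    total = [0.0] * n
    weight = alpha  # gamma^0
    for _ in range(num_terms):
        total = [t + weight * w for t, w in zip(total, walk)]
        walk = step(walk, P)      # r^{k+1}
        weight *= 1.0 - alpha     # gamma^{k+1}
    return total


def ppr_by_fixed_point(P, root, alpha, num_iters=200):
    """PPR as the fixed point of pi = alpha * e_root + (1 - alpha) * pi P."""
    n = len(P)
    e = [1.0 if i == root else 0.0 for i in range(n)]
    pi = e[:]
    for _ in range(num_iters):
        moved = step(pi, P)
        pi = [alpha * e[i] + (1.0 - alpha) * moved[i] for i in range(n)]
    return pi


# Hypothetical 4-node undirected graph; target link (a, b) = (0, 1), context node u = 2.
toy_adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
P = transition_matrix(toy_adj)
alpha, a, b, u = 0.15, 0, 1, 2
gamma_series = ppr_by_walk_sum(P, a, alpha)[u] + ppr_by_walk_sum(P, b, alpha)[u]
gamma_ppr = ppr_by_fixed_point(P, a, alpha)[u] + ppr_by_fixed_point(P, b, alpha)[u]
print(abs(gamma_series - gamma_ppr) < 1e-9)
```

Both routes agree up to the truncation error, which shrinks geometrically at rate $(1-\alpha)$ per additional term.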

Therefore, the PPR scores can be used as an intuitive and useful method to understand the structural relationship between node $u$ and both nodes in the target link $(a,b)$. If both scores, $\text{ppr}(a,u)$ and $\text{ppr}(b,u)$, are high, there exists a high probability that many shorter walks connect $u$ to both nodes in the target link. This implies that node $u$ has a stronger impact on the nodes in the target link. On the other hand, if both PPR scores are low, there is likely very little relationship between $u$ and the target link. This allows for a convenient way of differentiating how a node structurally relates to the target link. Furthermore, we note that the PPR matrix can be efficiently pre-computed using the algorithm introduced by Andersen et al. (2006), allowing for easy computation and use.
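To give a sense of why such pre-computation is cheap, the sketch below implements a simplified, non-lazy local push procedure in the spirit of Andersen et al. (2006). It is an illustrative assumption rather than the exact algorithm from that work or the one used here: the function name, the queue-based scheduling, and the default values of `alpha` and `eps` are all hypothetical. A node is pushed only while its residual is at least `eps` times its degree, so work stays concentrated near the root.

```python
from collections import deque


def approx_ppr_push(adj, root, alpha=0.15, eps=1e-4):
    """Approximate the PPR vector of `root` with local push operations.

    adj maps each node to a list of neighbours (undirected, no isolated nodes).
    Returns (p, r): the sparse PPR estimate and the residual mass not yet pushed.
    """
    p = {}
    r = {root: 1.0}
    queue = deque([root])
    queued = {root}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        mass = r.get(u, 0.0)
        deg = len(adj[u])
        if mass < eps * deg:
            continue  # residual too small to be worth pushing
        # Keep an alpha fraction at u and spread the rest evenly over its neighbours.
        p[u] = p.get(u, 0.0) + alpha * mass
        r[u] = 0.0
        share = (1.0 - alpha) * mass / deg
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(adj[v]) and v not in queued:
                queue.append(v)
                queued.add(v)
    return p, r


# Hypothetical toy graph, used only to exercise the function.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
estimate, residual = approx_ppr_push(toy_graph, root=0)
print(sorted(estimate.items()))
```

The estimate and the residual always sum to one, and a smaller `eps` trades more pushes for a tighter approximation.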

Following this idea, to calculate the RPE of a node $u$, we use the PPR scores of $u$ relative to both nodes in the target link $(a,b)$. Instead of considering the sum of PPR scores as in Proposition 1, we further parameterize $\Gamma(\cdot)$ via an MLP,

(7) $\mathbf{rpe}_{(a,b,u)} = \text{MLP}\left(\text{ppr}(a,u), \text{ppr}(b,u)\right).$

Introducing learnable parameters to $\Gamma(\cdot)$ allows the model to learn the importance of individual PPR scores and how they interact with each other. To ensure that Eq. (7) is invariant to the order of the nodes in the target link, i.e., $(a,b)$ and $(b,a)$, we further set the RPE to be equal to the summation of the representations given by both $(a,b)$ and $(b,a)$:

(8) $\overline{\mathbf{rpe}}_{(a,b,u)} = \mathbf{rpe}_{(a,b,u)} + \mathbf{rpe}_{(b,a,u)}.$

However, a concern with Eq. (8) is that it is not guaranteed to be able to distinguish certain types of nodes from each other. For example, it is necessary to clearly distinguish CNs from other nodes due to their important role in link formation (Newman, 2001). To overcome this issue, we fit three separate MLPs for when $u$ is: a CN of $(a,b)$, a 1-hop neighbor of either $a$ or $b$, or a >1-hop neighbor of both $a$ and $b$. This ensures that we can properly distinguish between these three types of nodes. We verify the effectiveness of this design in Section 4.4. Lastly, we note that while other work (Mialon et al., 2021; Li et al., 2020) has considered the use of random-walk based positional encodings, they are only designed for use on the node-level and are unable to be used for link-level tasks like LP.
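The construction in Eqs. (7) and (8), together with the node-type-specific MLPs, can be sketched as follows. This is an illustration under stated assumptions, not the released code: the two-layer ReLU MLP, its random initialisation, the helper names, and the PPR values are all hypothetical. The point is that summing the encodings for both input orders yields the same vector whether the link is written as $(a,b)$ or $(b,a)$.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialised two-layer MLP (illustrative stand-in for a trained one)."""
    W1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
    W2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    b2 = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return W1, b1, W2, b2


def mlp_forward(params, x):
    """Linear layer, ReLU, linear layer."""
    W1, b1, W2, b2 = params
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]


def node_type(u, a, b, adj):
    """Classify u as 'CN', '1-Hop', or '>1-Hop' relative to the link (a, b)."""
    in_a, in_b = u in adj[a], u in adj[b]
    if in_a and in_b:
        return "CN"
    if in_a or in_b:
        return "1-Hop"
    return ">1-Hop"


def symmetric_rpe(mlps, ppr, adj, a, b, u):
    """Eq. (8): rpe(a, b, u) + rpe(b, a, u), using the MLP for u's node type."""
    mlp = mlps[node_type(u, a, b, adj)]
    forward = mlp_forward(mlp, [ppr[a][u], ppr[b][u]])
    backward = mlp_forward(mlp, [ppr[b][u], ppr[a][u]])
    return [f + g for f, g in zip(forward, backward)]


# Hypothetical graph and made-up PPR scores, for illustration only.
rng = random.Random(0)
mlps = {kind: make_mlp(2, 4, 3, rng) for kind in ("CN", "1-Hop", ">1-Hop")}
adj = {0: {2, 3}, 1: {2, 4}, 2: {0, 1}, 3: {0}, 4: {1}}
ppr = {0: {2: 0.20, 3: 0.15, 4: 0.02}, 1: {2: 0.25, 3: 0.01, 4: 0.30}}
print(symmetric_rpe(mlps, ppr, adj, 0, 1, 3) == symmetric_rpe(mlps, ppr, adj, 1, 0, 3))
```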

3.4. Efficiently Attending to the Graph Context

The proposed attention mechanism in Section 3.2 attends to all nodes in the graph, sans those in the link itself. This makes it difficult to scale to large graphs. Motivated by selective (Maruf et al., 2019) and sparse (Correia et al., 2019) attention, we opt to attend to only a small portion of the nodes.

At a high level, we are interested in determining a subset of nodes $\hat{\mathcal{N}}(a,b) \subseteq \mathcal{V}$ to attend to for the target link $(a,b)$. Our goal is to choose the set of nodes $\hat{\mathcal{N}}(a,b)$ such that they (a) are few in number to improve scalability and (b) provide important contextual information to the pair $(a,b)$ to best learn the pairwise information. This can be achieved by considering only those nodes $u$ whose importance to the target link $(a,b)$ is high. Formally, we can write this as follows, where $\mathcal{I}(a,b,u)$ is a function that denotes the importance of a node $u$ to the target link $(a,b)$:

(9) $\hat{\mathcal{N}}(a,b) = \{u \in \mathcal{V} \setminus \{a,b\} \;|\; \mathcal{I}(a,b,u) > \eta\}.$

The threshold $\eta$ allows us to distinguish those nodes that are sufficiently important to the target link. This allows for a simple and efficient way of determining the set $\hat{\mathcal{N}}(a,b)$. However, what do we use to model the importance $\mathcal{I}(a,b,u)$? For ease of optimization and better efficiency, we avoid parameterizing the function $\mathcal{I}(a,b,u)$. Instead, we want to choose a metric that can properly serve as a proxy for the importance of a node $u$ to $(a,b)$ while also being concentrated in a small subset of nodes. Such a metric will allow Eq. (9) to choose a small but influential set of nodes to attend to.

A measure that satisfies both criteria is Personalized PageRank (PPR) (Brin and Page, 1998). In Section 3.3 we discussed that the PPR score can serve as a good tool to model the influence of one node on another. Furthermore, existing work (Gleich et al., 2015; Nassar et al., 2015; Andersen et al., 2006) shows that the PPR scores tend to be highly localized in a small subset of nodes. Therefore, by making $\mathcal{I}(a,b,u)$ contingent on the PPR scores of $(a,u)$ and $(b,u)$ we can extract a small but important set of nodes to attend to for the target link.

Following this idea, for a target link $(a,b)$, we keep all nodes whose PPR score is above some threshold $\eta$ relative to both nodes in the target link. As such, we only keep a node $u$ if it is related in some capacity to at least one of the nodes in the target link. Similarly to Section 3.3, we treat CN, 1-Hop, and >1-Hop nodes differently by applying a different threshold for them. The filtered node set for each category of nodes is given by:

(10) $\hat{\mathcal{N}}^{\pi}_{(a,b)} = \{u \in \mathcal{N}^{\pi}_{(a,b)} \;|\; \text{ppr}(a,u) > \eta^{\pi},\; \text{ppr}(b,u) > \eta^{\pi}\},$

where $\hat{\mathcal{N}}^{\pi}_{(a,b)}$ is the filtered node set for all nodes of the type $\pi \in \{\text{CN}, 1\text{-Hop}, {>}1\text{-Hop}\}$ and $\eta^{\pi}$ is the corresponding PPR threshold. We note that while other work (Bojchevski et al., 2020; Ying et al., 2018) has used PPR to filter the nodes on the node-level, no existing work has done so on the link-level.
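The filtering rule in Eq. (10) can be sketched in a few lines. This is a minimal illustration under assumptions: the function name, the dictionary-based PPR lookup, the toy graph, and the threshold values are hypothetical, and the loop over all nodes is for clarity only (a practical implementation would presumably iterate over the sparse non-zero PPR entries instead).

```python
def filter_nodes(ppr, adj, a, b, thresholds):
    """Return {node type: set of kept nodes} for the target link (a, b), per Eq. (10).

    ppr[x] maps a node u to ppr(x, u); missing entries are treated as 0.
    thresholds maps each node type to its PPR threshold eta^pi.
    """
    kept = {"CN": set(), "1-Hop": set(), ">1-Hop": set()}
    for u in adj:
        if u in (a, b):
            continue  # the endpoints of the link are never attended to
        in_a, in_b = u in adj[a], u in adj[b]
        if in_a and in_b:
            kind = "CN"
        elif in_a or in_b:
            kind = "1-Hop"
        else:
            kind = ">1-Hop"
        eta = thresholds[kind]
        # Both PPR scores must exceed the type-specific threshold.
        if ppr[a].get(u, 0.0) > eta and ppr[b].get(u, 0.0) > eta:
            kept[kind].add(u)
    return kept


# Hypothetical graph and made-up PPR scores, for illustration only.
adj = {0: {2, 3}, 1: {2, 4}, 2: {0, 1}, 3: {0, 5}, 4: {1}, 5: {3}}
ppr = {
    0: {2: 0.20, 3: 0.15, 4: 0.02, 5: 0.03},
    1: {2: 0.20, 3: 0.02, 4: 0.15, 5: 0.001},
}
thresholds = {"CN": 0.0, "1-Hop": 0.01, ">1-Hop": 0.01}
kept = filter_nodes(ppr, adj, a=0, b=1, thresholds=thresholds)
print({kind: sorted(nodes) for kind, nodes in kept.items()})
```

In this toy example node 5 is dropped because its PPR score relative to node 1 falls below the >1-Hop threshold, while the common neighbor and both 1-Hop nodes are retained.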

We corroborate this design by demonstrating that LPFormer can achieve SOTA performance in LP (Section 4.2) while achieving a faster runtime than the second-best method, NCNC (Wang et al., 2023), on denser graphs (Section 4.7). This is despite the fact that LPFormer can attend to a wider variety of nodes. We further show in Section 4.5 that the performance is stable with regard to the values of $\eta$ chosen, allowing us to easily choose a proper threshold on any dataset.

3.5. LPFormer

We now define the overall framework, LPFormer. The overall procedure is given in Figure 2: (1) We first learn node representations from the input adjacency and node features via an MPNN. We note that this step is agnostic to the target link. (2) For a target link $(a,b)$ we extract the nodes to attend to, i.e. $\hat{\mathcal{N}}(a,b)$. This is done via the PPR thresholding technique defined in Section 3.4. (3) We apply $L$ layers of attention, using the mechanism defined in Section 3.2. The output is the pairwise encoding $s(a,b)$. (4) We generate the prediction of the target link using three types of information: the element-wise product of the node representation, the pairwise encoding, and the number of CN, 1-Hop, and >1-Hop nodes identified by Eq. (10). The score function is given by:

(11) $p(a,b) = \sigma\left(\text{MLP}\left(\mathbf{h}_a \odot \mathbf{h}_b \,\|\, s(a,b) \,\|\, \lvert\hat{\mathcal{N}}^{\text{CN}}_{(a,b)}\rvert \,\|\, \lvert\hat{\mathcal{N}}^{1}_{(a,b)}\rvert \,\|\, \lvert\hat{\mathcal{N}}^{>1}_{(a,b)}\rvert\right)\right)$

We demonstrate in Section 4.4 that the inclusion of the node counts is helpful, as it provides complementary information to the pairwise encoding.
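To make the structure of Eq. (11) concrete, the sketch below assembles the concatenated input and applies a sigmoid. It is a simplification under stated assumptions: a single linear layer stands in for the MLP, and the function name, weights, and input values are hypothetical.

```python
import math


def link_score(h_a, h_b, s_ab, counts, weights, bias):
    """Eq. (11) with one linear layer standing in for the MLP (an assumption).

    h_a, h_b : node representations of the two endpoints
    s_ab     : pairwise encoding s(a, b)
    counts   : (number of CN, 1-Hop, >1-Hop nodes) kept by Eq. (10)
    """
    hadamard = [x * y for x, y in zip(h_a, h_b)]  # element-wise product
    features = hadamard + list(s_ab) + [float(c) for c in counts]  # concatenation
    if len(weights) != len(features):
        raise ValueError("weights must match the concatenated feature length")
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Made-up inputs: 2-dimensional node representations, a 1-dimensional pairwise encoding.
score = link_score([1.0, 2.0], [3.0, 4.0], [0.5], (1, 2, 3), [0.1] * 6, 0.0)
print(round(score, 4))
```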

4. Experiments

Table 1. Dataset statistics. The split ratio is the % of samples for train/validation/test.

| | Cora | Citeseer | Pubmed | ogbl-collab | ogbl-ddi | ogbl-ppa | ogbl-citation2 |
|---|---|---|---|---|---|---|---|
| #Nodes | 2,708 | 3,327 | 18,717 | 235,868 | 4,267 | 576,289 | 2,927,963 |
| #Edges | 5,278 | 4,676 | 44,327 | 1,285,465 | 1,334,889 | 30,326,273 | 30,561,187 |
| Split Ratio | 85/5/10 | 85/5/10 | 85/5/10 | 92/4/4 | 80/10/10 | 70/20/10 | 98/1/1 |

Table 2. Results on benchmark datasets. OOM is an out of memory error. Colored are the results ranked first, second, and third.

| | Cora | Citeseer | Pubmed | ogbl-collab | ogbl-ppa | ogbl-citation2 | Mean Rank |
|---|---|---|---|---|---|---|---|
| Metric | MRR | MRR | MRR | H@50 | H@100 | MRR | |
| CN | 20.99±0.00 | 28.34±0.00 | 14.02±0.00 | 56.44±0.00 | 27.65±0.00 | 51.47±0.00 | 11.0 |
| AA | 31.87±0.00 | 29.37±0.00 | 16.66±0.00 | 64.35±0.00 | 32.45±0.00 | 51.89±0.00 | 8.5 |
| RA | 30.79±0.00 | 27.61±0.00 | 15.63±0.00 | 64.00±0.00 | 49.33±0.00 | 51.98±0.00 | 8.7 |
| GCN | 32.50±6.87 | 50.01±6.04 | 19.94±4.24 | 44.75±1.07 | 18.67±1.32 | 84.74±0.21 | 8.0 |
| SAGE | 37.83±7.75 | 47.84±6.39 | 22.74±5.47 | 48.10±0.81 | 16.55±2.40 | 82.60±0.36 | 7.7 |
| GAE | 29.98±3.21 | 63.33±3.14 | 16.67±0.19 | OOM | OOM | OOM | NA |
| SEAL | 26.69±5.89 | 39.36±4.99 | 38.06±5.18 | 64.74±0.43 | 48.80±3.16 | 87.67±0.32 | 6.2 |
| NBFNet | 37.69±3.97 | 38.17±3.06 | 44.73±2.12 | OOM | OOM | OOM | NA |
| Neo-GNN | 22.65±2.60 | 53.97±5.88 | 31.45±3.17 | 57.52±0.37 | 49.13±0.60 | 87.26±0.84 | 7.0 |
| BUDDY | 26.40±4.40 | 59.48±8.96 | 23.98±5.11 | 65.94±0.58 | 49.85±0.20 | 87.56±0.11 | 5.7 |
| NCN | 32.93±3.80 | 54.97±6.03 | 35.65±4.60 | 64.76±0.87 | 61.19±0.85 | 88.09±0.06 | 3.8 |
| NCNC | 29.01±3.83 | 64.03±3.67 | 25.70±4.48 | 66.61±0.71 | 61.42±0.73 | 89.12±0.40 | 3.8 |
| LPFormer | 39.42±5.78 | 65.42±4.65 | 40.17±1.92 | 68.14±0.51 | 63.32±0.63 | 89.81±0.13 | 1.2 |

In this section, we conduct extensive experiments to validate the effectiveness of LPFormer. Specifically, we attempt to answer the following questions: (RQ1) Can LPFormer consistently outperform baseline methods on a variety of different benchmark datasets? (RQ2) Is LPFormer able to model a variety of different LP factors? (RQ3) Can LPFormer be run efficiently on large dense graphs? We further conduct studies ablating each component of our model and analyzing the effect of the PPR-based threshold on performance.

4.1. Experimental Settings

Datasets. We include Cora, Citeseer, and Pubmed (Yang et al., 2016) and ogbl-collab, ogbl-ppa, ogbl-ddi, and ogbl-citation2 (Hu et al., 2020). Furthermore, for Cora, Citeseer, and Pubmed we experiment under a single fixed split (see Appendix E.1 for further discussion). The detailed statistics for each dataset are shown in Table 1.

Baseline Models. We compare LPFormer against a wide variety of baselines including: CN (Newman, 2001), AA (Adamic and Adar, 2003), RA (Zhou et al., 2009), GCN (Kipf and Welling, 2017), SAGE (Hamilton et al., 2017), GAE (Kipf and Welling, 2016b), SEAL (Zhang and Chen, 2018), NBFNet (Zhu et al., 2021), Neo-GNN (Yun et al., 2021), BUDDY (Chamberlain et al., 2022), and NCNC (Wang et al., 2023). Results on Cora, Citeseer, and Pubmed are taken from Li et al. (2023). Results for the heuristic methods are from Hu et al. (2020). All other results are either from their respective study or Chamberlain et al. (2022).

Hyperparameters. The learning rate is tuned from {1e-3, 5e-3}, the decay from {0.95, 0.975, 1}, the dropout from [0, 0.7], and the weight decay from {0, 1e-4, 1e-7}. The size of the hidden dimension is set to 64 for ogbl-ppa and ogbl-citation2, 128 for Cora, Pubmed, and ogbl-collab, and 256 for Citeseer. Lastly, the PPR threshold is tuned from {1e-2, 1e-3, 1e-4}.

Evaluation Metrics. Each positive target link is evaluated against a set of given negative links. The rank of the positive link among the negatives is used to evaluate performance. The two types of metrics that are used to evaluate this ranking are Hits@K and MRR. For the OGB datasets we use the metric used in the original study. This includes Hits@50 for ogbl-collab, Hits@100 for ogbl-ppa, and MRR for ogbl-citation2. For Cora, Citeseer, and Pubmed we follow Li et al. (2023) and use MRR. Lastly, the same set of negative links is used for all positive links except on ogbl-citation2, where Hu et al. (2020) provide a customized set of 1000 negatives for each individual positive link.
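The two ranking metrics can be sketched as follows. This is an illustrative implementation rather than the official OGB evaluator: the function names are hypothetical, and counting ties against the positive link (a pessimistic rank) is an assumption made here for simplicity.

```python
def rank_of_positive(pos_score, neg_scores):
    """1-based rank of a positive link among its negatives (ties count against it)."""
    return 1 + sum(1 for s in neg_scores if s >= pos_score)


def mrr(ranks):
    """Mean reciprocal rank over all positive links."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of positive links ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up scores: three positive links sharing one set of four negatives.
positives = [0.9, 0.4, 0.7]
negatives = [0.8, 0.5, 0.3, 0.1]
ranks = [rank_of_positive(p, negatives) for p in positives]  # ranks == [1, 3, 2]
print(ranks, mrr(ranks), hits_at_k(ranks, 2))
```

For ogbl-citation2, where each positive link has its own negatives, the same functions apply with a per-link `neg_scores` list.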

4.2. Main Results

We present the results of LPFormer compared with baselines on multiple benchmark datasets. Note that we omit ogbl-ddi from the main results due to recent issues discovered by Li et al. (2023) (see Appendix E.2 for more details). The results are shown in Table 2. We observe that LPFormer can achieve SOTA performance on 5/6 datasets, significantly outperforming other baselines. Moreover, LPFormer is also the most consistent of all the methods, achieving strong performance on all datasets. This is as opposed to previous SOTA methods, NCNC and BUDDY, which tend to struggle on Cora and Pubmed. We attribute the consistency of LPFormer to the flexibility of our model, allowing it to customize the LP factors needed to each link and dataset.

4.3. Performance by LP Factor

In this section, we measure the ability of LPFormer to capture a variety of different LP factors. To measure this, we identify all positive target links where there is only one dominant LP factor. For example, one group would contain all target links where the only dominant factor is the local structural information. We focus on links that correspond to one of the three groups identified by Mao et al. (2023): local structural information, global structural information, and feature proximity.

We identify these groups by using popular heuristics as proxies for each factor. For local structural information, we use CNs (Newman, 2001), for global structural information we use PPR (Brin and Page, 1998) as it's the most computationally efficient of all global methods, and for feature proximity, we use the cosine similarity of the features. Using these heuristics, we determine if only one factor is dominant by comparing the relative score of each heuristic. This is done by first computing the score for each factor $i$ for the target link $(a,b)$, denoted $s^i(a,b)$. For each factor, we then compute the score corresponding to the $p$-th percentile among all links, $\hat{s}^i$. We choose a larger value of $p$ (i.e., 90%) such that a score $\geq \hat{s}^i$ indicates that a significant amount of pairwise information exists for that factor. For a single target link, we then compare the score of each factor $s^i(a,b)$ to $\hat{s}^i$. If $s^i(a,b) \geq \hat{s}^i$ holds for only one factor, then only that factor's score is “high”, meaning a notable amount of pairwise information exists for only one factor for the link $(a,b)$. In that case, only one factor is strongly expressed, and we assign the target link $(a,b)$ to that factor $i$. Please see Appendix E.4 for a more detailed explanation.
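The grouping procedure can be sketched as below. This is a minimal illustration, not the procedure from Appendix E.4: the nearest-rank percentile definition, the function names, and the toy scores are assumptions. A link is assigned to a factor only if exactly one of its heuristic scores reaches that factor's percentile cutoff.

```python
import math


def percentile(values, p):
    """Nearest-rank p-th percentile, with p in (0, 100]."""
    ordered = sorted(values)
    idx = max(0, math.ceil(p / 100.0 * len(ordered)) - 1)
    return ordered[idx]


def assign_dominant_factor(scores, p=90):
    """scores maps each factor name to a list with one heuristic score per link.

    Returns, for each link, the single factor whose score is >= its p-th
    percentile cutoff, or None when zero or several factors are high.
    """
    cutoffs = {factor: percentile(vals, p) for factor, vals in scores.items()}
    num_links = len(next(iter(scores.values())))
    assignment = []
    for i in range(num_links):
        high = [f for f, vals in scores.items() if vals[i] >= cutoffs[f]]
        assignment.append(high[0] if len(high) == 1 else None)
    return assignment


# Made-up heuristic scores for four links; p=75 keeps the toy example small.
toy_scores = {
    "local": [5, 0, 1, 0],             # e.g. number of common neighbors
    "global": [0.1, 0.9, 0.2, 0.8],    # e.g. PPR score
    "feature": [0.2, 0.1, 0.9, 0.95],  # e.g. cosine similarity of features
}
print(assign_dominant_factor(toy_scores, p=75))
```

One caveat of this sketch: when a heuristic is zero for most links, its cutoff can itself be zero, in which case every link counts as high for that factor; a practical implementation would likely need to guard against this.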

We demonstrate the results on Cora, Citeseer, and ogbl-collab in Figure 3. We observe that LPFormer typically performs best for each individual LP factor on all datasets. Furthermore, it is also the most consistently well-performing on each factor as compared to other methods. For example, on Cora the other methods struggle for links that correspond to the feature proximity factor. LPFormer, on the other hand, is able to significantly outperform them on those target links, performing around 33% better than the second best method. Lastly, we note that most methods tend to perform well on the links corresponding to the global factor, even if they don’t explicitly model such information. This is caused by a strong correlation that tends to exist between local and global structural information, often resulting in considerable overlap between both factors (Mao et al., 2023). These results show that LPFormer can indeed adapt to multiple types of LP factors, as it can consistently perform well on samples belonging to a variety of different LP factors. Additional results are given in Appendix E.5.

Figure 3. Performance on links that contain one dominant LP factor. Results are on (a) Cora, (b) Citeseer, and (c) ogbl-collab.

4.4. Ablation Study

We further include an ablation study to verify the effectiveness of the proposed components in LPFormer. In particular, we introduce 6 variants of LPFormer. (a) w/o Learnable Att: No attention is learned. As such, we set all attention weights to 1 and remove the RPE. (b) w/o Features in Att: We remove the node feature information from the attention mechanism. (c) w/o RPE in Att: We remove the RPE from the attention mechanism. (d) w/o PPR RPE: We replace the PPR-based RPE with a learnable embedding for each of CN, 1-Hop, and >1-Hop nodes. (e) w/o PPR RPE by Node Type: We don’t fit a separate function for each node type when determining the PPR RPE (see Section 3.3). Instead we use one for all nodes. (f) w/o Counts: We remove the counts of different nodes from the scoring function.

The results are shown in Table 3. We include ogbl-collab, ogbl-ppa, and Citeseer. We observe that ablating a component always decreases the performance. However, the magnitude of the decrease is dataset-dependent. For example, on ogbl-collab, ablating the feature information in the attention marginally affects the performance. However, on ogbl-ppa and Citeseer, removing the feature information results in a large decrease in performance. On the other hand, while removing learnable attention results in a modest decrease on ogbl-ppa, for the other two datasets we see a large drop. This highlights the importance of each component of our framework, as they are each necessary for consistently strong performance across multiple datasets.

Table 3. Ablation Study on LPFormer

| Method | ogbl-collab | ogbl-ppa | Citeseer |
|---|---|---|---|
| w/o Learnable Att | 65.05±0.50 | 62.77±1.03 | 56.23±1.75 |
| w/o Features in Att | 68.04±0.79 | 56.98±1.55 | 53.40±9.30 |
| w/o RPE in Att | 65.26±0.56 | 61.20±0.69 | 56.70±3.79 |
| w/o PPR RPE | 67.09±0.51 | 61.91±1.22 | 51.96±15.2 |
| w/o PPR RPE by Node Type | 67.95±0.54 | 62.92±1.06 | 57.40±5.71 |
| w/o Counts | 67.75±0.41 | 44.37±1.89 | 54.39±5.30 |
| LPFormer | **68.14±0.51** | **63.32±0.63** | **65.42±4.65** |

Table 4. Effect of Varying the PPR Thresholds

| Threshold | ogbl-collab (1-Hop) | ogbl-collab (>1-Hop) | ogbl-citation2 (1-Hop) | ogbl-citation2 (>1-Hop) |
|---|---|---|---|---|
| 1e-4 | 68.24±0.25 | 67.73±0.65 | 89.81±0.13 | 89.14±0.22 |
| 1e-2 | 67.60±0.31 | 68.24±0.25 | 89.49±0.18 | 89.81±0.13 |
| 1 | 67.08±0.65 | 68.14±0.51 | 89.49±0.16 | 89.26±0.39 |

4.5. Effect of the PPR Thresholds

We examine the effect of varying the PPR threshold for both 1-Hop and >1-Hop nodes as described in Eq. (10). The results for ogbl-collab and ogbl-citation2 are shown in Table 4. When varying the 1-Hop threshold, we fix the value of the >1-Hop threshold to 1e-2 for both datasets. When varying the >1-Hop threshold, we fix the value of the 1-Hop threshold to 1e-4 for both datasets.

We can observe that modifying the threshold has little effect on the underlying performance of the model. For both datasets, a value of 1e-2 works well for the >1-Hop threshold and 1e-4 works well for the 1-Hop threshold. We typically find that setting both values to 1e-2 provides a good trade-off between performance and efficiency.
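To make the role of the two thresholds concrete, the following minimal sketch filters candidate nodes for a target link by node type. It is only an illustration: the function name `filter_by_ppr`, the dictionary layout, the toy scores, and the rule that a node is kept when both of its PPR scores exceed the threshold for its type are assumptions made here, and common neighbors are not treated separately. Eq. (10) gives the actual criterion.

```python
def filter_by_ppr(ppr_a, ppr_b, one_hop_nodes, eta_one_hop=1e-4, eta_multi_hop=1e-2):
    """Keep node u only if both PPR scores exceed the threshold for its type.

    ppr_a, ppr_b: dicts mapping node -> PPR score with a (resp. b) as the root.
    one_hop_nodes: nodes adjacent to a or b; all other nodes count as >1-hop.
    The keep rule itself is an assumption of this sketch.
    """
    kept = []
    for u in sorted(set(ppr_a) & set(ppr_b)):
        eta = eta_one_hop if u in one_hop_nodes else eta_multi_hop
        if ppr_a[u] > eta and ppr_b[u] > eta:
            kept.append(u)
    return kept


# Toy scores, made up purely for illustration.
ppr_a = {"u1": 0.20, "u2": 0.003, "u3": 0.05, "u4": 0.00005}
ppr_b = {"u1": 0.10, "u2": 0.020, "u3": 0.04, "u4": 0.30}
one_hop = {"u1", "u4"}

kept_default = filter_by_ppr(ppr_a, ppr_b, one_hop)
kept_strict = filter_by_ppr(ppr_a, ppr_b, one_hop, eta_one_hop=1e-2, eta_multi_hop=1.0)
```

Under this rule, a threshold of 1 removes every node of that type, since a PPR score never exceeds 1. This is one way to read the last row of Table 4, though that reading is an interpretation rather than a statement from the experiments.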

4.6. Performance on HeaRT Setting

We further test the performance of our method under the HeaRT (Li et al., 2023) evaluation setting, which aims to be both more realistic and more difficult for link prediction. It achieves this by introducing a much harder and more realistic set of negative samples during evaluation. Li et al. (2023) observe that this results in a large decrease in performance on all datasets. Furthermore, compared to the original evaluation setting, MPNNs designed specifically for link prediction are often outperformed by heuristics or other MPNNs.
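As a rough illustration of the metric used in this setting (MRR), the sketch below ranks each positive link against its own list of negative scores. The function name, the toy scores, and the optimistic handling of ties are assumptions made for the example. It is not the HeaRT evaluation code, and it does not show how HeaRT generates its negatives.

```python
def mean_reciprocal_rank(pos_scores, neg_scores_per_pos):
    """MRR where each positive link is ranked against its own negatives."""
    reciprocal_ranks = []
    for pos, negs in zip(pos_scores, neg_scores_per_pos):
        # Rank = 1 + number of negatives scored strictly higher than the positive
        # (an optimistic treatment of ties, chosen here for simplicity).
        rank = 1 + sum(1 for s in negs if s > pos)
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Two positive links, each with three made-up negative scores.
pos_scores = [0.9, 0.4]
neg_scores = [[0.1, 0.3, 0.95], [0.1, 0.2, 0.3]]
mrr = mean_reciprocal_rank(pos_scores, neg_scores)  # ranks 2 and 1 -> (0.5 + 1.0) / 2
```

Harder negatives push more scores above the positive, which lowers the reciprocal rank. This matches the drop in performance reported by Li et al. (2023).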

The full results can be found in Table 5. We observe that LPFormer performs considerably better than all other models. For instance, the mean rank of LPFormer is 3.1x better than that of the second-best model, NCN (1.4 vs. 4.4). This shows the advantage of LPFormer, as it consistently achieves strong performance across all datasets under the much more challenging HeaRT evaluation setting. In contrast, other LP-specific methods often perform similarly to standard MPNN methods.

Table 5. Results (MRR) under HeaRT. Highlighted are the results ranked first, second, and third.

| Models | Cora | Citeseer | Pubmed | ogbl-collab | ogbl-ddi | ogbl-ppa | ogbl-citation2 | Mean Rank |
|---|---|---|---|---|---|---|---|---|
| CN | 9.78 | 8.42 | 2.28 | 4.20 | 6.71 | 25.70 | 17.11 | 11.1 |
| AA | 11.91 | 10.82 | 2.63 | 5.07 | 6.97 | 26.85 | 17.83 | 9.6 |
| RA | 11.81 | 10.84 | 2.47 | 6.29 | 8.70 | 28.34 | 17.79 | 8.1 |
| GCN | 16.61 ± 0.30 | 21.09 ± 0.88 | 7.13 ± 0.27 | 6.09 ± 0.38 | 13.46 ± 0.34 | 26.94 ± 0.48 | 19.98 ± 0.35 | 4.7 |
| SAGE | 14.74 ± 0.69 | 21.09 ± 1.15 | 9.40 ± 0.70 | 5.53 ± 0.5 | 12.60 ± 0.72 | 27.27 ± 0.30 | 22.05 ± 0.12 | 4.7 |
| GAE | 18.32 ± 0.41 | 25.25 ± 0.82 | 5.27 ± 0.25 | OOM | 3.49 ± 1.73 | OOM | OOM | NA |
| SEAL | 10.67 ± 3.46 | 13.16 ± 1.66 | 5.88 ± 0.53 | 6.43 ± 0.32 | 9.99 ± 0.90 | 29.71 ± 0.71 | 20.60 ± 1.28 | 6.4 |
| NBFNet | 13.56 ± 0.58 | 14.29 ± 0.80 | >24h | OOM | >24h | OOM | OOM | NA |
| BUDDY | 13.71 ± 0.59 | 22.84 ± 0.36 | 7.56 ± 0.18 | 5.67 ± 0.36 | 12.43 ± 0.50 | 27.70 ± 0.33 | 19.17 ± 0.20 | 5.9 |
| Neo-GNN | 13.95 ± 0.39 | 17.34 ± 0.84 | 7.74 ± 0.30 | 5.23 ± 0.9 | 10.86 ± 2.16 | 21.68 ± 1.14 | 16.12 ± 0.25 | 7.4 |
| NCN | 14.66 ± 0.95 | 28.65 ± 1.21 | 5.84 ± 0.22 | 5.09 ± 0.38 | 12.86 ± 0.78 | 35.06 ± 0.26 | 23.35 ± 0.28 | 4.4 |
| NCNC | 14.98 ± 1.00 | 24.10 ± 0.65 | 8.58 ± 0.59 | 4.73 ± 0.86 | >24h | 33.52 ± 0.26 | 19.61 ± 0.54 | 4.8 |
| LPFormer | 16.80 ± 0.52 | 26.34 ± 0.67 | 9.99 ± 0.52 | 7.62 ± 0.26 | 13.20 ± 0.54 | 40.25 ± 0.24 | 24.70 ± 0.55 | 1.4 |

4.7. Runtime Analysis

In this section, we compare the runtime of LPFormer against NCNC, which is the strongest performing baseline. The results on all four OGB datasets are shown in Figure 4, with the mean degree of each dataset given in parentheses. We observe that LPFormer is considerably faster on denser datasets, taking significantly less time to train one epoch. This holds even though LPFormer can attend to nodes beyond the 1-hop radius of the target link. This underscores the importance of the PPR thresholding technique introduced in Section 3.4, as it allows for efficient attention to a wider variety of nodes. Lastly, we note that LPFormer struggles on the ogbl-citation2 dataset due to the large number of nodes in the dataset (i.e., 2,927,963), which requires the sparse PPR matrix to be quite large. For future work, we plan to explore pre-computing the necessary PPR scores as an efficient pre-processing step, thereby removing the need to store the costly PPR matrix. Please see Appendix E.7 for more details.

Figure 4. Comparison of training time of 1 epoch between LPFormer and NCNC. The mean degree is in parentheses.

5. Conclusion

In this paper, we introduce a new framework, LPFormer, that aims to integrate a wider variety of pairwise information for link prediction. LPFormer does this via a specially designed graph transformer, which adaptively considers how the two nodes in a pair relate to each other in the context of the graph. Extensive experiments demonstrate that LPFormer can achieve SOTA performance on a wide variety of benchmark datasets while retaining efficiency. We further demonstrate LPFormer's advantage in modeling multiple types of LP factors. For future work, we plan on exploring other methods of incorporating multiple LP factors with an emphasis on global structural information. We also plan to investigate the potential of alternative relative positional encodings.

Acknowledgements.
This research is supported by the National Science Foundation (NSF) under grant numbers CNS 2246050, IIS1845081, IIS2212032, IIS2212144, IOS2107215, DUE 2234015, DRL 2025244 and IOS2035472, the Army Research Office (ARO) under grant number W911NF-21-1-0198, the Home Depot, Cisco Systems Inc, Amazon Faculty Award, Johnson&Johnson, JP Morgan Faculty Award and SNAP.

References

  • Abbas et al. (2021) Khushnood Abbas, Alireza Abbasi, Shi Dong, Ling Niu, Laihang Yu, Bolun Chen, Shi-Min Cai, and Qambar Hasan. 2021. Application of network link prediction in drug discovery. BMC bioinformatics 22 (2021), 1–21.
  • Adamic and Adar (2003) Lada A Adamic and Eytan Adar. 2003. Friends and neighbors on the web. Social networks 25, 3 (2003), 211–230.
  • Andersen et al. (2006) Reid Andersen, Fan Chung, and Kevin Lang. 2006. Local graph partitioning using pagerank vectors. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). IEEE, 475–486.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
  • Barabâsi et al. (2002) Albert-Laszlo Barabâsi, Hawoong Jeong, Zoltan Néda, Erzsebet Ravasz, Andras Schubert, and Tamas Vicsek. 2002. Evolution of the social network of scientific collaborations. Physica A: Statistical mechanics and its applications 311, 3-4 (2002), 590–614.
  • Bojchevski et al. (2020) Aleksandar Bojchevski, Johannes Gasteiger, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózemberczki, Michal Lukasik, and Stephan Günnemann. 2020. Scaling graph neural networks with approximate pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2464–2473.
  • Brin and Page (1998) Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1-7 (1998), 107–117.
  • Broder (1997) Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 21–29.
  • Brody et al. (2022) Shaked Brody, Uri Alon, and Eran Yahav. 2022. How Attentive are Graph Attention Networks?. In International Conference on Learning Representations. https://openreview.net/forum?id=F72ximsx7C1
  • Chamberlain et al. (2022) Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. 2022. Graph Neural Networks for Link Prediction with Subgraph Sketching. arXiv preprint arXiv:2209.15486 (2022).
  • Chen et al. (2022) Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. 2022. NAGphormer: A tokenized graph transformer for node classification in large graphs. In The Eleventh International Conference on Learning Representations.
  • Chen et al. (2021) Sanxing Chen, Xiaodong Liu, Jianfeng Gao, Jian Jiao, Ruofei Zhang, and Yangfeng Ji. 2021. HittER: Hierarchical Transformers for Knowledge Graph Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 10395–10407.
  • Chung (2007) Fan Chung. 2007. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences 104, 50 (2007), 19735–19740.
  • Correia et al. (2019) Gonçalo M Correia, Vlad Niculae, and André FT Martins. 2019. Adaptively Sparse Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2174–2184.
  • Daud et al. (2020) Nur Nasuha Daud, Siti Hafizah Ab Hamid, Muntadher Saadoon, Firdaus Sahran, and Nor Badrul Anuar. 2020. Applications of link prediction in social networks: A review. Journal of Network and Computer Applications 166 (2020), 102716.
  • Dong et al. (2017) Yuxiao Dong, Reid A Johnson, Jian Xu, and Nitesh V Chawla. 2017. Structural diversity and homophily: A study across more than one hundred big networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 807–816.
  • Flajolet et al. (2007) Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science Proceedings (2007).
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In International conference on machine learning. PMLR, 1263–1272.
  • Gleich et al. (2015) David F Gleich, Kyle Kloster, and Huda Nassar. 2015. Localization in seeded pagerank. arXiv preprint arXiv:1509.00016 (2015).
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in neural information processing systems 30 (2017).
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33 (2020), 22118–22133.
  • Huang et al. (2015) Hong Huang, Jie Tang, Lu Liu, JarDer Luo, and Xiaoming Fu. 2015. Triadic closure pattern analysis and prediction in social networks. IEEE Transactions on Knowledge and Data Engineering 27, 12 (2015), 3374–3389.
  • Huang et al. (2005) Zan Huang, Xin Li, and Hsinchun Chen. 2005. Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries. 141–142.
  • Katz (1953) Leo Katz. 1953. A new status index derived from sociometric analysis. Psychometrika 18, 1 (1953), 39–43.
  • Kim et al. (2022) Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Seunghoon Hong. 2022. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems 35 (2022), 14582–14595.
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
  • Kreuzer et al. (2021) Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. 2021. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems 34 (2021), 21618–21629.
  • Li et al. (2023) Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, and Dawei Yin. 2023. Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking. arXiv preprint arXiv:2306.10453 (2023).
  • Li et al. (2020) Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. 2020. Distance encoding: Design provably more powerful neural networks for graph representation learning. Advances in Neural Information Processing Systems 33 (2020), 4465–4478.
  • Liben-Nowell and Kleinberg (2003) David Liben-Nowell and Jon Kleinberg. 2003. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management. 556–559.
  • Mao et al. (2024) Haitao Mao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. 2024. Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All? Advances in Neural Information Processing Systems 36 (2024).
  • Mao et al. (2023) Haitao Mao, Juanhui Li, Harry Shomer, Bingheng Li, Wenqi Fan, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. 2023. Revisiting Link Prediction: A Data Perspective. arXiv:2310.00793 [cs.SI]
  • Maruf et al. (2019) Sameen Maruf, André FT Martins, and Gholamreza Haffari. 2019. Selective Attention for Context-aware Neural Machine Translation. In Proceedings of NAACL-HLT. 3092–3102.
  • Mialon et al. (2021) Grégoire Mialon, Dexiong Chen, Margot Selosse, and Julien Mairal. 2021. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667 (2021).
  • Müller et al. (2023) Luis Müller, Mikhail Galkin, Christopher Morris, and Ladislav Rampášek. 2023. Attending to graph transformers. arXiv preprint arXiv:2302.04181 (2023).
  • Murase et al. (2019) Yohsuke Murase, Hang-Hyun Jo, János Török, János Kertész, and Kimmo Kaski. 2019. Structural transition in social networks: The role of homophily. Scientific reports 9, 1 (2019), 4310.
  • Nassar et al. (2015) Huda Nassar, Kyle Kloster, and David F Gleich. 2015. Strong Localization in Personalized PageRank Vectors. In Proceedings of the 12th International Workshop on Algorithms and Models for the Web Graph-Volume 9479. 190–202.
  • Newman (2001) Mark EJ Newman. 2001. Clustering and preferential attachment in growing networks. Physical review E 64, 2 (2001), 025102.
  • Nickel et al. (2014) Maximilian Nickel, Xueyan Jiang, and Volker Tresp. 2014. Reducing the rank in relational factorization models by including observable patterns. Advances in Neural Information Processing Systems 27 (2014).
  • Pahuja et al. (2023) Vardaan Pahuja, Boshi Wang, Hugo Latapie, Jayanth Srinivasa, and Yu Su. 2023. A retrieve-and-read framework for knowledge graph link prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 1992–2002.
  • Rampášek et al. (2022) Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. 2022. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems 35 (2022), 14501–14515.
  • Rozemberczki et al. (2021) Benedek Rozemberczki, Carl Allen, and Rik Sarkar. 2021. Multi-scale attributed node embedding. Journal of Complex Networks 9, 2 (2021), cnab014.
  • Srinivasan and Ribeiro (2019) Balasubramaniam Srinivasan and Bruno Ribeiro. 2019. On the Equivalence between Positional Node Embeddings and Structural Graph Representations. In International Conference on Learning Representations.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. stat 1050 (2017), 20.
  • Wang et al. (2023) Xiyuan Wang, Haotong Yang, and Muhan Zhang. 2023. Neural Common Neighbor with Completion for Link Prediction. arXiv preprint arXiv:2302.00890 (2023).
  • Wu et al. (2022) Qitian Wu, Wentao Zhao, Zenan Li, David P Wipf, and Junchi Yan. 2022. Nodeformer: A scalable graph structure learning transformer for node classification. Advances in Neural Information Processing Systems 35 (2022), 27387–27401.
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning. PMLR, 40–48.
  • Ying et al. (2021) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems 34 (2021), 28877–28888.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 974–983.
  • Yun et al. (2021) Seongjun Yun, Seoyoon Kim, Junhyun Lee, Jaewoo Kang, and Hyunwoo J Kim. 2021. Neo-gnns: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems 34 (2021), 13683–13694.
  • Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. Advances in neural information processing systems 31 (2018).
  • Zhang et al. (2021a) Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. 2021a. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems 34 (2021), 9061–9073.
  • Zhang et al. (2021b) Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. 2021b. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems 34 (2021), 9061–9073.
  • Zhao et al. (2017) He Zhao, Lan Du, and Wray Buntine. 2017. Leveraging node attributes for incomplete relational data. In International conference on machine learning. PMLR, 4072–4081.
  • Zhou et al. (2009) Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. 2009. Predicting missing links via local information. The European Physical Journal B 71 (2009), 623–630.
  • Zhu et al. (2021) Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. 2021. Neural bellman-ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems 34 (2021), 29476–29490.

Appendix A Existing Formulations of Pairwise Encodings

In this section, we give an overview of existing formulations of pairwise encodings used in DP-MPNNs. The standard formulation of DP-MPNNs is given in Eq. 3.1, where $s(a,b)$ is the pairwise encoding. We briefly describe other existing solutions below.

NCN (Wang et al., 2023): NCN only considers the CNs of the target link $(a,b)$, summing the node representation of each. The pairwise encoding, $s(a,b)$, is written as:

(12) $s(a,b) = \sum_{u \in \mathcal{N}^{\text{CN}}_{(a,b)}} \mathbf{h}_u,$

where $\mathbf{h}_u$ is the node representation encoded by an MPNN.
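The sum in Eq. (12) can be sketched in a few lines of Python. The adjacency dictionary and the two-dimensional vectors in `h` below are made-up stand-ins for a real graph and for the MPNN outputs $\mathbf{h}_u$, and the function name `ncn_encoding` is hypothetical. This is a sketch of the formula, not the NCN implementation.

```python
def ncn_encoding(adj, h, a, b):
    """Sum the representations of the common neighbors of a and b (Eq. 12)."""
    common = adj[a] & adj[b]                 # N(a) ∩ N(b)
    dim = len(next(iter(h.values())))
    s = [0.0] * dim
    for u in common:
        for i, value in enumerate(h[u]):
            s[i] += value
    return s


# Toy undirected graph (node -> set of neighbors) and toy node representations.
adj = {0: {2, 3, 4}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1}, 4: {0}}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5], 3: [1.0, 2.0], 4: [3.0, 3.0]}

s_01 = ncn_encoding(adj, h, 0, 1)  # common neighbors are nodes 2 and 3
```

Node 4 is adjacent only to node 0, so it contributes nothing to `s_01`.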

NCNC (Wang et al., 2023): NCNC extends NCN by further considering the 1-hop neighbors of the node pair that aren't CNs. To account for the difference, each such neighbor is weighted by the probability that it is itself a CN of the other node in the pair. For a target link $(a,b)$, this is given as:

(13) $s(a,b) = \sum_{u \in \mathcal{V}} w(a,b,u) \: \mathbf{h}_u,$

where

(14) $w(a,b,u) = \begin{cases} 1, & \text{when } u \in \mathcal{N}^{\text{CN}}_{(a,b)} \\ \text{NCN}(A,X,b,u), & \text{when } u \in \mathcal{N}(a) \\ \text{NCN}(A,X,a,u), & \text{when } u \in \mathcal{N}(b) \\ 0, & \text{else} \end{cases}.$

This weighting scheme ensures that CNs play a larger role in the pairwise information than non-CNs.

BUDDY (Chamberlain et al., 2022): BUDDY counts the number of nodes that correspond to different labels given by the double radius node labeling trick (Zhang et al., 2021a). We first define the number of nodes that are a distance $d_a$ and $d_b$ from nodes $a$ and $b$ as $\mathcal{A}_{ab}[d_a, d_b]$. We further define the number of nodes where $\max(d_a, d_b) > k$ as $\beta_{ab}[d]$. The pairwise encoding concatenates the counts belonging to all combinations of $d = 1, \cdots, k$. The counts are estimated using subgraph sketching algorithms (Flajolet et al., 2007; Broder, 1997) and are denoted $\hat{\mathcal{A}}$ and $\hat{\beta}$. The pairwise encoding for a target link $(a,b)$ is given by the following, where $[k] = \{1, \cdots, k\}$:

(15) $s^{\hat{\mathcal{A}}}(a,b) = \operatorname*{\scalerel*{\|}{\sum}}_{d_a, d_b \in [k]} \hat{\mathcal{A}}_{ab}[d_a, d_b],$
(16) $s^{\hat{\beta}}(a,b) = \operatorname*{\scalerel*{\|}{\sum}}_{d \in [k]} \hat{\beta}_{ab}[d],$
(17) $s(a,b) = s^{\hat{\mathcal{A}}}(a,b) \operatorname*{\scalerel*{\|}{\sum}} s^{\hat{\beta}}(a,b).$
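To illustrate the structure of Eq. (15), the sketch below computes the counts $\mathcal{A}_{ab}[d_a, d_b]$ exactly with breadth-first search and concatenates them. This deliberately swaps the subgraph sketching estimators used by BUDDY for exact counting, and it omits the $\beta$ counts of Eq. (16). The function names and the toy graph are assumptions made for the example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def distance_count_encoding(adj, a, b, k):
    """Concatenate exact counts A_ab[d_a, d_b] for d_a, d_b in {1, ..., k}."""
    dist_a, dist_b = bfs_distances(adj, a), bfs_distances(adj, b)
    counts = {(da, db): 0 for da in range(1, k + 1) for db in range(1, k + 1)}
    for u in adj:
        key = (dist_a.get(u), dist_b.get(u))
        if key in counts:  # ignores a, b, and nodes farther than k from either
            counts[key] += 1
    # Row-major concatenation over (d_a, d_b).
    return [counts[(da, db)] for da in range(1, k + 1) for db in range(1, k + 1)]


# Toy undirected graph: node -> set of neighbors.
adj = {0: {2, 3, 6}, 1: {2, 4, 6}, 2: {0, 1}, 3: {0, 5}, 4: {1, 5}, 5: {3, 4}, 6: {0, 1}}
encoding = distance_count_encoding(adj, 0, 1, k=2)  # order: (1,1), (1,2), (2,1), (2,2)
```

Here nodes 2 and 6 are common neighbors, counted at $(1,1)$, and node 5 is two hops from both endpoints, counted at $(2,2)$. Exact BFS is shown only for clarity; the point of sketching is to avoid this per-link cost on large graphs.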

Neo-GNN (Yun et al., 2021): Neo-GNN considers the higher-order neighbor overlap between two nodes. This is done by first learning a structural representation $x_i^{struct}$ for each node $i$. This is given by:

(18) $x_i^{struct} = f_1\left(\sum_{j \in \mathcal{N}(i)} f_2\left(A_{ij}\right)\right).$

To consider the $L$-hop structural information, the structural representations are diffused over $L$ hops and weighted by a hyperparameter $\beta$:

(19) $Z = \text{MLP}\left(\sum_{l=1}^{L} \beta^{l-1} A^{l} X^{struct}\right),$
(20) $\text{where} \:\:\: X^{struct} = \text{diag}(x^{struct}).$

The pairwise encoding $s(a,b)$ is the dot product of the final representations of the two nodes,

(21) $s(a,b) = z_a^{T} z_b.$
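The diffusion in Eqs. (18) to (21) can be traced with a small numerical sketch. To keep it self-contained, the learnable functions $f_1$, $f_2$, and the MLP are all replaced by the identity, which is an assumption of this example and not how Neo-GNN is trained. The helper names and the toy adjacency matrix are likewise hypothetical.

```python
def matmul(P, Q):
    """Dense matrix product of two lists of lists."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def neo_gnn_encoding(A, a, b, L=2, beta=0.5):
    n = len(A)
    # Eq. (18) with f_1 and f_2 taken as identity: x_i is the degree of node i.
    x = [sum(A[i][j] for j in range(n)) for i in range(n)]
    # Eq. (20): diagonal matrix of structural features.
    X = [[float(x[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Eq. (19) with the MLP taken as identity.
    Z = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in A]  # holds A^l, starting at l = 1
    for l in range(1, L + 1):
        term = matmul(power, X)
        coeff = beta ** (l - 1)
        for i in range(n):
            for j in range(n):
                Z[i][j] += coeff * term[i][j]
        power = matmul(power, A)
    # Eq. (21): dot product of the rows for a and b.
    return sum(Z[a][j] * Z[b][j] for j in range(n))


# Star-like toy graph with edges 0-2, 1-2, 2-3.
A = [[0, 0, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
s_01 = neo_gnn_encoding(A, 0, 1)
```

With these simplifications, the rows of $Z$ for nodes 0 and 1 are both $[0.5, 0.5, 3, 0.5]$, so the score is dominated by their shared high-degree neighbor, node 2.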

Appendix B Special Cases of the General Pairwise Encoding

In this section, we demonstrate that multiple popular heuristics and pairwise encodings can be formulated as special cases of the general pairwise encoding given in Eq. (2).

Common Neighbors (CNs) (Newman, 2001): The CNs of a pair of nodes $(a,b)$ are defined as the overlapping 1-hop neighbors of both nodes:

(22) $\mathcal{N}^{\text{CN}}_{(a,b)} = \mathcal{N}(a) \cap \mathcal{N}(b).$

Eq. (2) reduces to the CN count when $h(a,b,u) = 1$ and $w(a,b,u)$ is:

(23) $w(a,b,u) = \begin{cases} 1, & \text{when } u \in \mathcal{N}(a) \cap \mathcal{N}(b) \\ 0, & \text{else} \end{cases}.$

Adamic-Adar (AA) (Adamic and Adar, 2003): AA is defined as the reciprocal log-degree weighted CN score, where $d_u$ is the degree of node $u$:

(24) $\text{AA}(a,b) = \sum_{u \in \mathcal{N}^{\text{CN}}_{(a,b)}} \frac{1}{\text{log}(d_u)}.$

Eq. (2) can be rewritten as the AA when $h(a,b,u) = 1/\text{log}(d_u)$ and $w(a,b,u)$ is equal to Eq. (23).

Resource Allocation (RA) (Zhou et al., 2009): RA is defined as the reciprocal degree weighted CN score:

(25) $\text{RA}(a,b) = \sum_{u \in \mathcal{N}^{\text{CN}}_{(a,b)}} \frac{1}{d_u}.$
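The three neighborhood heuristics in Eqs. (22), (24), and (25) can all be computed with one generic routine by varying only the per-node term. The sketch below assumes that the general encoding has the form $s(a,b) = \sum_{u} w(a,b,u)\, h(a,b,u)$. This form, the function names, and the toy graph are assumptions made for illustration, since Eq. (2) itself is not restated here.

```python
import math


def general_encoding(nodes, w, h, a, b):
    """Assumed general form: sum over u of w(a, b, u) * h(a, b, u)."""
    total = 0.0
    for u in nodes:
        weight = w(a, b, u)
        if weight != 0.0:  # skip non-contributing nodes (also avoids log(1) = 0)
            total += weight * h(a, b, u)
    return total


# Toy undirected graph: node -> set of neighbors.
adj = {0: {2, 3, 4}, 1: {2, 3}, 2: {0, 1, 4}, 3: {0, 1}, 4: {0, 2}}


def w_cn(a, b, u):
    """Indicator weight of Eq. (23): 1 if u is a common neighbor, else 0."""
    return 1.0 if u in adj[a] and u in adj[b] else 0.0


cn = general_encoding(adj, w_cn, lambda a, b, u: 1.0, 0, 1)
aa = general_encoding(adj, w_cn, lambda a, b, u: 1.0 / math.log(len(adj[u])), 0, 1)
ra = general_encoding(adj, w_cn, lambda a, b, u: 1.0 / len(adj[u]), 0, 1)
```

For the pair (0, 1), the common neighbors are nodes 2 (degree 3) and 3 (degree 2). The CN score is therefore 2, the AA score is $1/\log 3 + 1/\log 2$, and the RA score is $1/3 + 1/2$.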

Eq. (2) can be rewritten as the AA when h(a,b,u)=1/du𝑎𝑏𝑢1subscript𝑑𝑢h(a,b,u)=1/d_{u}italic_h ( italic_a , italic_b , italic_u ) = 1 / italic_d start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT and w(a,b,u)𝑤𝑎𝑏𝑢w(a,b,u)italic_w ( italic_a , italic_b , italic_u ) is equal to Eq. (23). Katz Index (Katz, 1953): The Katz index is a global structural measure. It is defined as weighted summation of the number of paths of different lengths connecting a𝑎aitalic_a and b𝑏bitalic_b. It is given by the following where the decay weight β[0,1]𝛽01\beta\in[0,1]italic_β ∈ [ 0 , 1 ],

(26) Katz(a,b)=l=1βlAa,bl.Katz𝑎𝑏superscriptsubscript𝑙1superscript𝛽𝑙superscriptsubscript𝐴𝑎𝑏𝑙\text{Katz}(a,b)=\sum_{l=1}^{\infty}\beta^{l}A_{a,b}^{l}.Katz ( italic_a , italic_b ) = ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT .

This is equivalent to Eq. (2) when:

$$w(a,b,u)=\sum_{l=1}^{\infty}\beta^{l}e_{a}^{T}A^{l}, \tag{27}$$

where $e_{i}\in\mathbb{B}^{\lvert\mathcal{V}\rvert}$ is a one-hot vector for a node $i$. We further set,

$$h(a,b,u)=\begin{cases}e_{b}^{T}, & \text{when } u=b\\ \mathbf{0}, & \text{else}\end{cases} \tag{28}$$
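The local and global heuristics in Eqs. (24)-(26) can be illustrated on a small graph. The following is a minimal sketch, not the authors' implementation: the toy adjacency dictionary `adj` and all function names are hypothetical, and the infinite Katz sum is truncated at `max_len` terms, which is only a good approximation when $\beta$ is small enough for the series to converge.

```python
import math

# Toy undirected graph as an adjacency dict (hypothetical example data).
adj = {
    0: {2, 3},
    1: {2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}

def common_neighbors(adj, a, b):
    """Nodes adjacent to both a and b."""
    return adj[a] & adj[b]

def adamic_adar(adj, a, b):
    """Eq. (24): sum of 1 / log(d_u) over the common neighbors u."""
    return sum(1.0 / math.log(len(adj[u])) for u in common_neighbors(adj, a, b))

def resource_allocation(adj, a, b):
    """Eq. (25): sum of 1 / d_u over the common neighbors u."""
    return sum(1.0 / len(adj[u]) for u in common_neighbors(adj, a, b))

def katz(adj, a, b, beta=0.1, max_len=50):
    """Eq. (26) truncated at max_len terms (an assumption of this sketch)."""
    # walks[v] holds entry v of e_a^T A^l, i.e. the number of length-l walks a -> v.
    walks = {v: 1.0 if v == a else 0.0 for v in adj}
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {v: 0.0 for v in adj}
        for v, count in walks.items():
            if count:
                for nbr in adj[v]:
                    nxt[nbr] += count
        walks = nxt
        score += (beta ** length) * walks[b]
    return score
```

For the pair (0, 1), the common neighbors are nodes 2 and 3, so AA and RA sum over degrees 3 and 4. Selecting entry $b$ of the accumulated row vector plays the role of $h(a,b,u)$ in Eq. (28).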

Personalized PageRank (PPR) Score (Brin and Page, 1998): The personalized PageRank score is the PageRank score localized to a root node $u$. The localization is via a teleportation probability $\alpha$ that transports the random walk back to the root node. We show that Eq. (2) can be rewritten as the PPR score when setting $h(a,b,u)$ equal to Eq. (28) and, following Chung (2007), setting $w(a,b,u)$ to:

$$w(a,b,u)=\alpha\sum_{l=0}^{\infty}(1-\alpha)^{l}e_{a}^{T}(D^{-1}A)^{l}. \tag{29}$$
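Eq. (29) can be evaluated numerically by truncating the series. The sketch below is illustrative only: `ppr_row`, the toy graph, and the choice of `num_terms` are assumptions, and it presumes every node has at least one neighbor so that $D^{-1}$ is defined.

```python
# Toy undirected graph (hypothetical example data); every node has a neighbor.
adj = {
    0: {2, 3},
    1: {2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}

def ppr_row(adj, a, alpha=0.15, num_terms=200):
    """Truncated Eq. (29): alpha * sum_l (1 - alpha)^l * e_a^T (D^{-1} A)^l."""
    dist = {v: 1.0 if v == a else 0.0 for v in adj}  # e_a^T (D^{-1} A)^0
    scores = {v: 0.0 for v in adj}
    weight = alpha                                   # alpha * (1 - alpha)^0
    for _ in range(num_terms):
        for v in adj:
            scores[v] += weight * dist[v]
        # One random-walk step: each node splits its mass evenly among its neighbors.
        nxt = {v: 0.0 for v in adj}
        for v, mass in dist.items():
            if mass:
                share = mass / len(adj[v])
                for nbr in adj[v]:
                    nxt[nbr] += share
        dist = nxt
        weight *= 1.0 - alpha
    return scores

row = ppr_row(adj, a=0)
```

The PPR score of a pair $(a,b)$ is then `row[b]`, which corresponds to applying $h(a,b,u)$ from Eq. (28). Because each term is a probability distribution scaled by $\alpha(1-\alpha)^{l}$, the entries of the row sum to approximately 1.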

Feature Similarity: The feature similarity of the pair of nodes $(a,b)$ is expressed by $\text{dis}(\mathbf{x}_{a},\mathbf{x}_{b})$, where $\mathbf{x}_{a}$ are the node features of node $a$ and $\text{dis}(\cdot)$ is a distance function (e.g., Euclidean distance). This can be rewritten as Eq. (2) by substituting:

$$w(a,b,u)=\text{dis}(\mathbf{x}_{a},\mathbf{x}_{u}), \tag{30}$$

and $h(a,b,u)=e_{b}^{T}$, where $e_{i}\in\mathbb{B}^{\lvert\mathcal{V}\rvert}$ is a one-hot vector for a node $i$.

NCN (Wang et al., 2023): The pairwise encoding used in NCN is defined as the summation of the representations for the CNs of a link. Eq. (2) can be rewritten as NCN when $w(a,b,u)$ is equal to Eq. (23) and $h(a,b,u)$ is equal to the representation of node $u$ encoded by an MPNN, i.e., $h(a,b,u)=\mathbf{h}_{u}$ where $H=\text{MPNN}(A,X)$.

NCNC (Wang et al., 2023): NCNC extends NCN by further weighting the 1-hop (non-CN) neighbors by their probability of linking to the other node. Given Eq. (2), the weight $w(a,b,u)$ is equal to the following:

$$w(a,b,u)=\begin{cases}1, & \text{when } u\in\mathcal{N}^{\text{CN}}_{(a,b)}\\ \text{NCN}(A,X,b,u), & \text{when } u\in\mathcal{N}(a)\\ \text{NCN}(A,X,a,u), & \text{when } u\in\mathcal{N}(b)\\ 0, & \text{else}\end{cases} \tag{31}$$

$\text{NCN}(A,X,a,u)$ is the probability of $a$ and $u$ being linked using the NCN model. We further define $h(a,b,u)=\mathbf{h}_{u}$.

Neo-GNN (Yun et al., 2021): The pairwise encoding used in Neo-GNN considers the higher-order neighborhood overlap between two nodes. The formulation is given in Section B. When $l=1$, it can be expressed using Eq. (2) by setting:

$$h(a,b,u)=f_{1}\left(\sum_{v\in\mathcal{N}(u)}f_{2}\left(A_{uv}\right)\right)^{2}, \tag{32}$$

and

$$w(a,b,u)=\begin{cases}1, & \text{when } u\in\mathcal{N}^{\text{CN}}_{(a,b)}\\ 0, & \text{else}\end{cases} \tag{33}$$

Appendix C Proof of Proposition 1

See Proposition 1.

Proof.

Per Chung (2007), the PPR vector for a root node $s$, $\text{pr}_{s}$, is equivalent to:

$$\text{pr}_{s}=\alpha\sum_{k=0}^{\infty}(1-\alpha)^{k}W^{k}x_{s}, \tag{34}$$

where $W$ is the random walk matrix and $x_{s}$ is a preference vector that is a one-hot vector for element $s$. We note that $\text{pr}_{s}(t)$ represents the landing probability of node $t$ given the root node $s$. As such, by definition, $\text{pr}_{s}(t)=\text{ppr}(s,t)$. Furthermore, it is clear that $r_{s}^{k}=W^{k}x_{s}\in\mathbb{R}^{\lvert\mathcal{V}\rvert}$ gives, for each node individually, the probability that a walk of length $k$ beginning at node $s$ ends at that node. Also, the probabilities of all walks of length $k$ are weighted by $\gamma^{k}=\alpha(1-\alpha)^{k}$. $\Gamma\left(a,b,u\right)$ can be obtained by first taking the sum of the PPR vectors for nodes $a$ and $b$,

$$\begin{aligned}\text{pr}_{a}+\text{pr}_{b}&=\alpha\sum_{k=0}^{\infty}(1-\alpha)^{k}W^{k}x_{a}+\alpha\sum_{k=0}^{\infty}(1-\alpha)^{k}W^{k}x_{b},\\ \text{pr}_{a,b}&=\alpha\sum_{k=0}^{\infty}(1-\alpha)^{k}W^{k}\left(x_{a}+x_{b}\right),\end{aligned} \tag{35}$$

where $\text{pr}_{a,b}=\text{pr}_{a}+\text{pr}_{b}$. From this, we can express $\Gamma(a,b,u)$ as:

$$\begin{aligned}\Gamma(a,b,u)&=\text{ppr}(a,u)+\text{ppr}(b,u),\\ &=\text{pr}_{a,b}(u),\\ &=\text{pr}_{a}(u)+\text{pr}_{b}(u),\end{aligned} \tag{36}$$

which, as shown in Eq. (35), is equivalent to the probability of a walk that originates from either node $a$ or $b$ and terminates at node $u$. This completes the proof. ∎
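The key step of the proof is the linearity of the PPR series in the preference vector. As a numerical sanity check rather than a substitute for the argument, the sketch below compares $\text{pr}_{a}(u)+\text{pr}_{b}(u)$ with the series applied to $x_{a}+x_{b}$. The helper `ppr_vector`, the toy graph, and the truncation at `num_terms` are assumptions of this example, and $W$ is taken to be the walk matrix that spreads a node's mass evenly over its neighbors.

```python
# Toy undirected graph (hypothetical example data); every node has a neighbor.
adj = {
    0: {2, 3},
    1: {2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}

def ppr_vector(adj, start, alpha=0.15, num_terms=200):
    """Truncated Eq. (34): alpha * sum_k (1 - alpha)^k W^k x for a preference vector x."""
    mass = {v: float(start.get(v, 0.0)) for v in adj}  # W^0 x
    pr = {v: 0.0 for v in adj}
    weight = alpha
    for _ in range(num_terms):
        for v in adj:
            pr[v] += weight * mass[v]
        # Apply W once: each node passes its mass evenly to its neighbors.
        nxt = {v: 0.0 for v in adj}
        for v in adj:
            if mass[v]:
                share = mass[v] / len(adj[v])
                for nbr in adj[v]:
                    nxt[nbr] += share
        mass = nxt
        weight *= 1.0 - alpha
    return pr

a, b = 0, 1
pr_a = ppr_vector(adj, {a: 1.0})
pr_b = ppr_vector(adj, {b: 1.0})
pr_ab = ppr_vector(adj, {a: 1.0, b: 1.0})  # series applied to x_a + x_b

# Gamma(a, b, u) = ppr(a, u) + ppr(b, u) for every node u.
gamma = {u: pr_a[u] + pr_b[u] for u in adj}
max_gap = max(abs(gamma[u] - pr_ab[u]) for u in adj)
```

On this toy graph the two quantities agree up to floating-point rounding, as the linearity argument predicts.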

Appendix D Attention Formulation

For a target link $(a,b)$, LPFormer attends to the nodes in the set $\bar{\mathcal{V}}(a,b)$. The attention mechanism used in LPFormer is defined in Section 3 as follows, where $w(a,b,u)$ is the attention weight of $u$ to the target link and $\bar{\mathcal{V}}(a,b)=\mathcal{V}\setminus\{a,b\}$:

$$\begin{aligned}\tilde{w}(a,b,u)&=\phi\left(\mathbf{h}_{a},\mathbf{h}_{b},\mathbf{h}_{u},\,\mathbf{rpe}_{(a,b,u)}\right),\\ w(a,b,u)&=\frac{\exp(\tilde{w}(a,b,u))}{\sum_{v\in\bar{\mathcal{V}}(a,b)}\exp(\tilde{w}(a,b,v))}.\end{aligned} \tag{37}$$

The function $\phi(\cdot)$ is modeled via the attention mechanism defined in GATv2 (Brody et al., 2022). We define $\mathbf{a}\in\mathbb{R}^{2d^{\prime}}$ and $W\in\mathbb{R}^{d\times d^{\prime}}$. The raw attention weights are then given by:

$$\tilde{w}(a,b,u)=\mathbf{a}^{T}\,\text{LeakyReLU}\left[W\,\mathbf{h}_{a}\,\|\,W\,\mathbf{h}_{b}\,\|\,W\,\mathbf{h}_{u}\,\|\,\mathbf{rpe}_{(a,b,u)}\right]. \tag{38}$$

The final attention weights, $w(a,b,u)$, are given by passing $\tilde{w}(a,b,u)$ through a softmax activation layer.
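The two steps above, the raw GATv2-style score in Eq. (38) followed by the softmax in Eq. (37), can be sketched in plain Python. This is only an illustration under stated assumptions: all names and toy values are hypothetical, `W` is stored with shape $d^{\prime}\times d$ so that `W h` is well defined, the attention vector is sized to match the concatenation used here, and the LeakyReLU slope of 0.2 is an assumed default.

```python
import math

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption of this sketch.
    return x if x > 0 else slope * x

def matvec(W, h):
    """Matrix-vector product for W given as a list of rows."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]

def raw_score(a_vec, W, h_a, h_b, h_u, rpe):
    """Eq. (38): a^T LeakyReLU([W h_a || W h_b || W h_u || rpe])."""
    concat = matvec(W, h_a) + matvec(W, h_b) + matvec(W, h_u) + list(rpe)
    assert len(a_vec) == len(concat), "attention vector must match the concatenation"
    return sum(a_i * leaky_relu(z_i) for a_i, z_i in zip(a_vec, concat))

def attention_weights(a_vec, W, h, rpe, a, b):
    """Eq. (37): softmax of the raw scores over all nodes u other than a and b."""
    candidates = [u for u in h if u not in (a, b)]
    raw = {u: raw_score(a_vec, W, h[a], h[b], h[u], rpe[u]) for u in candidates}
    top = max(raw.values())  # subtract the max for numerical stability
    exps = {u: math.exp(score - top) for u, score in raw.items()}
    total = sum(exps.values())
    return {u: value / total for u, value in exps.items()}

# Hypothetical toy inputs: 4 nodes, d = d' = 2, a 1-dimensional RPE per candidate.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5], 3: [-1.0, 2.0]}
rpe = {2: [0.3], 3: [-0.1]}
W = [[1.0, 0.5], [-0.5, 1.0]]
a_vec = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7]

weights = attention_weights(a_vec, W, h, rpe, a=0, b=1)
```

The resulting `weights` dictionary contains one entry per candidate node, and the entries sum to 1.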

Appendix E Additional Experimental Details

E.1. Planetoid splits

We note that for each of Cora, Citeseer, and Pubmed we use a fixed split. This follows the recent work of Li et al. (2023), who observe that for Cora, Citeseer, and Pubmed there exists no unified data split between studies. They find that while recent works (Chamberlain et al., 2022; Wang et al., 2023) use 10 random splits, prior works (Zhu et al., 2021; Velickovic et al., 2017) use a fixed split and train over 10 random seeds. Furthermore, there exist discrepancies in the preprocessing between those works that use the random splits: Chamberlain et al. (2022) only use the largest connected component of each dataset, while Wang et al. (2023) use the whole dataset. This makes any comparison of the published results difficult. Due to these discrepancies, we use the performance on the fixed split given by Li et al. (2023), as it’s the only split where all methods are evaluated and compared under the same setting.

E.2. Omission of ogbl-ddi under the Existing Evaluation

We further omit the results of ogbl-ddi in Table 2. This is due to the observation made by Li et al. (2023) that there exists a poor relationship between the validation and test performance. This extends to recent pairwise MPNNs, including NCN (Wang et al., 2023), Neo-GNN (Yun et al., 2021), and BUDDY (Chamberlain et al., 2022). This makes tuning on the validation set difficult, as it doesn’t guarantee good test performance. Due to this, they observe that when tuning on a fixed set of hyperparameter ranges, they are unable to achieve comparable results to the reported performance. Often they observe that the performance is actually much lower. Due to these concerns we believe ogbl-ddi is not suitable for the task of transductive link prediction and don’t report the performance. For more details and discussion, please see Appendix D in Li et al. (2023). However, they show that this problem does not afflict ogbl-ddi under the newly proposed HeaRT (Li et al., 2023) evaluation setting. As such, we further include the results for our method under HeaRT in Table 5.

E.3. Computation of the PPR Matrix

We compute the PPR matrix via the efficient approximation algorithm introduced by Andersen et al. (2006). The estimation is controlled by a tolerance parameter $\epsilon$, which governs both the speed of computation and the sparsity of the solution (i.e., a higher value of $\epsilon$ will produce a sparser PPR matrix). We use $\epsilon=1\times10^{-7}$ for Cora and Citeseer, $\epsilon=5\times10^{-5}$ for ogbl-collab and ogbl-ppa, $\epsilon=1\times10^{-5}$ for Pubmed, and $\epsilon=5\times10^{-3}$ for ogbl-citation2. The value of $\epsilon$ is chosen as a trade-off between accuracy and sparsity to allow for ease of storage in GPU memory.
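To make the role of $\epsilon$ concrete, the following sketch implements a simplified, non-lazy push procedure in the spirit of local PPR approximation. It is not the exact algorithm of Andersen et al. (2006) nor the code used for the experiments; the function name, the ring graph, and the parameter values are all hypothetical.

```python
from collections import deque

def approx_ppr(adj, source, alpha=0.15, eps=1e-4):
    """Push-style approximate PPR for one root node (simplified variant)."""
    p = {}                # approximate PPR scores, stored sparsely
    r = {source: 1.0}     # residual probability mass that is still to be pushed
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        deg = len(adj[u])
        res = r.get(u, 0.0)
        if res < eps * deg:
            continue      # residual is below tolerance, so leave it unpushed
        p[u] = p.get(u, 0.0) + alpha * res   # keep an alpha fraction at u
        r[u] = 0.0
        share = (1.0 - alpha) * res / deg    # spread the rest over the neighbors
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(adj[v]) and v not in queued:
                queue.append(v)
                queued.add(v)
    return p

# Hypothetical toy graph: a ring of 50 nodes.
n = 50
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

sparse = approx_ppr(ring, 0, eps=1e-2)   # loose tolerance, few nonzero scores
dense = approx_ppr(ring, 0, eps=1e-6)    # tight tolerance, more nonzero scores
```

With the looser tolerance, mass stops being pushed sooner, so `sparse` stores scores for fewer nodes than `dense`. This is the accuracy versus sparsity trade-off described above.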

E.4. Splitting Target Links by LP Factor

In Section 4.3 we demonstrate the performance on samples that correspond to a single LP factor. In this section we further detail the algorithm used to determine the set of samples corresponding to each factor. We consider the three main factors: local structural information, global structural information, and feature proximity. We measure each using a single representative heuristic: CNs (Newman, 2001) for local information, PPR (Brin and Page, 1998) for global information, and cosine feature similarity for feature proximity. For each sample, we check if the score is only high in one heuristic. In this way, it tells us that there is a dominant factor present in the pairwise information.

This determination is done by comparing the heuristic scores of each target link against a threshold value. For an LP factor $i$ and target link $(a,b)$, we denote the heuristic score as $s^{i}(a,b)$. The threshold value for factor $i$ is represented by $\hat{s}^{i}$. We desire $\hat{s}^{i}$ to be high, such that any score greater than or equal to it indicates that a plethora of pairwise information exists corresponding to factor $i$. This is done by setting the threshold equal to the $p$-th percentile value for that heuristic among all target links. For example, for CNs, the 80th percentile score on one dataset may be 9. The value of $p$ is chosen to be high (e.g., 80%) due to the aforementioned reasoning. Given these inputs, for each target link we compare the score for factor $i$ against the threshold value of that factor. Continuing our example, if $(a,b)$ only has 2 CNs, it is below the previously defined threshold. We only consider a sample as “belonging” to a single factor when $s^{i}(a,b)\geq\hat{s}^{i}$ is true for one and only one factor $i$. So if the heuristic score for $(a,b)$ is below the $p$-th percentile threshold for CNs and PPR but above it for feature similarity, then feature proximity will be considered the dominant LP factor. However, if it’s above the threshold for both local and global structural information, it will not be assigned to any group. This is done as we want to isolate links that only highly express one LP factor. This allows us to better understand how certain methods can model that specific factor. The detailed algorithm is given in Algorithm 1.

We note that a target link may not belong to any category. This can be due to there being either no dominant LP factor or more than one. We further set the percentile equal to 90% on all datasets except for ogbl-collab, for which we use 80%. These values were chosen as we wanted the percentile to be suitably high such that we are confident that the corresponding factor is relevant to the target link. Furthermore, we use a lower value for ogbl-collab as we found it produced a more even distribution of links by factor.

Figure 5. Performance for target links when there is only one LP factor strongly expressed. Results are on (a) Pubmed and (b) ogbl-ppa. We note that due to the quality of the features used, we omit the feature proximity factor for ogbl-ppa from our analysis.

E.5. Additional Results for the LP Factor Experiments

In Section 4.3 we observed the performance of various methods on target links where only a single LP factor is expressed. This is done through the use of heuristic scores. We further demonstrate the results on the Pubmed and ogbl-ppa datasets. Of note is that for ogbl-ppa the initial node features are one-hot vectors that signify the species that the protein belongs to. We observe that due to the sparseness of these features, feature proximity measures are unable to properly predict any target links on their own. As such, the factor corresponding to feature proximity is not expressed. We therefore exclude that factor for this analysis on ogbl-ppa.

The results for both Pubmed and ogbl-ppa datasets are given in Figure 5. As shown earlier in Figure 3, LPFormer can most consistently perform well across each factor. This suggests that LPFormer is best able to both model a variety of factors and adapt accordingly for each target link.

E.6. Performance on Heterophilic Datasets

In this section we evaluate LPFormer on multiple heterophilic datasets. Heterophily refers to the tendency of dissimilar nodes to be connected. This is as opposed to homophily, in which nodes with similar attributes are more likely to be connected. Since most graphs used for benchmark datasets tend to contain homophilic patterns, heterophilic graphs present an interesting challenge regarding the effectiveness of graph-based methods. For a more detailed discussion on heterophilic graphs, please see Mao et al. (2024).

We test on two prominent heterophilic datasets, Squirrel and Chameleon (Rozemberczki et al., 2021). The statistics for each are in Table 6. We limit our comparison to those LP methods that tend to achieve the best results, including GCN, BUDDY, and NCNC. In Table 7, we report the MRR over five random seeds. Note that we test under the original evaluation setting and not HeaRT. We observe that LPFormer can achieve a large increase over other methods, with a 14% and 9% increase in performance on Squirrel and Chameleon, respectively. These results indicate the superior ability of LPFormer to accurately model LP on heterophilic graphs, as compared to other methods.

Table 6. Heterophilic Dataset Statistics.

| Statistic | Squirrel | Chameleon |
| --- | --- | --- |
| #Nodes | 5,201 | 2,277 |
| #Edges | 198,353 | 31,371 |
| Split Ratio | 85/5/10 | 85/5/10 |

Table 7. Results on Heterophilic Datasets (MRR).

| Method | Squirrel | Chameleon |
| --- | --- | --- |
| GCN | 22.77 ± 4.54 | 20.74 ± 8.08 |
| BUDDY | 9.69 ± 0.99 | 6.30 ± 2.40 |
| NCNC | 32.37 ± 5.46 | 26.24 ± 3.37 |
| LPFormer | 36.77 ± 2.77 | 28.61 ± 6.68 |
| % Improvement | 14% | 9% |
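The reported improvements can be reproduced from the mean MRR values in Table 7, assuming the percentage is the relative gain of LPFormer over the strongest baseline (NCNC on both datasets). The helper name below is hypothetical.

```python
# Mean MRR values copied from Table 7.
mrr = {
    "Squirrel": {"NCNC": 32.37, "LPFormer": 36.77},
    "Chameleon": {"NCNC": 26.24, "LPFormer": 28.61},
}

def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

# Assumption: the improvement is measured against NCNC, the best baseline.
improvement = {
    name: relative_improvement(scores["LPFormer"], scores["NCNC"])
    for name, scores in mrr.items()
}
```

Under this assumption the values are roughly 13.6% for Squirrel and 9.0% for Chameleon, which round to the 14% and 9% shown in the table.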

E.7. More Efficiently Incorporating the PPR Scores

In Figure 4 we compare the training time between LPFormer and NCNC. We observe that on the denser datasets, ogbl-ppa and ogbl-ddi, LPFormer is considerably more efficient. Furthermore, on ogbl-collab, both methods have a fast runtime. However, we find that LPFormer struggles on ogbl-citation2 in comparison to NCNC. We observe that this is due to the need for the PPR matrix, which, while sparse, requires a large amount of memory and processing time. In the future, we plan to fix this problem by performing a simple and efficient pre-processing step. Specifically, before training, we can iterate over all target links and extract the relevant PPR scores. This would obviate the need to store the PPR matrix and to determine the nodes for each link during training. Furthermore, this only needs to be done once before tuning the model. This would greatly reduce the storage and time needed to train LPFormer on all datasets.
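One way such a pre-processing step could look is sketched below. This is purely illustrative and is not the authors' planned implementation: the data layout (`ppr_rows` as a dictionary of sparse rows), the function name, and the rule that a node is "relevant" when it has a score above `min_score` from both endpoints are all assumptions.

```python
def precompute_link_ppr(ppr_rows, target_links, min_score=0.0):
    """Cache, per target link (a, b), the nodes with a PPR score from both a and b.

    ppr_rows maps a root node to a sparse row {node: score} (hypothetical layout).
    """
    cache = {}
    for a, b in target_links:
        row_a = ppr_rows.get(a, {})
        row_b = ppr_rows.get(b, {})
        shared = {}
        for u, score_a in row_a.items():
            score_b = row_b.get(u, 0.0)
            # Assumed relevance rule: u is scored from both endpoints and is not an endpoint.
            if u not in (a, b) and score_a > min_score and score_b > min_score:
                shared[u] = (score_a, score_b)
        cache[(a, b)] = shared
    return cache

# Hypothetical sparse PPR rows and target links.
ppr_rows = {
    0: {0: 0.5, 2: 0.2, 3: 0.1},
    1: {1: 0.6, 2: 0.15, 4: 0.05},
    5: {5: 1.0},
}
links = [(0, 1), (0, 5)]
cache = precompute_link_ppr(ppr_rows, links)
```

After this pass, training would only need to look up `cache[(a, b)]` for each link instead of holding the full PPR matrix in memory.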

Algorithm 1 Determining Samples by LP Factor
Require:
    $\text{CN}(\cdot)$ = Maps $(i,j)$ to the number of CNs of the pair
    $\text{PPR}(\cdot)$ = Maps $(i,j)$ to the PPR score of the pair
    $\text{FS}(\cdot)$ = Maps $(i,j)$ to the feature cosine similarity of the pair
    $p$ = Percentile used to determine whether a factor is present
    $\mathcal{E}^{\text{test}}$ = Positive test links
// Compute the score corresponding to the $p$-th percentile for each heuristic
$\hat{s}^{\text{CN}}=\text{Percentile}(p,\{\text{CN}(i,j)\mid(i,j)\in\mathcal{E}^{\text{test}}\})$
$\hat{s}^{\text{FS}}=\text{Percentile}(p,\{\text{FS}(i,j)\mid(i,j)\in\mathcal{E}^{\text{test}}\})$
$\hat{s}^{\text{PPR}}=\text{Percentile}(p,\{\text{PPR}(i,j)\mid(i,j)\in\mathcal{E}^{\text{test}}\})$
Create empty lists $L^{\text{CN}}$, $L^{\text{PPR}}$, and $L^{\text{FS}}$
for $(i,j)\in\mathcal{E}^{\text{test}}$ do
    link-cn = $\text{CN}(i,j)$
    link-fs = $\text{FS}(i,j)$
    link-ppr = $\text{PPR}(i,j)$
    // Assign sample to corresponding list based on scores
    if link-cn $\geq\hat{s}^{\text{CN}}$ and link-fs $<\hat{s}^{\text{FS}}$ and link-ppr $<\hat{s}^{\text{PPR}}$ then
        Append($L^{\text{CN}}$, $(i,j)$)
    else if link-cn $<\hat{s}^{\text{CN}}$ and link-fs $\geq\hat{s}^{\text{FS}}$ and link-ppr $<\hat{s}^{\text{PPR}}$ then
        Append($L^{\text{FS}}$, $(i,j)$)
    else if link-cn $<\hat{s}^{\text{CN}}$ and link-fs $<\hat{s}^{\text{FS}}$ and link-ppr $\geq\hat{s}^{\text{PPR}}$ then
        Append($L^{\text{PPR}}$, $(i,j)$)
    end if
end for
return $L^{\text{CN}}$, $L^{\text{PPR}}$, $L^{\text{FS}}$
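Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names and toy scores are hypothetical, and the nearest-rank percentile used here is one of several common conventions, since the exact percentile definition is not specified above.

```python
import math

def percentile(values, p):
    """Nearest-rank p-th percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

def split_by_factor(test_links, heuristics, p=90):
    """Assign each link to a factor only if exactly one heuristic reaches its threshold."""
    thresholds = {
        name: percentile([score(i, j) for i, j in test_links], p)
        for name, score in heuristics.items()
    }
    groups = {name: [] for name in heuristics}
    for i, j in test_links:
        high = [name for name, score in heuristics.items()
                if score(i, j) >= thresholds[name]]
        # Equivalent to the three branches of Algorithm 1: one score is at or
        # above its threshold and the other two are strictly below theirs.
        if len(high) == 1:
            groups[high[0]].append((i, j))
    return groups

# Hypothetical toy scores for five test links.
links = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
cn_scores = dict(zip(links, [9, 1, 0, 2, 10]))
ppr_scores = dict(zip(links, [0.01, 0.5, 0.02, 0.03, 0.6]))
fs_scores = dict(zip(links, [0.1, 0.2, 0.9, 0.95, 0.3]))

heuristics = {
    "CN": lambda i, j: cn_scores[(i, j)],
    "PPR": lambda i, j: ppr_scores[(i, j)],
    "FS": lambda i, j: fs_scores[(i, j)],
}
groups = split_by_factor(links, heuristics, p=80)
```

In this toy example, link (8, 9) scores highly on both CN and PPR, so it is left unassigned, mirroring the case where more than one factor is dominant.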