Inference of Sequential Patterns for Neural Message Passing in Temporal Graphs

Jan von Pichowski   Vincenzo Perri   Lisi Qarkaxhija   Ingo Scholtes
Chair of Machine Learning for Complex Networks
Center for Artificial Intelligence and Data Science (CAIDAS)
Julius-Maximilians-Universität Würzburg, DE
Data Analytics Group, Department of Informatics, University of Zurich, Zurich, CH
Abstract

The modelling of temporal patterns in dynamic graphs is an important current research issue in the development of time-aware Graph Neural Networks (GNNs). However, whether or not a specific sequence of events in a temporal graph constitutes a temporal pattern depends not only on the frequency of its occurrence. We must also consider whether it deviates from what is expected in a temporal graph where timestamps are randomly shuffled. While accounting for such a random baseline is important to model temporal patterns, it has mostly been ignored by current temporal graph neural networks. To address this issue, we propose HYPA-DBGNN, a novel two-step approach that combines (i) the inference of anomalous sequential patterns in time series data on graphs based on a statistically principled null model, with (ii) a neural message passing approach that utilizes a higher-order De Bruijn graph whose edges capture overrepresented sequential patterns. Our method leverages hypergeometric graph ensembles to identify anomalous edges within both first- and higher-order De Bruijn graphs, which encode the temporal ordering of events. Consequently, the model introduces an inductive bias that enhances model interpretability.

We evaluate our approach for static node classification using established benchmark datasets and a synthetic dataset that showcases its ability to incorporate the observed inductive bias regarding over- and under-represented temporal edges. Furthermore, we demonstrate the framework’s effectiveness in detecting similar patterns within empirical datasets, resulting in superior performance compared to baseline methods in node classification tasks. To the best of our knowledge, our work is the first to introduce statistically informed GNNs that leverage temporal and causal sequence anomalies. HYPA-DBGNN represents a promising path for bridging the gap between statistical graph inference and neural graph representation learning, with potential applications to static GNNs.

1 Introduction

Graphs are powerful representations of complex data. Not surprisingly, there is a growing collection of successful methods for learning on graphs [29, 55, 22]. These methods are versatile and are widely used in bioinformatics [60], social sciences [44], and pharmacy [52]. While many methods assume a static graph, real-world scenarios often involve dynamic systems, such as evolving interactions in social networks. Although known techniques for static graphs can be applied to dynamic graphs [32], important patterns may be missed [57]. Recently, several approaches have incorporated temporal dynamics to obtain time-aware graph neural networks. These methods are applied to different tasks such as static node classification [45], link prediction or continuous node property prediction [47].

A common theme between static and temporal GNNs is that the observed graphs are usually directly used for message passing. Recently, data augmentation techniques have been proposed to improve the generalizability of GNNs. Such data augmentation techniques have been considered for a variety of reasons, such as to reduce oversquashing [53], improve class homophily for node classification [33], foster diffusion [62], or include non-dyadic relationships [45]. Another motivation that has recently been highlighted in [64] is the presence of noise in empirically observed graphs. This motivates augmentation techniques for GNNs that ideally prune spuriously observed edges, while adding erroneously unobserved edges.

However, addressing noise in observed graphs arguably requires graph correction methods accounting for a “random baseline” that allows us to distinguish significant patterns from noise, rather than augmentation methods that are based on heuristics or adjust the graph based on ground truth node classes. Moreover, the application of GNNs to temporal graphs introduces unique challenges for data augmentation, as we typically want to focus on temporal patterns that are due to the time-ordered sequence of events. To the best of our knowledge, no existing works have considered graph correction methods that combine a statistically principled inference of sequential patterns with temporal GNNs.

Addressing this research gap, in this work we propose HYPA-DBGNN, a novel two-step approach for temporal graph learning: In a first step we infer anomalous sequential patterns in time series data on graphs based on a statistical ensemble of temporal graphs, i.e. a null model of random graphs that preserves the frequency of time-stamped edges but randomizes the temporal ordering in which those edges occur. Building on the HYPA framework [31], our method leverages hypergeometric graph ensembles. This allows us to analytically calculate expected frequencies of node sequences on time-respecting paths, which is the basis to identify anomalous sequential patterns in temporal graphs. In a second step we apply neural message passing on an augmented higher-order De Bruijn graph, whose edges capture overrepresented sequential patterns in a temporal graph, thus introducing an inductive bias that emphasizes sequential patterns over mere edge frequencies. The contributions of our work are as follows:

  (i) We propose a novel approach to augment message passing based on a statistical null model. This allows us to infer which temporal sequences in a time-stamped interaction sequence are over- or under-represented compared to a random baseline temporal graph in which the frequencies of edges are preserved while their temporal ordering is shuffled.

  (ii) Building on this statistical inference approach, we propose HYPA-DBGNN, a time-aware temporal graph neural network architecture that specifically captures temporal patterns that deviate from a random baseline.

  (iii) We demonstrate our approach in synthetic temporal graphs sampled from a model that generates heterogeneously distributed temporal sequences of events in such a way that node classes are associated with the over- or under-representation of temporal events compared to random temporal orderings rather than mere frequencies.

  (iv) We demonstrate the practical relevance of our method by evaluating node classification in five empirical temporal graphs capturing time-stamped proximity events between humans. A comparison of HYPA-DBGNN with standard De Bruijn Graph Neural Networks without our HYPA-based inference reveals that our approach yields an improved accuracy in all five data sets. Moreover, a comparison to seven baseline techniques shows that our method yields the best performance on all empirical data sets.

  (v) We finally show that the distribution of HYPA scores in the augmented message passing graph, which captures the degree to which frequencies of temporal sequences deviate from a random baseline, enables us to explain why HYPA-DBGNN yields larger performance improvements on some data sets compared to others.

In contrast to prior works, we propose a statistically principled data augmentation for temporal graph neural networks that uses a statistical ensemble of temporal graphs with a given weighted topology. Apart from improving temporal GNNs, we further argue that the general approach of utilizing well-known statistical ensembles of graphs from network science for graph correction could help to improve the performance of GNNs in data affected by noise.

2 Related Work

Data augmentation for graphs has been explored from various directions with the goal of allowing machine learning models to better generalize and attend to signal over noise [66]. Many methods have utilized heuristic graph modification strategies like randomly removing nodes [59, 13], edges [46], or subgraphs [56, 59] to improve performance and generalizability. Other works have considered adding virtual nodes [43, 24] or rewiring the network topology, which also addresses oversquashing [53, 1], with graph transformers operating on a fully connected topology representing an extreme case [37, 58, 30]. Additionally, it has been shown that using graph diffusion convolutions instead of raw neighborhoods alleviates problems from noisy and arbitrarily defined edges in real-world graphs [16]. Network data augmentation has also been explored by going beyond pairwise connections, either through mediating node interactions via subgraphs [39, 3, 63, 9] or by utilizing higher-order graphs. Examples of higher-order approaches include simplicial networks [5], cellular complexes [4, 20], hypergraphs [23, 8, 17], and time-respecting node sequences [45]. Another area of research focuses on learning the graph augmentations from the data. One approach is to perform graph augmentation as a preprocessing step, completely separate from the downstream task, where the graph structure is cleaned before being used as input to the GNN [27, 65]. Other works embed the augmentation strategy into an end-to-end differentiable pipeline, jointly learning the optimal graph representation and the downstream task [26, 35, 15, 12, 28].

As our work addresses temporal graph data, it is related to the field of temporal GNNs. Temporal GNNs have been developed for both discrete and continuous time settings [34]. Discrete-time approaches segment the temporal data into time windows [49, 21], thus aggregating interactions and losing information on time-respecting paths within those time windows. In contrast, continuous-time approaches produce time-evolving node embeddings, focusing on the temporal variability of network activity at different time points rather than on the patterns occurring across temporally-ordered interaction sequences [57, 47]. The work most similar to our perspective is DBGNN [45], which learns from sequential correlations in high-resolution timestamped data. Our approach diverges from DBGNN by considering a more nuanced notion of the relevance of time-respecting paths. Rather than relying on the raw frequency of interactions, HYPA-DBGNN uses a statistically grounded anomaly score. This score quantifies the over- and under-representation of time-respecting paths, making the model less susceptible to noise while basing the contribution of paths on their statistical significance.

3 Background

A graph $G=(V,E)$ is defined as a set of nodes $V$ representing the elements of the system, and a set of edges $E\subseteq V\times V$ representing their direct connections. However, it is often important to consider how nodes influence one another through a path, which is an ordered sequence $(v_0,v_1,\dots,v_l)$ of nodes $v_i\in V$. In a path, all node transitions must correspond to edges in the graph, i.e., $e_i=(v_i,v_{i+1})\in E$ for all $i\in[0,l-1]$. Paths are often inferred from edges based on a transitivity assumption. This assumption states that if there is an edge $(v_0,v_1)$ with transition probability $\alpha$, and an edge $(v_1,v_2)$ with transition probability $\beta$, then the path $(v_0,v_1,v_2)$ will be observed with probability $\alpha\cdot\beta$. In other words, the transitions are considered to be independent. The transitivity assumption simplifies the modeling of a path by expressing its probability as the product of the individual edge transition probabilities.

However, this assumption often fails in temporal networks $G^{t}=(V,E^{t})$, where $E^{t}\subseteq V\times V\times\mathbb{N}$, as edges have timestamps. In temporal networks, the ordering of edges can play an important role in determining the likelihood of observing certain paths. A time-respecting path is defined as a sequence of edges $((v_0,v_1,t_1),\ldots,(v_{i-1},v_i,t_i),\ldots,(v_{n-1},v_n,t_n))$ that for all $i\in[2,n]$ respects two conditions: (i) transitions respect the order of time, $t_i>t_{i-1}$, and (ii) $t_i-t_{i-1}\leq\delta$, where $\delta$ is a parameter controlling the maximum time distance for considering interactions temporally adjacent. Therefore, different from what we would get by discarding time and using the transitivity assumption, the two edges $(u,v,t_1)$ and $(v,w,t_2)$ form a time-respecting path only if $t_2>t_1$.

To capture time-respecting sequential patterns, higher-order De Bruijn graphs model the probabilities of path sequences explicitly. These models construct a representation that respects the topology of the original graph and the frequencies of observed paths of a given length $k$. Specifically, a higher-order network of the $k$-th order is defined as an ordered pair $G^{(k)}=(V^{(k)},E^{(k)})$, where $V^{(k)}\subseteq V^{k}$ are the higher-order vertices, and $E^{(k)}\subseteq V^{(k)}\times V^{(k)}$ are the higher-order edges. Each higher-order vertex $v=:\langle v_0v_1\ldots v_{k-1}\rangle\in V^{(k)}$ is an ordered tuple of $k$ vertices $v_i\in V$ from the original graph. The higher-order edges connect higher-order nodes that overlap in exactly $k-1$ vertices, similar to the construction of high-dimensional De Bruijn graphs [10]. The weights of the higher-order edges in $G^{(k)}$ represent the frequency of paths of length $k$ in the original graph. Specifically, the weight of the edge $(\langle v_0\ldots v_{k-1}\rangle,\langle v_1\ldots v_k\rangle)$ counts how often the path $\langle v_0\ldots v_k\rangle$ of length $k$ occurs. By explicitly modeling the probabilities of these higher-order path sequences, the higher-order network representation can capture patterns and dependencies that may be missed when relying on the transitivity assumption [51].
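The construction of a second-order De Bruijn graph from time-stamped edges can be illustrated with a minimal Python sketch. It is not taken from the reference implementation: the function name `second_order_edge_weights`, the triple format `(source, target, timestamp)` and the toy data are assumptions made for illustration, and only paths of length two are counted.

```python
from collections import Counter, defaultdict


def second_order_edge_weights(temporal_edges, delta):
    """Count time-respecting paths u -> v -> w of length two.

    temporal_edges: list of (source, target, timestamp) triples (hypothetical format).
    delta: maximum time difference between two consecutive edges.
    Returns a Counter mapping ((u, v), (v, w)) to the path frequency, i.e. the
    weighted edges of a second-order De Bruijn graph.
    """
    # Index edges by their source node so that continuations are easy to look up.
    by_source = defaultdict(list)
    for source, target, timestamp in temporal_edges:
        by_source[source].append((target, timestamp))

    weights = Counter()
    for u, v, t1 in temporal_edges:
        for w, t2 in by_source[v]:
            # Condition (i): t2 > t1; condition (ii): t2 - t1 <= delta.
            if 0 < t2 - t1 <= delta:
                weights[((u, v), (v, w))] += 1
    return weights


# Toy temporal graph (illustrative data only).
edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 5), ("c", "a", 3)]
weights = second_order_edge_weights(edges, delta=2)
```

With `delta=2`, the edge `("b", "d", 5)` does not continue `("a", "b", 1)` because the time gap is too large, whereas a larger `delta` would add the second-order edge between `("a", "b")` and `("b", "d")`.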

Detection of Path Anomalies

Defining anomalies requires a reference base. In our case, the transitivity assumption provides the null model that serves as this baseline. Anomalies occur in sequences that deviate from this baseline, likely due to correlations and interdependencies not captured by the transitivity assumption. First, we discuss how the hypergeometric ensemble allows testing for anomalous edge frequencies based on node activity, i.e., their in- and out-degrees. Building on this, we then outline how this methodology is extended to test if the frequencies of paths of length $k$ are anomalous given those of paths of length $k-1$.

Configuration models [38] provide randomization methods for graphs that shuffle edges while preserving vertex degrees. In a nutshell, they first disassemble the graph, leaving nodes with in- and out-stubs. Then, a new network is reassembled by connecting pairs of in- and out-stubs that are picked with equal probability. This procedure is algorithmically straightforward but can be computationally expensive. To address this, Casiraghi and Nanumyan [6] contributed a closed-form expression for the soft configuration model, which fixes the expected vertex degrees rather than the exact degree sequence. In their formulation, the sampling of edges is equated to sampling from an urn. The authors introduce a combinatorial matrix $\mathbf{\Xi}\in\mathbb{N}^{n\times n}$, where $\Xi_{ij}=d^{out}_{i}\cdot d^{in}_{j}$ encodes the product of the out-degree of node $i$ and the in-degree of node $j$ in the original graph $G$. The total number of possible edge placements is then $M=\sum_{ij}\Xi_{ij}$. A network is sampled from this ensemble by drawing $m=\sum_{i}d^{out}_{i}=\sum_{i}d^{in}_{i}$ edges without replacement from the $M$ possible edge placements. The probability of observing $A_{ij}$ edges between nodes $i$ and $j$ is then given by the hypergeometric distribution:

$$P(A_{ij})=\binom{M}{m}^{-1}\binom{\Xi_{ij}}{A_{ij}}\binom{M-\Xi_{ij}}{m-A_{ij}}.$$

Having this probability mass function, we can use the equation above to quantify the anomalousness of the frequency of an edge. This closed-form expression and the sampling process that generates it provide a principled null model that preserves the expected degree sequence, which will be crucial for our subsequent analysis of anomalous path patterns in the network.
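The hypergeometric probability mass function above can be evaluated directly with integer binomial coefficients. The following minimal sketch assumes a small weighted adjacency matrix given as a list of lists; the function name `hypergeometric_edge_pmf` and the example matrix are illustrative assumptions, not part of the original work.

```python
from math import comb


def hypergeometric_edge_pmf(adjacency, i, j, a):
    """Return P(A_ij = a) under the hypergeometric ensemble.

    adjacency: square list of lists with non-negative integer edge counts.
    a: candidate number of edges between i and j, with 0 <= a <= m.
    """
    n = len(adjacency)
    d_out = [sum(adjacency[r]) for r in range(n)]
    d_in = [sum(adjacency[r][c] for r in range(n)) for c in range(n)]
    xi_ij = d_out[i] * d_in[j]        # Xi_ij = d_i^out * d_j^in
    m = sum(d_out)                    # number of sampled edges
    big_m = sum(d_out) * sum(d_in)    # M = sum_ij Xi_ij = (sum d^out) * (sum d^in)
    return comb(xi_ij, a) * comb(big_m - xi_ij, m - a) / comb(big_m, m)


# Toy graph: two edges 0 -> 1 and one edge 1 -> 0 (illustrative data only).
A = [[0, 2], [1, 0]]
pmf = [hypergeometric_edge_pmf(A, 0, 1, a) for a in range(4)]
```

For this toy graph, $m=3$, $M=9$ and $\Xi_{01}=4$, so the list `pmf` covers all possible values $A_{01}\in\{0,\dots,3\}$ and sums to one.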

Our concept of path anomalies, introduced by LaRock et al. [31], provides a statistical framework for identifying paths through a graph that are traversed with anomalous frequencies. The key idea is to define a null model of order $k-1$ that captures the expected frequencies of paths of length $k$, and then identify paths that deviate significantly from this null model. To construct the null model, one must establish a statistical ensemble of $k$-th order De Bruijn graphs. The starting point is the hypergeometric ensemble outlined in the previous paragraph, which preserves the total in- and out-degrees of nodes while shuffling the edge weights. For a De Bruijn graph of order $k$, the nodes’ degrees are determined by the frequencies of paths of length $k-1$, i.e., by the edge frequencies of the De Bruijn graph of order $k-1$. A hypergeometric ensemble of the De Bruijn graph presents one additional difficulty. Specifically, an edge between two $k$-th order nodes is valid only if their path representations overlap in $k-1$ first-order nodes. This implies that some of the entries of the matrix $\mathbf{\Xi}$ represent invalid paths. HYPA handles this by zeroing out impossible entries and redistributing their values through an optimization procedure, as detailed in the original work.
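The validity constraint on the entries of $\mathbf{\Xi}$ can be sketched as follows. This is only the zeroing-out step under stated assumptions: higher-order nodes are represented as Python tuples, the function name `de_bruijn_xi` and the degree dictionaries are hypothetical, and the subsequent redistribution of the removed mass through the optimization procedure of the original HYPA work is deliberately omitted.

```python
def de_bruijn_xi(nodes, d_out, d_in):
    """Build Xi for k-th order nodes, with invalid De Bruijn entries set to zero.

    nodes: list of tuples (v_0, ..., v_{k-1}).
    d_out, d_in: dicts mapping each higher-order node to its weighted degree.
    Note: the redistribution step used by HYPA is not implemented here.
    """
    xi = {}
    for u in nodes:
        for v in nodes:
            # Valid only if the last k-1 entries of u equal the first k-1 entries of v.
            valid = u[1:] == v[:-1]
            xi[(u, v)] = d_out[u] * d_in[v] if valid else 0
    return xi


# Toy second-order nodes and degrees (illustrative values only).
nodes = [("a", "b"), ("b", "c"), ("c", "a")]
d_out = {("a", "b"): 2, ("b", "c"): 1, ("c", "a"): 3}
d_in = {("a", "b"): 3, ("b", "c"): 2, ("c", "a"): 1}
xi = de_bruijn_xi(nodes, d_out, d_in)
```

In this toy example, the entry for the pair `("a", "b")`, `("b", "c")` is kept because the two nodes overlap in `"b"`, whereas the entry for `("a", "b")`, `("c", "a")` is zeroed.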

4 HYPA De Bruijn Graph Neural Network Architecture

We now introduce the HYPA-DBGNN architecture, which relies on statistically principled graph augmentation (a reference implementation, data sets and benchmarks are given at https://github.com/jvpichowski/HYPA-DBGNN). The temporal dynamics of the sequential patterns are encoded in first- and higher-order De Bruijn graphs. Graph corrections are then inferred that include anomaly statistics in the graph topology. We then present a multi-order augmented message passing scheme that relies on the inferred graphs with induced bias. Although we adapt the message passing procedure of Graph Convolutional Networks (GCN) from Kipf and Welling [29], our architecture is generalizable to other message passing schemes due to the selective additions.

Statistically Principled Graph Augmentation

As outlined before, the $k$-th-order De Bruijn graphs capture the observed frequencies of the $k$-th-order sequences through the edges between $k$-th-order nodes. This potentially biased representation yields the foundation for hypergeometric ensembles whose edge frequencies are induced by the $(k-1)$-th-order sub-sequences. The HYPA score [31], defined as $\mathrm{HYPA}^{(k)}(u,v)=\Pr(X_{uv}\leq f(u,v))$, uses these ensembles to quantify how likely it is that the frequency $X_{uv}$ of an edge in a random realization does not exceed its observed frequency $f(u,v)$. A large HYPA score encodes an over-represented edge, whereas a HYPA score approaching zero describes an edge that is observed less often than expected. Leveraging the HYPA scores as adjacency matrix $A_{uv}^{(k)}=\mathrm{HYPA}^{(k)}(u,v)$ leads to corrected graphs with reduced under-represented edges and balanced over-represented edges encoding the temporal pattern.
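Since the HYPA score is the cumulative distribution function of a hypergeometric random variable evaluated at the observed frequency, it can be sketched in a few lines. The function name `hypa_score` and its arguments are assumptions for illustration; the sketch takes $\Xi_{uv}$, $M$ and $m$ as given and does not perform the redistribution step required for higher-order De Bruijn graphs.

```python
from math import comb


def hypa_score(xi_uv, big_m, m, f_uv):
    """Return Pr(X_uv <= f_uv) for X_uv ~ Hypergeometric(M, Xi_uv, m).

    xi_uv: entry of the Xi matrix for edge (u, v).
    big_m: total number of possible edge placements M.
    m: number of sampled edges.
    f_uv: observed frequency of edge (u, v).
    """
    upper = min(f_uv, m)  # X_uv can never exceed the number of draws
    favourable = sum(
        comb(xi_uv, x) * comb(big_m - xi_uv, m - x) for x in range(upper + 1)
    )
    return favourable / comb(big_m, m)


# Illustrative values only: Xi_uv = 4, M = 9, m = 3.
low = hypa_score(4, 9, 3, 0)   # edge never observed
high = hypa_score(4, 9, 3, 3)  # edge observed in every draw
```

An edge that is observed less often than expected receives a score close to zero (`low`), while an edge observed as often as the ensemble permits receives a score of one (`high`).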

Message Passing for Higher-Order De Bruijn Graphs with Induced Bias

For layer $l$, we define the update rule of the message passing as

$$\vec{h}_{v}^{k,l}=\sigma\left(\mathbf{W}^{k,l}\sum_{\{u\in V^{(k)}:(u,v)\in E^{(k)}\}\cup\{v\}}\frac{\mathrm{HYPA}^{(k)}(u,v)\cdot\vec{h}^{k,l-1}_{u}}{\sqrt{H(v)\cdot H(u)}}\right), \tag{1}$$

with the previous hidden representation $\vec{h}_{u}^{k,l-1}$ of node $u\in V^{(k)}$, the inferred HYPA score $\mathrm{HYPA}^{(k)}(u,v)$ of the given edge $(u,v)\in E^{(k)}$ (capturing the induced bias), the trainable weight matrices $\mathbf{W}^{k,l}\in\mathbb{R}^{H^{l}\times H^{l-1}}$, the normalization factor based on the HYPA score sum of incoming edges $H(v):=\sum_{\{u\in V^{(k)}:(u,v)\in E^{(k)}\}\cup\{v\}}\mathrm{HYPA}^{(k)}(u,v)$, and the non-linear activation function $\sigma$, here ReLU.
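A dense NumPy sketch of one such layer may help to make the update rule concrete. It is not the reference implementation: the function name `hypa_gcn_layer` and the dense matrix representation are assumptions, and because the text does not state which score the self-loop term for $\{v\}$ receives, the sketch assumes a weight of 1, as is common for GCN-style self-loops.

```python
import numpy as np


def hypa_gcn_layer(hypa, h_prev, weight):
    """One message passing layer following Eq. (1) on a k-th order graph.

    hypa:   (n, n) array with hypa[u, v] = HYPA score of edge (u, v), 0 if absent.
    h_prev: (n, d_in) array of hidden representations of layer l-1.
    weight: (d_out, d_in) trainable matrix W^{k,l}.
    """
    a = hypa.astype(float).copy()
    # Assumption: the self-loop contribution of each node gets weight 1.
    np.fill_diagonal(a, 1.0)
    # H(v): sum of scores over incoming edges (column sums), including the self-loop.
    h_norm = a.sum(axis=0)
    # Normalise each entry a[u, v] by sqrt(H(u) * H(v)).
    a_norm = a / np.sqrt(np.outer(h_norm, h_norm))
    # Row v of the result aggregates messages from all in-neighbours u of v.
    aggregated = a_norm.T @ h_prev
    # Apply the weight matrix and the ReLU non-linearity.
    return np.maximum(aggregated @ weight.T, 0.0)


# Toy graph with a single edge 0 -> 1 of score 1 (illustrative data only).
hypa = np.array([[0.0, 1.0], [0.0, 0.0]])
h_prev = np.array([[1.0], [2.0]])
out = hypa_gcn_layer(hypa, h_prev, np.array([[1.0]]))
```

In the toy example, node 0 has no incoming edge and keeps its own feature, while node 1 combines its own feature with the message from node 0, both scaled by the normalization factor.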

Depending on the order $k$, the message passing for different higher-order graphs is based on different higher-order node sets $V^{(k)}$ whose nodes $v$ have their own hidden representations $\vec{h}^{k,l}_{v}$ for every layer $l$. An initial feature encoding is only provided for the first order ($k=1$) as $\vec{h}^{1,0}_{v}$. To transfer the features to higher-order nodes and to merge the hidden representations, we introduce two bipartite mappings.

The initial first-order feature set $\vec{h}^{1,0}_{u}$ is mapped to the higher-order node representations $\vec{h}^{k,1}_{v}$ using the bipartite graph $G^{b_0} = \left(V^{(k)} \cup V, E^{b_0} \subseteq V \times V^{(k)}\right)$ with $e_{uv} \in E^{b_0}$ if $v = (v_0, \dots, v_{k-1}) \in V^{(k)}$ and $v_0 = u$, in analogy to interpreting message passing layers as higher-order Markov chains.
Multiple first-order representations are aggregated using the function $\mathcal{F}$ (in our case MEAN) and transformed with the learnable weight matrix $\mathbf{W}^{b_0} \in \mathbb{R}^{H^{1,0} \times H^{k,0}}$ to the higher-order feature space.

$$\vec{h}^{k,1}_{v} = \sigma\left(\mathbf{W}^{b_0}\, \mathcal{F}\left(\left\{\vec{h}^{1,0}_{u} : u \in V^{(1)} \text{ with } (u,v) \in E^{b_0}\right\}\right)\right) \tag{2}$$
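As a minimal sketch of the first bipartite layer in Equation (2), the following pure-Python snippet lifts first-order features to second-order nodes. The function name `first_bipartite_layer`, the toy features and weights, the row-vector convention, and the choice of ReLU for $\sigma$ are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from statistics import fmean

def relu(x):
    return [max(0.0, xi) for xi in x]

def first_bipartite_layer(h1, higher_nodes, W):
    """Sketch of Eq. (2): lift first-order features to higher-order nodes.

    h1: dict first-order node -> feature list of length H1
    higher_nodes: list of tuples (v_0, ..., v_{k-1})
    W: H1 x Hk weight matrix as nested lists (row-vector convention)
    """
    out = {}
    for v in higher_nodes:
        # E^{b_0} links u to v whenever u is the first entry of v.
        neighbours = [h1[u] for u in h1 if u == v[0]]
        # F = MEAN, applied per feature dimension.
        agg = [fmean(col) for col in zip(*neighbours)]
        # Linear map into the higher-order feature space, then sigma.
        lifted = [sum(a * W[i][j] for i, a in enumerate(agg))
                  for j in range(len(W[0]))]
        out[v] = relu(lifted)  # sigma = ReLU is an assumption
    return out

# One-hot first-order features for three nodes (hypothetical toy input).
h1 = {"A": [1.0, 0.0, 0.0], "X": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}
W = [[0.5, -1.0], [2.0, 0.0], [-0.5, 1.5]]  # H1 = 3, Hk = 2
h2 = first_bipartite_layer(h1, [("A", "X"), ("X", "C")], W)
```

In this toy input each higher-order node has a single first-order neighbour, so the mean simply copies that neighbour's features before the linear map is applied.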

The second mapping is defined as in Qarkaxhija et al. [45] and is the counterpart to the first bipartite layer. Here, the higher-order node representations are summed with the first-order node representations (requiring matching representation dimensions $F^{g} = H^{l}$) if the last path entry $u_{k-1}$ of the higher-order node $u = (u_0, \dots, u_{k-1}) \in V^{(k)}$ equals the first-order node $v = u_{k-1}$. The bipartite graph is given as $G^{b} = \left(V^{(k)} \cup V, E^{b} \subseteq V^{(k)} \times V\right)$ and leads to a first-order node representation $\vec{h}^{b}_{v}$ with $v \in V$ and the learnable matrix $\mathbf{W}^{b} \in \mathbb{R}^{F^{g} \times H^{l}}$.

$$\vec{h}^{b}_{v} = \sigma\left(\mathbf{W}^{b}\, \mathcal{F}\left(\left\{\vec{h}^{k,l}_{u} + \vec{h}^{1,g}_{v} : u \in V^{(k)} \text{ with } (u,v) \in E^{b}\right\}\right)\right) \tag{3}$$
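The merge step in Equation (3) can be sketched in the same spirit. The snippet below is an illustration under stated assumptions: the name `second_bipartite_layer`, the toy representations, the identity weight matrix, ReLU as $\sigma$, and the decision to skip first-order nodes without any incoming higher-order node are all hypothetical choices.

```python
from statistics import fmean

def second_bipartite_layer(hk, h1, W):
    """Sketch of Eq. (3): merge higher-order and first-order representations.

    hk: dict higher-order node (tuple) -> representation of length H
    h1: dict first-order node -> representation of the same length H
    W: H x H_out weight matrix as nested lists (row-vector convention)
    """
    out = {}
    for v, hv in h1.items():
        # E^b links u = (u_0, ..., u_{k-1}) to v whenever u_{k-1} == v.
        summed = [[a + b for a, b in zip(hu, hv)]
                  for u, hu in hk.items() if u[-1] == v]
        if not summed:
            continue  # assumption: v ends no higher-order node, so skip it
        agg = [fmean(col) for col in zip(*summed)]  # F = MEAN
        mapped = [sum(a * W[i][j] for i, a in enumerate(agg))
                  for j in range(len(W[0]))]
        out[v] = [max(0.0, m) for m in mapped]  # sigma = ReLU (assumption)
    return out

hk = {("A", "X"): [1.0, 0.0], ("B", "X"): [3.0, 2.0], ("X", "C"): [0.0, 4.0]}
h1 = {"A": [1.0, 1.0], "B": [0.0, 0.0], "X": [1.0, 1.0], "C": [2.0, -1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the toy output readable
hb = second_bipartite_layer(hk, h1, W)
```

Node X receives the mean of two summed representations because both $(A, X)$ and $(B, X)$ end in X, whereas C receives a single contribution from $(X, C)$.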

An overview of the inference process and the proposed neural network architecture for the first- and second-order graph is shown in Figure 1. A one-hot encoding is used as the first-order feature set and transferred to the higher-order nodes using the first bipartite layer. Multiple message passing steps are performed independently for the two given graph topologies. The number of message passing rounds and the dimensions of the layers may vary between the two parts. After performing the message passing in parallel and merging the features with the second bipartite layer, a final classification layer is applied. We discuss the computational complexity in Appendix B; it is upper-bounded by the complexity of DBGNN because the graph correction removes edges.

Figure 1: Inference procedure leading to the dynamic graph used for neural message passing. (a) Example of sequence data adapted from LaRock et al. [31]. (b) First- (blue) and higher-order (orange) De Bruijn graphs encoding temporally ordered time-stamped edges are compared to a random graph ensemble null model with shuffled time-stamped $(k-1)$-order edges. (c) The graphs are corrected by introducing a statistically principled bias that revalues all edges ($w_{\langle AXC\rangle} \approx w_{\langle BXD\rangle} > w_{\langle BXC\rangle}$) and removes under-represented edges, i.e. edges that with high probability appear less often than expected ($\langle AXD\rangle$). (d) The multi-order graph neural network is trained respecting the inferred graphs.

5 Experimental Evaluation

We compare our architecture with graph representation learning methods (EVO [2], HONEM [48], DeepWalk [42] and Node2Vec [18]) and deep graph learning methods (GCN [29], LGNN [7] and DBGNN [45]). For the representation learning models Node2Vec and EVO we adhere to the original configurations, i.e. we use an embedding size of $d=128$ and a random walk length of $l=80$, repeated $r=10$ times. As context size we use $k=10$. For Node2Vec we select the return parameter ($p$) and the in-out parameter ($q$) from the set $\{0.25, 0.5, 1, 2, 4\}$. The deep learning models (GCN, LGNN, DBGNN, and our proposed model) consist of three layers. Following the approach of [45], we set the size of the last layer to $h_2=16$, while the sizes of the preceding layers are determined during model selection. The search range for $h_0$ and $h_1$ is $\{4, 8, 16, 32\}$, with training running for a maximum of 5000 epochs as per [45]. The higher-order path length is fixed to $k=2$ for HYPA-DBGNN and DBGNN because Qarkaxhija et al. [45] show it to be optimal for the given data sets. We use Stochastic Gradient Descent (SGD) as the optimizer with a learning rate of $0.001$, which showed the best performance. We use dropout regularization with a dropout rate of $0.4$ to mitigate overfitting and we incorporate class weights in the loss function to address imbalanced training datasets.
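The model-selection settings for the deep learning models can be written down as a small configuration sketch. The numeric values are those stated above; the dictionary keys, the function name `candidate_configs`, and the assumption that the search is an exhaustive grid over all $(h_0, h_1)$ pairs are hypothetical.

```python
from itertools import product

# Values stated in the text; the key names are hypothetical.
FIXED = {"h2": 16, "order_k": 2, "learning_rate": 0.001,
         "dropout": 0.4, "max_epochs": 5000}
HIDDEN_SIZES = [4, 8, 16, 32]

def candidate_configs():
    """Yield one configuration per (h0, h1) pair in the search range."""
    # Assumption: every combination of h0 and h1 is tried.
    for h0, h1 in product(HIDDEN_SIZES, repeat=2):
        yield {"h0": h0, "h1": h1, **FIXED}

configs = list(candidate_configs())
```

Under the exhaustive-grid assumption this yields 16 candidate configurations per model and data split.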

The data sets used do not have an independent test set, nor are they large enough to define a robust dedicated test set. This makes it challenging to directly evaluate model performance on unseen data. To compare various Graph Neural Network (GNN) architectures, we adopt a conventional approach as documented in the literature [11, 40, 25]. To assess model generalizability, we employ a nested cross-validation strategy with $N=10$ repetitions. The data is partitioned into ten stratified folds, nine for training and one for testing; within each repetition, the training folds are further divided into stratified training and validation subsets (80%/20%). Subsequently, we select the best-performing model and epoch based on validation set performance. Finally, we evaluate the chosen model's performance on the test set, reporting the mean and standard deviation of the respective metric across all $N$ repetitions. For comparability, we use the same folds and splits for all experiments. Besides the random splits, the random initialization of the model also contributes to the variability captured by the standard deviation. For reproducibility, we fix the random splits and reuse a common seed in every repetition for the random initialization of model weights and dropout candidates.
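The splitting scheme described above can be sketched with the standard library alone. This is an illustration, not the authors' pipeline: the function names, the round-robin way of stratifying, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, rng):
    """Deal the shuffled indices of every class round-robin into n_folds folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        for i, idx in enumerate(members):
            folds[(offset + i) % n_folds].append(idx)
        offset += len(members)  # continue dealing where the last class stopped
    return folds

def nested_splits(labels, n_folds=10, val_fraction=0.2, seed=0):
    """Yield (train, val, test) index lists, one triple per outer fold."""
    rng = random.Random(seed)  # fixed seed so all models see the same splits
    folds = stratified_folds(labels, n_folds, rng)
    for test in folds:
        test_set = set(test)
        rest = [i for i in range(len(labels)) if i not in test_set]
        # Inner stratified split: one of five folds of the remainder
        # becomes the validation set, which gives the 80/20 ratio.
        inner = stratified_folds([labels[i] for i in rest],
                                 round(1 / val_fraction), rng)
        val = [rest[j] for j in inner[0]]
        train = [rest[j] for fold in inner[1:] for j in fold]
        yield train, val, test

labels = [0] * 60 + [1] * 40  # hypothetical class labels
splits = list(nested_splits(labels))
```

Each triple would then be used to train on `train`, pick the model and epoch on `val`, and report the metric on `test`; the mean and standard deviation are taken over the ten triples.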

5.1 Experimental Results for Synthetic Data Sets

We use synthetic data with two classes of nodes $C=\{A,B\}$ to demonstrate the type of patterns that only our model can learn. The characteristic properties of the data and its derivation from the configuration model are detailed in Section A.1. Importantly, it contains a heterogeneous distribution of sequences (e.g. $\langle v_0, v_1, v_2\rangle_f$) of time-stamped edges or events (here: $(v_0,v_1)_{t_0}, (v_1,v_2)_{t_1}$ with $t_0 < t_1$) between nodes. The learnable sequential pattern is an increased class-assortativity, i.e. edges are temporally ordered such that events within the same class preferentially follow each other (e.g. leading to $\langle A,A,A,B\rangle$). Hence, these higher-order sequences with nodes from the same group are over-expressed compared to what we would expect when shuffling the temporal order of the time-stamped edges between the nodes (e.g. $\langle A,B,A,A\rangle$).

The pattern is only discernible by higher-order models because it is restricted to higher-order sequences. For a homogeneous sequence distribution, the pattern of overrepresented sequences would be reflected in the mere frequencies. However, due to the initial heterogeneous distribution, overrepresented sequences can also have low frequencies (e.g. $\langle A,X,C\rangle$ in Figure 1). Thus, they are also unobservable by higher-order baselines such as DBGNN that rely on frequencies alone. However, comparing the observed frequencies with a null model that preserves the frequency of time-stamped edges but randomizes the temporal order reveals the sequential pattern.
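A toy calculation illustrates why raw frequencies and overrepresentation can disagree. The sketch below scores an observed count $f$ by the cumulative hypergeometric probability $\Pr(X \le f)$, using only `math.comb`. The urn parameters are invented for illustration, and the snippet does not reproduce how HYPA [31] derives its ensemble from the data; it only shows the general idea of comparing a count against a null expectation.

```python
from math import comb

def hypergeom_cdf(f, population, successes, draws):
    """Pr(X <= f) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(f, successes, draws)
    # comb() returns 0 for impossible outcomes, so no lower bound is needed.
    favourable = sum(comb(successes, i) * comb(population - successes, draws - i)
                     for i in range(upper + 1))
    return favourable / total

# Hypothetical urn parameters: both paths are drawn 20 times from 100 slots.
# The rare path owns 5 slots (expected count 1), the common one 50 (expected 10).
rare_observed, common_observed = 4, 5
score_rare = hypergeom_cdf(rare_observed, 100, 5, 20)
score_common = hypergeom_cdf(common_observed, 100, 50, 20)
```

Although the rare path is observed less often than the common one, its score is close to one (overrepresented), while the score of the common path is close to zero (underrepresented). A model that only sees the two counts would rank them the other way round.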

We use two synthetic data sets with the same distribution of time-stamped edges. Unweighted Sampling contains sequences with randomized temporal order of time-stamped edges, whereas Weighted Sampling contains sequences with increased class-assortativity. An unintended correlation between the obtained graph topology and the event classes cannot be excluded for either data set. Weighted Sampling additionally contains the preferential chaining pattern.

The results in Table 1 show the capabilities of the models in terms of accuracy in solving the respective binary node classification tasks. The different methods yield varying results for the synthetic data set without the intended pattern. All representation learning methods with a horizon of $l=80$, except EVO, perform better than the deep graph learning methods with a smaller horizon of $l=3$. Our approach performs as well as DBGNN, which shares architectural similarities such as two distinct message passing modules.

The Weighted Sampling data set highlights the ability of the methods to learn the intended pattern. For this second data set, GCN performs worse and all other baseline methods perform the same as on the first one. In contrast, HYPA-DBGNN improves by 55 percentage points and reaches an accuracy of $100\%$. These observations indicate that some of the current baselines are able to learn an unintended pattern present in both data sets. However, they fail to learn the implanted pattern of increased class-assortativity, whereas HYPA-DBGNN is able to learn it.

Table 1: Comparison of HYPA-DBGNN with baselines for the synthetic data sets. The table presents the balanced accuracy and its standard deviation for the static node classification task on dynamic graphs as obtained through the outlined experiments. The Unweighted Sampling data set contains a heterogeneous sequence distribution of time-stamped edges with shuffled temporal order. The adapted distribution of sequences in Weighted Sampling encodes a sequential pattern such that time-stamped edges between nodes of the same class are overrepresented but not necessarily very frequent.
| Representation Learning | EVO | HONEM | DeepWalk | Node2Vec |
| --- | --- | --- | --- | --- |
| Unweighted Sampling | 40.00 ± 31.62 | 80.00 ± 25.82 | 60.00 ± 21.08 | 60.00 ± 21.08 |
| Weighted Sampling | 40.00 ± 31.62 | 80.00 ± 25.82 | 60.00 ± 21.08 | 60.00 ± 21.08 |

| Deep Graph Learning | GCN | LGNN | DBGNN | HYPA-DBGNN |
| --- | --- | --- | --- | --- |
| Unweighted Sampling | 50.00 ± 33.33 | 50.00 ± 0.00 | 45.00 ± 28.38 | 45.00 ± 15.81 |
| Weighted Sampling | 45.00 ± 28.38 | 50.00 ± 0.00 | 45.00 ± 15.81 | 100.00 ± 0.00 |

5.2 Experimental Results for Empirical Data Sets

Our experiments leverage the five empirical time series datasets on dynamic graphs from [45]. This work also provides the optimal order of the higher-order model and the $\delta$ value (the maximum time difference for edges to be considered part of a causal walk) for generating the time-respecting paths within each dataset. The data sets are Highschool2011 and Highschool2012 [14], Hospital [54], StudentSMS [50], and Workplace2016 [19]. This section addresses the question of how our architecture compares to the described baselines on these empirical data sets. The mean balanced accuracy and its standard deviation are reported in Table 2.

We reproduce the superior results of DBGNN compared to other baselines for all data sets except Workplace2016 for which LGNN performs better than shown in Qarkaxhija et al. [45]. The obtained standard deviations are also comparable to the work of [45].

However, HYPA-DBGNN outperforms all baselines, including DBGNN. Measured as relative improvement over DBGNN, the gain is smallest for Highschool2011 and Highschool2012 with $2.77\%$ and $2.27\%$, respectively. For StudentSMS and Workplace2016, the gain is about twice as large at $5.09\%$ and $4.58\%$, respectively. It is noteworthy that the baseline results for Workplace2016 are already at least $20\%$ better than for the other data sets, so the gain of $4.58\%$ is harder to achieve and brings the balanced accuracy close to the optimum. A remarkable result is the gain of $45.50\%$ for Hospital. Here, the baselines are the weakest compared to the other data sets, while with our approach only Workplace2016 is solved better.
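The reported gains are consistent, up to rounding, with relative improvements over DBGNN computed from the mean balanced accuracies in Table 2. The short check below recomputes them; the helper name `relative_gain` is hypothetical.

```python
# Mean balanced accuracies (in %) of DBGNN and HYPA-DBGNN, copied from Table 2.
dbgnn = {"Highschool2011": 61.54, "Highschool2012": 64.93, "Hospital": 52.50,
         "StudentSMS": 57.72, "Workplace2016": 84.42}
hypa_dbgnn = {"Highschool2011": 63.25, "Highschool2012": 66.41, "Hospital": 76.39,
              "StudentSMS": 60.66, "Workplace2016": 88.29}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

gains = {name: relative_gain(hypa_dbgnn[name], dbgnn[name]) for name in dbgnn}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```

Note that the gains are relative to DBGNN, not absolute differences in percentage points; for Hospital, for example, the absolute difference is 76.39 - 52.50 = 23.89 points.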

The inclusion of path anomalies is beneficial for all empirical data sets considered, but the gain depends on the particular data set. Here, the results for Hospital and Workplace2016 stand out.

Table 2: Comparison of HYPA-DBGNN with node representation learning and deep graph learning baselines for dynamic graphs. The table presents the balanced accuracy and its standard deviation for the models on empirical static node classification tasks for dynamic graphs that is obtained through the outlined experiments. The best results are marked. Results with different metrics are attached in Appendix F.
| Model | Highschool2011 | Highschool2012 | Hospital | StudentSMS | Workplace2016 |
| --- | --- | --- | --- | --- | --- |
| EVO | 43.68 ± 10.91 | 50.05 ± 7.30 | 25.83 ± 8.29 | 55.05 ± 6.39 | 26.50 ± 12.08 |
| HONEM | 59.00 ± 10.61 | 50.49 ± 9.31 | 39.44 ± 17.57 | 53.81 ± 7.28 | 83.17 ± 11.14 |
| DeepWalk | 54.64 ± 17.70 | 49.65 ± 12.97 | 24.58 ± 10.92 | 52.78 ± 7.83 | 20.54 ± 9.51 |
| Node2Vec | 54.64 ± 17.70 | 49.65 ± 12.97 | 24.58 ± 10.92 | 52.31 ± 7.70 | 20.54 ± 9.51 |
| GCN | 55.00 ± 13.37 | 59.35 ± 11.13 | 43.47 ± 9.03 | 54.50 ± 6.40 | 73.33 ± 12.60 |
| LGNN | 57.72 ± 9.85 | 51.43 ± 17.94 | 44.03 ± 9.03 | 52.71 ± 6.63 | 84.83 ± 14.77 |
| DBGNN | 61.54 ± 11.13 | 64.93 ± 15.26 | 52.50 ± 19.27 | 57.72 ± 5.29 | 84.42 ± 15.59 |
| HYPA-DBGNN | **63.25 ± 16.18** | **66.41 ± 10.24** | **76.39 ± 17.12** | **60.66 ± 6.11** | **88.29 ± 10.51** |

5.3 Similarities in Temporal Sequences Between Empirical and Synthetic Data

The synthetic data set encodes a pattern of increased class-assortativity that is learned by HYPA-DBGNN. Figure 2 shows the deviation from the expected edge frequencies in terms of HYPA scores for the used data sets regarding the incident nodes, i.e. for each node the distribution of the average HYPA score of incident edges is plotted.
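The per-node averaging behind Figure 2 can be sketched as follows. The helper names, the toy scores, and the toy class labels are hypothetical; the averaging follows the definition in the caption of Figure 2, where the scores of all edges pointing to a node are averaged.

```python
from collections import defaultdict
from statistics import fmean

def average_incident_score(edge_scores):
    """Average the scores of all edges (v_i, v_j) pointing to each node v_j."""
    incoming = defaultdict(list)
    for (_, target), score in edge_scores.items():
        incoming[target].append(score)
    return {node: fmean(scores) for node, scores in incoming.items()}

def scores_by_class(node_scores, node_class):
    """Group the per-node averages by class, ready for a box plot."""
    grouped = defaultdict(list)
    for node, score in node_scores.items():
        grouped[node_class[node]].append(score)
    return dict(grouped)

# Hypothetical scores of first-order edges and hypothetical class labels.
edge_scores = {("a", "c"): 0.9, ("b", "c"): 0.7, ("a", "d"): 0.2, ("c", "d"): 0.4}
node_class = {"c": 0, "d": 1}
per_node = average_incident_score(edge_scores)
per_class = scores_by_class(per_node, node_class)
```

Nodes without incoming edges do not appear in the result, which matches the fact that the average is undefined when the set of incident edges is empty.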

The second-order plot shows the increased class-assortativity for nodes of class 0 in the Weighted Sampling data set. The incident second-order edges have on average a larger HYPA score and are thus more often overrepresented than edges incident to nodes of class 1. Because the graph is inferred in a statistically principled way, HYPA-DBGNN is able to learn this pattern.

Hospital and Workplace2016 also exhibit such under- and overrepresented sequential patterns in both graphs that are related to distinct node classes. In Hospital, second-order edges incident to nodes of class 0 and 1 are very frequently overrepresented. However, the first-order edges incident to nodes of class 0 and 1 differ in their statistics. This observed connection between node classes and the sequential patterns containing the respective nodes supports the superior performance of HYPA-DBGNN for Hospital and Workplace2016.

In an ablation study in Appendix D, we also examine the impact of statistical information and different ways of formulating it; the results support its relevance.

Figure 2: Distribution of average HYPA scores of incident edges. For each node $v_j$ the average HYPA score is determined with $\overline{HYPA}^{(k)}(v_j) = \frac{1}{|S(v_j)|} \sum_{(v_i,v_j) \in S(v_j)} HYPA^{(k)}(v_i,v_j)$ with the incident edges $S(v_j) = \{(v_i,v_j) \in E^{(k)} : v_i \in V^{(k)}\}$. The box plots show the distribution of these scores with respect to node classes. The synthetic data set is Weighted Sampling.

6 Conclusion

In this work, we propose HYPA-DBGNN, a novel deep graph learning architecture that accounts for time-respecting paths in temporal graph data with high temporal resolution. Different from existing graph learning methods that employ neural message passing along time-respecting paths, we introduce a two-step approach which first infers anomalous sequential patterns based on an analytically tractable null model for time-respecting paths that preserves both the topology and the frequency, but not the temporal ordering, of time-stamped edges. In a second step, we apply neural message passing on an augmented higher-order De Bruijn graph, whose edges capture time-respecting paths that are overrepresented compared to the expectation from that random baseline. An experimental evaluation of our approach in a synthetic model and five empirical data sets on temporal graphs reveals that our proposed method considerably improves node classification compared to seven baseline methods in all studied data sets, with performance gains ranging from 2.27 % to 45.5 %. An investigation of HYPA scores – which capture the degree to which time-respecting path statistics deviate from what is expected in a null model – as well as an ablation study show that the correlation between node classes and the magnitude of the deviations from the random expectation is particularly pronounced for those empirical temporal graphs where we also observe the largest performance gains for our method. This finding highlights that the innovative combination of statistical inference and neural message passing, which is the key contribution of our work, leads to considerable advantages for temporal graph learning.

Despite these contributions, our work raises a number of open questions that we did not address within the scope of this work. First, in order to isolate the influence of sequential patterns in temporal graphs, here we solely focused on the sequence of time-stamped edges, thus neglecting additional node attributes and edge features. Future studies building on our work could thus additionally consider richer node and edge information, which is likely to further improve the performance of our model. Moreover, the framework of hypergeometric statistical ensembles allows the inclusion of non-homogeneous “edge propensities” based, e.g., on homophily among nodes with similar attributes. This could possibly be used to generate domain-specific null models leading to a graph learning architecture that includes a non-trivial inductive bias, which we did not explore in this work. Bridging the gap between the application of statistical graph ensembles in network science and deep graph learning, we finally argue that our work opens broader perspectives for the integration of statistical graph inference, graph augmentation, and neural message passing. In particular, applying our method to the inference of (first-order) edges in static graphs could be a promising approach to address the issue that empirical graphs are rarely unspoiled reflections of reality, but are often subject to measurement errors and noise. The need to combine graph inference techniques with neural message passing [36, 41, 61] has recently been identified as a major challenge for deep graph learning, and our work can be seen as a step in this direction.

Acknowledgement

Jan von Pichowski, Lisi Qarkaxhija, and Ingo Scholtes acknowledge funding from the German Federal Ministry of Education and Research (BMBF) via the Project "Software Campus 3.0", Grant No. (FKZ) 01IS24030, which is running from 01.04.2024 to 31.03.2026. Ingo Scholtes acknowledges funding through the Swiss National Science Foundation (SNF), Grant No. 176938.

References

  • Barbero et al. [2023] F. Barbero, A. Velingker, A. Saberi, M. Bronstein, and F. Di Giovanni. Locality-aware graph-rewiring in gnns. arXiv preprint arXiv:2310.01668, 2023.
  • Belth et al. [2020] C. Belth, F. Kamran, D. Tjandra, and D. Koutra. When to remember where you came from: node representation learning in higher-order networks. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’19, page 222–225, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450368681. doi: 10.1145/3341161.3342911. URL https://doi.org/10.1145/3341161.3342911.
  • Bevilacqua et al. [2021] B. Bevilacqua, F. Frasca, D. Lim, B. Srinivasan, C. Cai, G. Balamurugan, M. M. Bronstein, and H. Maron. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910, 2021.
  • Bodnar et al. [2021a] C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and lehman go cellular: Cw networks. Advances in neural information processing systems, 34:2625–2640, 2021a.
  • Bodnar et al. [2021b] C. Bodnar, F. Frasca, Y. Wang, N. Otter, G. F. Montufar, P. Lio, and M. Bronstein. Weisfeiler and lehman go topological: Message passing simplicial networks. In International Conference on Machine Learning, pages 1026–1037. PMLR, 2021b.
  • Casiraghi and Nanumyan [2021] G. Casiraghi and V. Nanumyan. Configuration models as an urn problem. Scientific Reports, 11(1), June 2021. ISSN 2045-2322. doi: 10.1038/s41598-021-92519-y. URL http://dx.doi.org/10.1038/s41598-021-92519-y.
  • Chen et al. [2020] Z. Chen, X. Li, and J. Bruna. Supervised community detection with line graph neural networks, 2020.
  • Chien et al. [2021] E. Chien, C. Pan, J. Peng, and O. Milenkovic. You are allset: A multiset function framework for hypergraph neural networks. arXiv preprint arXiv:2106.13264, 2021.
  • Cotta et al. [2021] L. Cotta, C. Morris, and B. Ribeiro. Reconstruction for powerful graph representations. Advances in Neural Information Processing Systems, 34:1713–1726, 2021.
  • De Bruijn [1946] N. G. De Bruijn. A combinatorial problem. Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie van Wetenschappen te Amsterdam, 49(7):758–764, 1946.
  • Errica et al. [2019] F. Errica, M. Podda, D. Bacciu, and A. Micheli. A fair comparison of graph neural networks for graph classification. CoRR, abs/1912.09893, 2019. URL http://arxiv.org/abs/1912.09893.
  • Fatemi et al. [2021] B. Fatemi, L. El Asri, and S. M. Kazemi. Slaps: Self-supervision improves structure learning for graph neural networks. Advances in Neural Information Processing Systems, 34:22667–22681, 2021.
  • Feng et al. [2020] W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems, 33:22092–22103, 2020.
  • Fournet and Barrat [2014] J. Fournet and A. Barrat. Contact patterns among high school students. PLoS ONE, 9(9):e107878, Sept. 2014. ISSN 1932-6203. doi: 10.1371/journal.pone.0107878. URL http://dx.doi.org/10.1371/journal.pone.0107878.
  • Franceschi et al. [2019] L. Franceschi, M. Niepert, M. Pontil, and X. He. Learning discrete structures for graph neural networks. In International conference on machine learning, pages 1972–1982. PMLR, 2019.
  • Gasteiger et al. [2019] J. Gasteiger, S. Weißenberger, and S. Günnemann. Diffusion improves graph learning. Advances in neural information processing systems, 32, 2019.
  • Georgiev et al. [2022] D. Georgiev, M. Brockschmidt, and M. Allamanis. Heat: Hyperedge attention networks. arXiv preprint arXiv:2201.12113, 2022.
  • Grover and Leskovec [2016] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
  • Génois et al. [2015] M. Génois, C. L. Vestergaard, J. Fournet, A. Panisson, I. Bonmarin, and A. Barrat. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science, 3(3):326–347, Mar. 2015. ISSN 2050-1250. doi: 10.1017/nws.2015.10. URL http://dx.doi.org/10.1017/nws.2015.10.
  • Hajij et al. [2022] M. Hajij, G. Zamzmi, T. Papamarkou, N. Miolane, A. Guzmán-Sáenz, K. N. Ramamurthy, T. Birdal, T. K. Dey, S. Mukherjee, S. N. Samaga, et al. Topological deep learning: Going beyond graph data. arXiv preprint arXiv:2206.00606, 2022.
  • Hajiramezanali et al. [2019] E. Hajiramezanali, A. Hasanzadeh, K. Narayanan, N. Duffield, M. Zhou, and X. Qian. Variational graph recurrent neural networks. Advances in neural information processing systems, 32, 2019.
  • Hamilton et al. [2018] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs, 2018.
  • Huang and Yang [2021] J. Huang and J. Yang. Unignn: a unified framework for graph and hypergraph neural networks. arXiv preprint arXiv:2105.00956, 2021.
  • Hwang et al. [2021] E. Hwang, V. Thost, S. S. Dasgupta, and T. Ma. Revisiting virtual nodes in graph neural networks for link prediction. 2021.
  • James et al. [2013] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013. URL https://faculty.marshall.usc.edu/gareth-james/ISL/.
  • Jiang et al. [2019] B. Jiang, Z. Zhang, D. Lin, J. Tang, and B. Luo. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11313–11320, 2019.
  • Jin et al. [2020] W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 66–74, 2020.
  • Kazi et al. [2022] A. Kazi, L. Cosmo, S.-A. Ahmadi, N. Navab, and M. M. Bronstein. Differentiable graph module (dgm) for graph convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1606–1617, 2022.
  • Kipf and Welling [2017] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks, 2017.
  • Kreuzer et al. [2021] D. Kreuzer, D. Beaini, W. Hamilton, V. Létourneau, and P. Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
  • LaRock et al. [2020] T. LaRock, V. Nanumyan, I. Scholtes, G. Casiraghi, T. Eliassi-Rad, and F. Schweitzer. HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks, pages 460–468. Society for Industrial and Applied Mathematics, Jan. 2020. doi: 10.1137/1.9781611976236.52. URL http://dx.doi.org/10.1137/1.9781611976236.52.
  • Liben-Nowell and Kleinberg [2007] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007. ISSN 1532-2890. doi: 10.1002/asi.20591. URL http://dx.doi.org/10.1002/asi.20591.
  • Liu et al. [2022] Y. Liu, Z. Zhang, Y. Liu, and Y. Zhu. Gatsmote: Improving imbalanced node classification on graphs via attention and homophily. Mathematics, 10(11), 2022. ISSN 2227-7390. doi: 10.3390/math10111799. URL https://www.mdpi.com/2227-7390/10/11/1799.
  • Longa et al. [2023] A. Longa, V. Lachi, G. Santin, M. Bianchini, B. Lepri, P. Lio, F. Scarselli, and A. Passerini. Graph neural networks for temporal graphs: State of the art, open challenges, and opportunities. arXiv preprint arXiv:2302.01018, 2023.
  • Lu et al. [2024] J. Lu, Y. Xu, H. Wang, Y. Bai, and Y. Fu. Latent graph inference with limited supervision. Advances in Neural Information Processing Systems, 36, 2024.
  • Ma et al. [2019] J. Ma, W. Tang, J. Zhu, and Q. Mei. A flexible generative framework for graph-based semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Mialon et al. [2021] G. Mialon, D. Chen, M. Selosse, and J. Mairal. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021.
  • Molloy and Reed [1995] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Struct. Alg., 6(2-3):161–180, Mar. 1995. ISSN 1098-2418. doi: 10.1002/rsa.3240060204.
  • Monti et al. [2018] F. Monti, K. Otness, and M. M. Bronstein. Motifnet: a motif-based graph convolutional network for directed graphs. In 2018 IEEE data science workshop (DSW), pages 225–228. IEEE, 2018.
  • Morris et al. [2020] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. Tudataset: A collection of benchmark datasets for learning with graphs. CoRR, abs/2007.08663, 2020. URL https://arxiv.longhoe.net/abs/2007.08663.
  • Pal et al. [2020] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, Y. Xu, and M. Coates. Non parametric graph learning for bayesian graph neural networks. In Conference on uncertainty in artificial intelligence, pages 1318–1327. PMLR, 2020.
  • Perozzi et al. [2014] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’14. ACM, Aug. 2014. doi: 10.1145/2623330.2623732. URL http://dx.doi.org/10.1145/2623330.2623732.
  • Pham et al. [2017] T. Pham, T. Tran, H. Dam, and S. Venkatesh. Graph classification via deep learning with virtual nodes. arXiv preprint arXiv:1708.04357, 2017.
  • Phan et al. [2023] H. T. Phan, N. T. Nguyen, and D. Hwang. Fake news detection: A survey of graph neural network methods. Appl. Soft Comput., 139(C), may 2023. ISSN 1568-4946. doi: 10.1016/j.asoc.2023.110235. URL https://doi.org/10.1016/j.asoc.2023.110235.
  • Qarkaxhija et al. [2022] L. Qarkaxhija, V. Perri, and I. Scholtes. De bruijn goes neural: Causality-aware graph neural networks for time series data on dynamic graphs, 2022.
  • Rong et al. [2019] Y. Rong, W. Huang, T. Xu, and J. Huang. Dropedge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903, 2019.
  • Rossi et al. [2020] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020.
  • Saebi et al. [2020] M. Saebi, G. L. Ciampaglia, L. M. Kaplan, and N. V. Chawla. Honem: Learning embedding for higher order networks. Big Data, 8(4):255–269, Aug. 2020. ISSN 2167-647X. doi: 10.1089/big.2019.0169. URL http://dx.doi.org/10.1089/big.2019.0169.
  • Sankar et al. [2020] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th international conference on web search and data mining, pages 519–527, 2020.
  • Sapiezynski et al. [2019] P. Sapiezynski, A. Stopczynski, D. Lassen, and S. Lehmann. Interaction data from the copenhagen networks study. Scientific data, 6, Dec. 2019. ISSN 2052-4463. doi: 10.1038/s41597-019-0325-x.
  • Scholtes [2017] I. Scholtes. When is a network a network? multi-order graphical model selection in pathways and temporal networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1037–1046, 2017.
  • Stokes et al. [2020] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J. Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay, and J. J. Collins. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702.e13, 2020. ISSN 0092-8674. doi: https://doi.org/10.1016/j.cell.2020.01.021. URL https://www.sciencedirect.com/science/article/pii/S0092867420301021.
  • Topping et al. [2021] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522, 2021.
  • Vanhems et al. [2013] P. Vanhems, A. Barrat, C. Cattuto, J.-F. Pinton, N. Khanafer, C. Régis, B.-a. Kim, B. Comte, and N. Voirin. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLoS ONE, 8(9):e73970, Sept. 2013. ISSN 1932-6203. doi: 10.1371/journal.pone.0073970. URL http://dx.doi.org/10.1371/journal.pone.0073970.
  • Veličković et al. [2018] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks, 2018.
  • Wang et al. [2020] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi. Graphcrop: Subgraph cropping for graph classification. arXiv preprint arXiv:2009.10564, 2020.
  • Xu et al. [2020] D. Xu, C. Ruan, E. Körpeoglu, S. Kumar, and K. Achan. Inductive representation learning on temporal graphs. CoRR, abs/2002.07962, 2020. URL https://arxiv.longhoe.net/abs/2002.07962.
  • Ying et al. [2021] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877–28888, 2021.
  • You et al. [2020] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
  • Zhang et al. [2021] X.-M. Zhang, L. Liang, L. Liu, and M.-J. Tang. Graph neural networks and their current applications in bioinformatics. Frontiers in Genetics, 12, 2021. ISSN 1664-8021. doi: 10.3389/fgene.2021.690049. URL https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.690049.
  • Zhang et al. [2019] Y. Zhang, S. Pal, M. Coates, and D. Ustebay. Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 5829–5836, 2019.
  • Zhao et al. [2021a] J. Zhao, Y. Dong, M. Ding, E. Kharlamov, and J. Tang. Adaptive diffusion in graph neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021a. URL https://openreview.net/forum?id=0Kb33DHJ1g.
  • Zhao et al. [2021b] L. Zhao, W. Jin, L. Akoglu, and N. Shah. From stars to subgraphs: Uplifting any gnn with local structure awareness. arXiv preprint arXiv:2110.03753, 2021b.
  • Zhao et al. [2021c] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah. Data augmentation for graph neural networks. In Proceedings of the aaai conference on artificial intelligence, volume 35, pages 11015–11023, 2021c.
  • Zhao et al. [2021d] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah. Data augmentation for graph neural networks. In Proceedings of the aaai conference on artificial intelligence, volume 35, pages 11015–11023, 2021d.
  • Zhao et al. [2022] T. Zhao, W. Jin, Y. Liu, Y. Wang, G. Liu, S. Günnemann, N. Shah, and M. Jiang. Graph data augmentation for graph machine learning: A survey. arXiv preprint arXiv:2202.08871, 2022.

Appendix A Data

In this section, we describe the creation of the synthetic data sets, their characteristics, and the properties of the empirical data sets used.

A.1 Synthetic Data Creation Procedure

We use two synthetic data sets that are created with the following procedure. Figure 3 gives an overview of the procedure.

The algorithm consists of two main parts aimed at constructing the first-order and second-order topology of the network, respectively. Initially, the algorithm receives as input parameters the set of nodes, a node-to-class mapping, a bias parameter, and the desired number of paths of length $k$ ($k$-th order edges) to generate.

In the first part, we assign the node degrees, and consequently the values of the $\Xi$ matrix as $\Xi = k_{in} \cdot k_{out}$. To do this, we give each node a random weight sampled from a continuous uniform distribution $\mathcal{U}[0,1]$. Next, for each node, we sample a number of (unweighted) edge stubs from a multinomial distribution. The number of categories in the multinomial distribution equals the number of nodes, and the probability of each category, and thus of the corresponding edge stub, is proportional to the previously assigned node weight. The number of stubs we sample equals the desired number of paths of length $k$ given as input. We then randomly connect the in- and out-stubs, which yields the multi-set of multi-edges and the first-order topology. Notice that the multi-edges created in this step also yield the higher-order nodes, and that the multi-edge frequencies correspond to their weighted in- and out-degrees.
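The first part of the procedure can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `sample_first_order` is hypothetical, in- and out-stubs are assumed to be drawn from the same node weights, and $\Xi = k_{in} \cdot k_{out}$ is read entry-wise as $\Xi_{uv} = k_{out}(u) \cdot k_{in}(v)$.

```python
import random
from collections import Counter

def sample_first_order(nodes, num_paths, seed=0):
    """Sample stubs per node and pair them at random (illustrative sketch)."""
    rng = random.Random(seed)
    # Random node weights drawn from U[0, 1].
    weights = [rng.random() for _ in nodes]
    # Drawing num_paths stubs with probability proportional to the weights
    # corresponds to one multinomial draw over the nodes.
    # Assumption: in- and out-stubs use the same node weights.
    out_stubs = rng.choices(nodes, weights=weights, k=num_paths)
    in_stubs = rng.choices(nodes, weights=weights, k=num_paths)
    # Randomly connect out- and in-stubs to obtain the multi-set of multi-edges.
    rng.shuffle(in_stubs)
    multi_edges = Counter(zip(out_stubs, in_stubs))
    k_out = Counter(out_stubs)
    k_in = Counter(in_stubs)
    # Assumed entry-wise reading: Xi[(u, v)] = k_out[u] * k_in[v].
    xi = {(u, v): k_out[u] * k_in[v] for u in nodes for v in nodes}
    return multi_edges, k_out, k_in, xi

multi_edges, k_out, k_in, xi = sample_first_order(["A", "B", "C", "D"], 100)
```

The multiplicities in `multi_edges` sum to the requested number of paths, and each multi-edge doubles as a second-order node in the next step.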

In the second part, an iterative process creates higher-order edges. First, an out-stub $(\langle v_0 v_1 \dots v_{k-1} \rangle, \cdot)$ is sampled proportional to its weighted out-degree. Subsequently, a set $P$ of potential in-stubs $(\cdot, \langle v_1 \dots v_{k-1} v_k \rangle)$ is identified, ensuring valid connections between higher-order nodes by applying the De Bruijn condition, which requires the last $k-1$ elements of $(\langle v_0 v_1 \dots v_{k-1} \rangle, \cdot)$ to match the first $k-1$ elements of $(\cdot, \langle v_1 \dots v_{k-1} v_k \rangle)$. The sampling process for successor in-stubs from $P$ is biased based on the classes of the first-order nodes $v_0$, $v_1 \dots v_{k-1}$, and $v_k$. Specifically, counts are artificially inflated by the bias parameter for in-stubs where all $k$ nodes belong to the same class, encoding the desired pattern of preferential attachment. The selected out-stub $(\langle v_0 v_1 \dots v_{k-1} \rangle, \cdot)$ and in-stub $(\cdot, \langle v_1 \dots v_{k-1} v_k \rangle)$ form a higher-order edge $(\langle v_0 v_1 \dots v_{k-1} \rangle, \langle v_1 \dots v_{k-1} v_k \rangle)$ in the final network. This iterative process continues until all stubs are connected, resulting in paths of length $k$ that predominantly connect nodes within the same class, with the degree of class-assortativity controlled by the bias parameter.
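The biased stub matching described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `sample_higher_order_edges` is hypothetical, the bias is assumed to multiply the remaining in-stub count, and out-stubs without a valid partner are assumed to be dropped.

```python
import random

def sample_higher_order_edges(multi_edges, node_class, bias, seed=0):
    """Match out-stubs to in-stubs of higher-order nodes with a class bias.

    multi_edges: dict mapping a higher-order node (tuple of first-order
                 nodes) to its multiplicity, used as its stub count.
    """
    rng = random.Random(seed)
    out_stubs = dict(multi_edges)  # remaining out-stubs per higher-order node
    in_stubs = dict(multi_edges)   # remaining in-stubs per higher-order node
    edges = []
    while True:
        sources = [u for u, c in out_stubs.items() if c > 0]
        if not sources:
            break
        # Sample an out-stub proportional to the remaining weighted out-degree.
        u = rng.choices(sources, weights=[out_stubs[s] for s in sources])[0]
        out_stubs[u] -= 1
        # De Bruijn condition: last k-1 elements of u match first k-1 of w.
        candidates = [w for w, c in in_stubs.items()
                      if c > 0 and w[:-1] == u[1:]]
        if not candidates:
            continue  # assumption: unmatched out-stubs are dropped
        weights = []
        for w in candidates:
            # All first-order nodes v_0 ... v_k on the resulting path.
            same_class = len({node_class[v] for v in u + w[-1:]}) == 1
            # Assumption: the bias multiplies the remaining stub count.
            weights.append(in_stubs[w] * (bias if same_class else 1.0))
        w = rng.choices(candidates, weights=weights)[0]
        in_stubs[w] -= 1
        edges.append((u, w))
    return edges
```

With `bias=1.0` this reduces to unbiased stub matching, which corresponds to the data set without an implanted pattern.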

Figure 3: This figure presents the sampling procedure for the synthetic path data. It consists of five steps (left to right): (1) sampling of first-order nodes (uniform distribution) from a set with two classes (blue and orange); (2) combining the sampled nodes into second-order nodes; (3) sampling out-connection candidates from the set of second-order nodes (e.g., $(\langle A, A \rangle, \cdot)$, highlighted in green); (4) sampling in-connections: for every out-stub, we sample a valid in-stub (e.g., for $(\langle A, A \rangle, \cdot)$: $(\cdot, \langle A, C \rangle)$ or $(\cdot, \langle A, A \rangle)$, highlighted in grey). Valid in-stubs whose nodes belong to the same group have a 5% increased probability of being sampled ($(\cdot, \langle A, A \rangle)$ gets the bonus while $(\cdot, \langle A, C \rangle)$ does not); (5) saving the edges as paths (e.g., $\langle A, A, A \rangle$).

A.2 Synthetic Data Characteristics

We use two synthetic data sets with $n = 2^{22}$ paths. The paths emit first- and second-order graphs with heterogeneous edge statistics. Figure 4 presents the edge statistics for the synthetic data sets. At the given resolution, the graph for the synthetic data set with the implanted pattern looks identical to the one without the pattern. This is because both are constructed from the same expected first-order statistics, which define the $\Xi$-matrix for the second-order statistics, and because the bias parameter used during sampling is sufficiently small. However, the emitted edge frequencies vary between the emitted graphs due to the random sampling procedure.

(a) First-order edge statistics. They are equal for both the Weighted and the Unweighted data set. The differences are not observable by comparing them with the expected first-order edge statistics. The heterogeneous distribution is clearly visible.
(b) Second-order edge statistics. The frequencies in the Weighted and Unweighted data sets differ but it is not visible due to the heterogeneous distribution. They also differ from the expected frequencies. We also distinguish between paths connecting same class nodes and different class nodes. Here it becomes clear that the mere frequencies – that are skewed – are not enough to distinguish between both cases.
Figure 4: Edge frequencies of the emitted graphs for the synthetic data sets. The plots show that, due to the heterogeneous distribution, overrepresented paths do not become visible. Figure 5 gives a zoomed-in view to show the differences exploited by HYPA-DBGNN.

Figure 5 presents the absolute difference of the first- and second-order edge frequencies between the two data sets. Notably, all edges whose incident nodes are predominantly connected are sampled more often due to the bias parameter. This class-assortativity needs to be learned by the machine learning model.

Figure 5: This plot presents the absolute difference of the second-order edge frequencies of the Weighted and Unweighted data sets. Due to the random sampling, there are edges that have a higher frequency in one or the other data set. This trend increases for edges that have more candidates in the urn. The edges that represent paths connecting nodes from the same class are mostly sampled more often and are thus overrepresented in the Weighted data set. However, compared to the absolute frequencies in Figure 4, the deviations are minor, such that edges with low frequencies can be overrepresented. HYPA-DBGNN learns this pattern.

The comparison in Figure 6 of the observed frequencies with the expected frequencies given by the $\Xi$-matrix supports the differences between the two synthetic data sets and highlights the class-assortativity encoded in the data set with biased sampling.

(a) Relative frequency difference of the same second-order paths between the Unweighted Sampling and Weighted Sampling data. Paths connecting same class nodes on average have a higher frequency in the Weighted Sampling data set. This is consistent with Figure 5.
   
(b) Relative frequency difference of the same second-order paths between the Unweighted Sampling data and the expected path frequencies. Here, no bias parameter is applied. Thus, the frequencies of paths connecting same class nodes vary as much from the expected frequencies as those of the other paths.
(c) Relative frequency difference of the same second-order paths between the Weighted Sampling data and the expected path frequencies. An increased bias parameter is applied. Thus, paths connecting same class nodes appear more often, relative to the expected frequencies, than the other paths.
Figure 6: Box plots showing how the distribution of second-order path frequencies varies between the two synthetic data sets and in comparison to the expected path frequencies. Only in the Weighted Sampling data set do the paths connecting same class nodes appear more often than the remaining paths.

A.3 Properties of Empirical Data

Table 3: Overview of time series data and ground truth node classes used in the experiments. $\delta$ describes the maximum time difference for edges to be considered part of a causal walk.
| Data Set | Ref. | $\lvert V \rvert$ | $\lvert E \rvert$ | $\lvert V^{(2)} \rvert$ | $\lvert E^{(2)} \rvert$ | Classes (Sizes) | $\delta$ |
|---|---|---|---|---|---|---|---|
| Highschool2011 | [14] | 126 | 3355 | 3042 | 17141 | 2 (85/41) | 4 |
| Highschool2012 | [14] | 180 | 4399 | 3965 | 20614 | 2 (132/48) | 4 |
| Hospital | [54] | 75 | 2052 | 2028 | 15500 | 4 (29/27/11/8) | 4 |
| StudentSMS | [50] | 429 | 1160 | 733 | 846 | 2 (314/115) | 40 |
| Workplace2016 | [19] | 92 | 1491 | 1431 | 7121 | 5 (34/26/15/13/4) | 4 |

Appendix B Comments on Computational Complexity

There are two distinct steps to consider when analyzing the complexity of our approach. First, there is the preprocessing step that creates the augmented graphs, i.e., the computation of the HYPA scores and the removal of the under-represented paths. Second, the graph neural network is trained on those graphs. For both steps, the complexity is determined by the number of edges in the higher-order De Bruijn graph. In the preprocessing, we calculate the HYPA score for each higher-order edge.

The worst case for the number of higher-order edges is given by the number of different sequences of length $k$, i.e., $|V|^k$ for a network with $|V|$ nodes. However, two arguments show that we can expect much lower complexity in real-world data. First of all, real-world networks are usually sparse, which implies that most sequences cannot occur as they would otherwise violate the network topology.

LaRock et al. [31] use this argument and prove that the complexity of their algorithm can be tightened with $\Delta G^{(k)} \leq |V|^2 \lambda_1^k$, where $|V|$ denotes the number of nodes in the first-order graph $G$ and $\lambda_1$ is the leading eigenvalue of the binary adjacency matrix of $G$. They conclude that the HYPA score calculation scales linearly with the number of paths $N$ in the given data set for sparse real-world graphs, a moderate order $k$, and a sufficiently large $N$. The authors of [45] also use the argument of sparsity to further limit the complexity of the De Bruijn graph. They note that the number of walks of length $k$ becoming higher-order edges in the higher-order De Bruijn graph is also limited by $\sum_{ij} A^k_{ij} \leq |V|^k$, where $A^k$ is the $k$-th power of the binary adjacency matrix $A$ of $G$.

Furthermore, higher-order networks are even sparser than what we would expect based on the first-order topology. This is because the number of different time-respecting paths occurring on a network is generally much lower than the number of possible paths. The authors of [45] demonstrate this (see their appendix) by plotting the number of realized walks at each length and showing that in empirical graphs only a small fraction of walks is realized due to the restriction to time-respecting paths. By studying the complexity of the empirical data sets used, they argue that De Bruijn graphs are applicable to real-world tasks.

We consider a path data set $S$ with $N$ entries. The number of edges in the $k$-th-order De Bruijn graph is denoted as $\Delta G^{(k)}$. LaRock et al. [31] state that the asymptotic runtime of HYPA is $O(N + \Delta G^{(k)})$. A trivial upper bound for $\Delta G^{(k)}$ is the fully connected case with $|V|^{k+1}$. This trivial case is also considered by [45] when they argue that the complexity of message passing on the De Bruijn graph is bounded.
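To make the sparsity argument concrete, the following sketch counts the walks with $k$ edges in a small directed graph, i.e., the quantity $\sum_{ij} A^k_{ij}$, and compares it with the trivial fully connected bound $|V|^{k+1}$. The helper `count_walks` and the toy adjacency matrix are illustrative assumptions and do not appear in the cited works.

```python
def count_walks(adj, k):
    """Return sum_ij (A^k)_ij, the number of walks with k edges."""
    n = len(adj)
    # walks[j] = number of walks of the current length that end in node j
    walks = [1] * n
    for _ in range(k):
        walks = [sum(walks[i] * adj[i][j] for i in range(n)) for j in range(n)]
    return sum(walks)

# Toy graph (assumption): directed cycle 0 -> 1 -> 2 -> 3 -> 0 plus chord 0 -> 2.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

k = 2
realizable = count_walks(A, k)    # 6 walks with two edges
trivial_bound = len(A) ** (k + 1)  # 4**3 = 64 in the fully connected case
```

Even in this tiny example, the topology admits only 6 of the 64 sequences allowed by the trivial bound. The restriction to time-respecting paths can only reduce the number of realized higher-order edges further.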

Appendix C Variants of HYPA-DBGNN

In this section, we present other variations of our main HYPA-DBGNN architecture.

C.1 Base Architecture without Anomalies (HYPA-DBGNN-)

Replacing the HYPA scores with the absolute edge frequencies in the message passing procedure leads to the original message passing layers proposed by Kipf and Welling [29]. The overall structure including the bipartite layers is kept. The comparison of this model (HYPA-DBGNN-) with HYPA-DBGNN reinforces the understanding of the significance of HYPA scores.

C.2 Edge Embedded HYPA Scores (HYPA-DBGNNE)

For HYPA-DBGNN, the HYPA scores are used in a graph model selection step to enhance the message passing. For HYPA-DBGNNE, in contrast, the HYPA scores are treated as additional edge attributes whose significance is learned by an adapted graph convolution operation that embeds the edge attributes into the incident node attributes during message passing in the first graph neural network layers. The augmented propagation rule is given as

$$\vec{h}_{v_i}^{k,1} = \sigma\left( \sum_{j} \frac{1}{c_{ij}} \left( \vec{h}_{v_j}^{k,0} W^{k,1} + \vec{h}_{e_{ij}^{k}} W^{k,e} \right) \right), \tag{4}$$

with the first hidden representation $\vec{h}_{v_j}^{k,0}$ of node $v_j \in V^{(k)}$, the inferred HYPA scores in $\vec{h}_{e_{ij}^{k}}$ for the $k$-th-order edge $e^{k}_{ij} \in E^{(k)}$, the trainable weight matrices $W^{k,1} \in \mathbb{R}^{H^1 \times H^0}$ for the nodes and $W^{k,e} \in \mathbb{R}^{H^1 \times 1}$ for the edges, and the normalization factor $c_{ij}$ as defined by Kipf and Welling [29].
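A dependency-free sketch of one such propagation step is given below. It is an illustration only: the function name `edge_embedded_layer` is hypothetical, $\sigma$ is assumed to be a ReLU, and $c_{ij}$ is assumed to be the GCN-style factor $\sqrt{d_i d_j}$ with degrees counted including a self-loop.

```python
import math

def edge_embedded_layer(h, edges, scores, W, W_e):
    """One propagation step in the spirit of Eq. (4) (illustrative sketch).

    h:      dict node -> list of H0 input features
    edges:  list of (j, i) pairs; messages flow from j to i
    scores: dict (j, i) -> scalar HYPA score of the edge
    W:      H1 x H0 weight matrix (list of rows) for node features
    W_e:    list of H1 weights for the scalar edge score
    """
    # Assumption: in-degrees counted with an added self-loop.
    deg = {v: 1 for v in h}
    for j, i in edges:
        deg[i] += 1
    out = {v: [0.0] * len(W) for v in h}
    for j, i in edges:
        c_ij = math.sqrt(deg[i] * deg[j])  # assumed GCN-style normalization
        for r, row in enumerate(W):
            node_msg = sum(w * x for w, x in zip(row, h[j]))
            edge_msg = scores[(j, i)] * W_e[r]
            out[i][r] += (node_msg + edge_msg) / c_ij
    # Assumption: sigma is a ReLU.
    return {v: [max(0.0, x) for x in vec] for v, vec in out.items()}

h = {"a": [1.0, 2.0], "b": [0.5, -1.0]}
edges = [("a", "b")]
scores = {("a", "b"): 0.9}
W = [[1.0, 0.0], [0.0, 1.0]]
W_e = [1.0, -3.0]
h_next = edge_embedded_layer(h, edges, scores, W, W_e)
```

Because the edge score enters through its own weight vector, the layer can learn how strongly the HYPA score should influence each hidden dimension. It does not remove edges from the graph.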

C.3 Z-Score as Replacement for HYPA Scores (HYPA-DBGNNZ)

The HYPA scores are based on the CDF. As a replacement for the CDF-based HYPA score, HYPA-DBGNNZ uses a transformed Z-score. The underlying soft configuration model provides the required expected value and variance with

$$\mathbb{E}[X_{ij}] = m \frac{\Xi_{ij}}{M} \tag{5}$$

and

$$Var[X_{ij}] = m \frac{M - m}{M - 1} \frac{\Xi_{ij}}{M} \tag{6}$$

needed to define the Z-score as

$$z(A_{ij}) = \frac{A_{ij} - \mathbb{E}[X_{ij}]}{\sqrt{Var[X_{ij}]}}. \tag{7}$$

In contrast to the HYPA score, the Z-score is unbounded and possibly negative. Edges with a negative Z-score are excluded because they are under-represented. Likewise, in HYPA-DBGNN under-represented edges are removed in most cases, because their HYPA scores are approximately zero. Additionally, edges with a Z-score smaller than one are removed by the same argument: they do not make an unexpectedly large contribution to the graph and are larger than 0 only due to noisy fluctuations in the frequencies. The resulting restricted Z-score is logarithmically transformed because of the large spread observed in empirical data, leading to the final replacement for the HYPA score:

$$z^{\prime}(e_{ij}) = \begin{cases} 0 & \text{if } z(e_{ij}) < 1, \\ \log(z(e_{ij})) & \text{otherwise} \end{cases} \tag{8}$$
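Equations (5) to (8) translate directly into a short computation. The sketch below is illustrative: the function names are hypothetical, the symbols `m`, `M` and `xi_ij` are used exactly as in Eqs. (5) and (6), and the numeric values in the example are made up for demonstration.

```python
import math

def z_score(a_ij, xi_ij, m, M):
    """Z-score of an observed edge frequency, following Eqs. (5)-(7)."""
    mean = m * xi_ij / M                     # Eq. (5)
    var = m * (M - m) / (M - 1) * xi_ij / M  # Eq. (6)
    return (a_ij - mean) / math.sqrt(var)    # Eq. (7)

def transformed_z_score(a_ij, xi_ij, m, M):
    """Restricted, log-transformed Z-score of Eq. (8)."""
    z = z_score(a_ij, xi_ij, m, M)
    return 0.0 if z < 1 else math.log(z)

# Made-up example values: the expected frequency is m * xi_ij / M = 5.
m, M, xi_ij = 100, 1000, 50
under = transformed_z_score(3, xi_ij, m, M)   # below expectation -> 0.0
noise = transformed_z_score(6, xi_ij, m, M)   # z < 1 -> 0.0
over = transformed_z_score(12, xi_ij, m, M)   # clearly over-represented -> > 0
```

Edges whose transformed score is zero carry no weight in the message passing, which mirrors the removal of under-represented edges in HYPA-DBGNN.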

Appendix D Ablation Study: Impact of Statistical Information

We conduct an ablation study in which we compare our architectures HYPA-DBGNN, HYPA-DBGNNE and HYPA-DBGNNZ to the base architecture HYPA-DBGNN- that is not using statistical information. We aim to answer the question of what effect the addition of statistical information has on the prediction capability of the architectures in Table 4.

By comparing HYPA-DBGNN to HYPA-DBGNN-, we see that the statistical information plays an important role for all data sets. Most importantly, it becomes visible that the improvements for Hospital are indeed related to the additional information.

HYPA-DBGNNE with edge-encoded statistical features performs better than the uninformed baseline but is in most cases significantly weaker than HYPA-DBGNN. The structural graph correction applied in HYPA-DBGNN is still missing, even when the edge encoder is able to learn the significance of the HYPA scores. HYPA-DBGNNZ performs poorly for data sets where we do not see direct patterns in the analysis but works well for Hospital. It remains to be explored why the Z-score is more susceptible to data sets with weak or no patterns.

Table 4: Ablation study for HYPA-DBGNN. The best results are marked.
| Model | Highschool2011 | Highschool2012 | Hospital | StudentSMS | Workplace2016 |
| --- | --- | --- | --- | --- | --- |
| HYPA-DBGNN | **63.25 ± 16.18** | **66.41 ± 10.24** | **76.39 ± 17.12** | **60.66 ± 6.11** | 88.29 ± 10.51 |
| HYPA-DBGNNE | 61.54 ± 13.62 | 64.94 ± 17.71 | 59.03 ± 12.72 | 60.46 ± 9.42 | **88.50 ± 13.57** |
| HYPA-DBGNNZ | 53.97 ± 17.59 | 59.63 ± 15.74 | 69.31 ± 11.74 | 53.45 ± 7.50 | 88.42 ± 10.88 |
| HYPA-DBGNN- | 57.67 ± 17.16 | 64.49 ± 15.27 | 55.83 ± 19.27 | 56.23 ± 10.41 | 86.46 ± 12.65 |
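To make the comparison with the uninformed baseline explicit, the following sketch computes the difference in mean balanced accuracy between each informed variant and HYPA-DBGNN-, using the mean values from Table 4. The variable names are chosen for illustration, and the standard deviations are ignored, so the differences are descriptive only and say nothing about statistical significance.

```python
datasets = ["Highschool2011", "Highschool2012", "Hospital",
            "StudentSMS", "Workplace2016"]

# Mean balanced accuracy per model, copied from Table 4.
table4 = {
    "HYPA-DBGNN":  [63.25, 66.41, 76.39, 60.66, 88.29],
    "HYPA-DBGNNE": [61.54, 64.94, 59.03, 60.46, 88.50],
    "HYPA-DBGNNZ": [53.97, 59.63, 69.31, 53.45, 88.42],
    "HYPA-DBGNN-": [57.67, 64.49, 55.83, 56.23, 86.46],
}

baseline = table4["HYPA-DBGNN-"]

# Gain in percentage points of each informed variant over the baseline.
gains = {
    model: {d: round(v - b, 2) for d, v, b in zip(datasets, values, baseline)}
    for model, values in table4.items()
    if model != "HYPA-DBGNN-"
}

# Data set on which HYPA-DBGNN gains the most over the baseline.
largest_gain = max(datasets, key=lambda d: gains["HYPA-DBGNN"][d])
print(largest_gain)  # Hospital
```

In this reading, HYPA-DBGNN and HYPA-DBGNNE improve on the baseline mean for every data set, whereas HYPA-DBGNNZ falls below it on three of the five.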

Appendix E Experiment Resources and Reproducibility

We performed the experiments on a single PC with an NVIDIA GeForce RTX 3070 with 8 GB of memory. A single experiment repetition takes approximately 5 minutes on average, depending on the method and the data set. We ran 4 experiments in parallel. We tested the 11 methods (8 in the main study, 3 in the ablation study) with a parameter search over at most 25 variants on 7 data sets (5 empirical, 2 synthetic). In total, the estimated time for the experiments is approximately 400 hours, excluding pre-studies. While this is only a rough estimate, it reflects the order of magnitude of the time needed to run all experiments.

To reproduce the experiments, we provide a reference implementation at https://github.com/jvpichowski/HYPA-DBGNN together with the synthetic and empirical data sets, their splits and their licenses. The baseline implementations are reused, with attribution, from the implementations accompanying the DBGNN reference paper [45], which also parse and provide the empirical data sets used here.

We include a self-contained benchmark that compares HYPA-DBGNN to other methods, including strong candidates like GCN and DBGNN, following the described evaluation procedure. The benchmark is kept as concise as possible so that the reader can focus on the main contributions. It can be used to reproduce the presented results.
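The order of magnitude of this estimate can be checked with a short back-of-the-envelope calculation. The sketch below uses the numbers stated above; the number of repetitions per configuration is not stated in this appendix, so the value of 10 is purely an assumption, as is reading the 400 hours as wall-clock time with 4 parallel runs.

```python
methods = 11                 # 8 in the main study, 3 in the ablation study
variants = 25                # upper bound of the parameter search
datasets = 7                 # 5 empirical, 2 synthetic
minutes_per_repetition = 5   # approximate average
parallel_runs = 4

# ASSUMPTION: 10 repetitions per configuration (not stated in this appendix).
repetitions = 10

total_repetitions = methods * variants * datasets * repetitions
wall_clock_hours = total_repetitions * minutes_per_repetition / parallel_runs / 60

print(f"{total_repetitions} repetitions, about {wall_clock_hours:.0f} hours")
```

Under these assumptions the result is on the order of 400 hours. Since 25 variants is an upper bound, the true figure may be lower.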

Appendix F Additional Results

Table 5: Comparison of our architectures (HYPA-DBGNN, HYPA-DBGNN-, HYPA-DBGNNE, HYPA-DBGNNZ) with different machine learning models. The balanced accuracy is given in Table 1, Table 2 and Table 4. The results are obtained as described in Section 5. The best results are marked.
| Data Set | Model | F1-score-macro | Precision-macro | Recall-macro |
| --- | --- | --- | --- | --- |
| Highschool2011 | EVO | 39.51 ± 11.50 | 39.38 ± 19.64 | 43.68 ± 10.91 |
| | HONEM | 57.54 ± 11.52 | 58.19 ± 13.09 | 59.00 ± 10.61 |
| | DeepWalk | 53.70 ± 18.55 | 53.47 ± 19.61 | 54.64 ± 17.70 |
| | Node2Vec | 53.70 ± 18.55 | 53.47 ± 19.61 | 54.64 ± 17.70 |
| | GCN | 48.55 ± 15.49 | 49.45 ± 18.52 | 55.00 ± 13.37 |
| | LGNN | 52.66 ± 14.71 | 53.57 ± 15.97 | 57.72 ± 9.85 |
| | DBGNN | 57.08 ± 11.35 | 61.78 ± 10.75 | 61.54 ± 11.13 |
| | HYPA-DBGNN | **59.60 ± 15.04** | 62.55 ± 14.38 | **63.25 ± 16.18** |
| | HYPA-DBGNN- | 55.92 ± 17.41 | 56.85 ± 16.26 | 57.67 ± 17.16 |
| | HYPA-DBGNNE | 57.30 ± 15.77 | **63.29 ± 14.85** | 61.54 ± 13.62 |
| | HYPA-DBGNNZ | 49.63 ± 17.56 | 52.23 ± 19.82 | 53.97 ± 17.59 |
| Highschool2012 | EVO | 46.83 ± 9.44 | 47.97 ± 18.15 | 50.05 ± 7.30 |
| | HONEM | 50.58 ± 9.49 | 53.89 ± 15.27 | 50.49 ± 9.31 |
| | DeepWalk | 48.79 ± 13.02 | 49.75 ± 13.77 | 49.65 ± 12.97 |
| | Node2Vec | 48.79 ± 13.02 | 49.75 ± 13.77 | 49.65 ± 12.97 |
| | GCN | 54.53 ± 10.82 | 56.94 ± 12.00 | 59.35 ± 11.13 |
| | LGNN | 45.32 ± 16.88 | 51.43 ± 14.63 | 51.43 ± 17.94 |
| | DBGNN | 60.22 ± 13.73 | 63.18 ± 12.57 | 64.93 ± 15.26 |
| | HYPA-DBGNN | 60.58 ± 12.12 | **66.23 ± 13.01** | **66.41 ± 10.24** |
| | HYPA-DBGNN- | 61.26 ± 16.13 | 64.37 ± 15.44 | 64.49 ± 15.27 |
| | HYPA-DBGNNE | **61.53 ± 17.30** | 64.22 ± 15.56 | 64.94 ± 17.71 |
| | HYPA-DBGNNZ | 56.00 ± 15.24 | 58.46 ± 14.32 | 59.63 ± 15.74 |
| Hospital | EVO | 20.05 ± 6.64 | 19.12 ± 9.20 | 25.00 ± 7.86 |
| | HONEM | 34.88 ± 18.22 | 36.88 ± 23.53 | 37.50 ± 17.35 |
| | DeepWalk | 20.00 ± 9.53 | 18.76 ± 9.68 | 23.89 ± 10.91 |
| | Node2Vec | 20.00 ± 9.53 | 18.76 ± 9.68 | 23.89 ± 10.91 |
| | GCN | 37.38 ± 8.67 | 33.83 ± 8.00 | 43.47 ± 9.03 |
| | LGNN | 35.81 ± 8.96 | 32.75 ± 10.64 | 44.03 ± 9.03 |
| | DBGNN | 47.87 ± 20.02 | 48.21 ± 21.79 | 51.67 ± 20.34 |
| | HYPA-DBGNN | **71.80 ± 19.18** | **71.50 ± 20.95** | **74.31 ± 17.45** |
| | HYPA-DBGNN- | 51.91 ± 20.77 | 50.83 ± 22.33 | 55.00 ± 20.49 |
| | HYPA-DBGNNE | 52.08 ± 13.10 | 52.25 ± 13.41 | 59.03 ± 12.72 |
| | HYPA-DBGNNZ | 65.66 ± 13.39 | 66.79 ± 16.20 | 69.31 ± 11.74 |
| StudentSMS | EVO | 54.62 ± 7.73 | 55.63 ± 9.53 | 55.05 ± 6.39 |
| | HONEM | 52.46 ± 9.71 | 55.65 ± 14.29 | 53.81 ± 7.28 |
| | DeepWalk | 52.08 ± 7.19 | 53.18 ± 7.61 | 52.78 ± 7.83 |
| | Node2Vec | 51.87 ± 7.39 | 52.13 ± 6.90 | 52.31 ± 7.70 |
| | GCN | 53.85 ± 6.39 | 54.39 ± 6.27 | 54.50 ± 6.40 |
| | LGNN | 46.79 ± 5.27 | 52.70 ± 6.07 | 52.71 ± 6.63 |
| | DBGNN | 56.87 ± 5.05 | 58.55 ± 5.58 | 57.72 ± 5.29 |
| | HYPA-DBGNN | **60.47 ± 6.68** | **61.40 ± 7.00** | **60.66 ± 6.11** |
| | HYPA-DBGNN- | 54.58 ± 9.12 | 55.66 ± 8.88 | 56.23 ± 10.41 |
| | HYPA-DBGNNE | 59.31 ± 9.08 | 59.97 ± 9.24 | 60.46 ± 9.42 |
| | HYPA-DBGNNZ | 52.60 ± 6.74 | 54.24 ± 9.03 | 53.45 ± 7.50 |
| Workplace2016 | EVO | 22.74 ± 12.34 | 21.84 ± 14.18 | 26.50 ± 12.08 |
| | HONEM | 77.75 ± 11.70 | 79.53 ± 13.50 | 79.46 ± 10.32 |
| | DeepWalk | 17.23 ± 8.77 | 16.30 ± 9.42 | 20.54 ± 9.51 |
| | Node2Vec | 17.23 ± 8.77 | 16.30 ± 9.42 | 20.54 ± 9.51 |
| | GCN | 68.56 ± 14.78 | 66.21 ± 16.88 | 73.33 ± 12.60 |
| | LGNN | 82.96 ± 15.65 | 84.32 ± 15.04 | 84.83 ± 14.77 |
| | DBGNN | 81.16 ± 19.16 | 81.33 ± 20.14 | 84.42 ± 15.59 |
| | HYPA-DBGNN | 85.82 ± 12.23 | 85.42 ± 13.75 | 88.29 ± 10.51 |
| | HYPA-DBGNN- | 82.75 ± 14.26 | 83.25 ± 15.21 | 84.71 ± 13.66 |
| | HYPA-DBGNNE | 86.47 ± 16.28 | 86.00 ± 17.36 | **88.50 ± 13.57** |
| | HYPA-DBGNNZ | **87.67 ± 11.99** | **88.83 ± 13.10** | 88.42 ± 10.88 |
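For readers less familiar with the macro-averaged metrics in Table 5, the following sketch shows how they can be computed from true and predicted labels. It is a plain-Python illustration and not the evaluation code of the reference implementation; the function name `macro_metrics` and the toy labels are assumptions. Macro F1 is taken here as the unweighted mean of the per-class F1 scores, and balanced accuracy is commonly defined as the macro-averaged recall.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return unweighted per-class averages of precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class differs
            fn[t] += 1  # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Undefined ratios (empty denominators) are counted as 0.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three classes (illustrative labels only).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
metrics = macro_metrics(y_true, y_pred)
```

Because every class contributes equally to the average, these metrics do not favor the majority class when class sizes are imbalanced.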