Inference of Sequential Patterns for Neural Message Passing in Temporal Graphs

Jan von Pichowski Vincenzo Perri Lisi Qarkaxhija Ingo Scholtes
Chair of Machine Learning for Complex Networks
Center for Artificial Intelligence and Data Science (CAIDAS)
Julius-Maximilians-Universität Würzburg, DE
[email protected]
Data Analytics Group, Department of Informatics, University of Zurich, Zurich, CH

Abstract

The modelling of temporal patterns in dynamic graphs is an important current research issue in the development of time-aware Graph Neural Networks (GNNs). However, whether or not a specific sequence of events in a temporal graph constitutes a temporal pattern not only depends on the frequency of its occurrence. We must also consider whether it deviates from what is expected in a temporal graph where timestamps are randomly shuffled. While accounting for such a random baseline is important to model temporal patterns, it has mostly been ignored by current temporal graph neural networks. To address this issue we propose HYPA-DBGNN, a novel two-step approach that combines (i) the inference of anomalous sequential patterns in time series data on graphs based on a statistically principled null model, with (ii) a neural message passing approach that utilizes a higher-order De Bruijn graph whose edges capture overrepresented sequential patterns. Our method leverages hypergeometric graph ensembles to identify anomalous edges within both first- and higher-order De Bruijn graphs, which encode the temporal ordering of events. Consequently, the model introduces an inductive bias that enhances model interpretability.

We evaluate our approach for static node classification using established benchmark datasets and a synthetic dataset that showcases its ability to incorporate the observed inductive bias regarding over- and under-represented temporal edges. Furthermore, we demonstrate the framework’s effectiveness in detecting similar patterns within empirical datasets, resulting in superior performance compared to baseline methods in node classification tasks. To the best of our knowledge, our work is the first to introduce statistically informed GNNs that leverage temporal and causal sequence anomalies. HYPA-DBGNN represents a promising path for bridging the gap between statistical graph inference and neural graph representation learning, with potential applications to static GNNs.

1 Introduction

Graphs are powerful representations of complex data. Not surprisingly, there is a growing collection of successful methods for learning on graphs [29, 55, 22]. These methods are versatile and are widely used in bioinformatics [60], social sciences [44], and pharmacy [52]. While many methods assume a static graph, real-world scenarios often involve dynamic systems, such as evolving interactions in social networks. Although known techniques for static graphs can be applied to dynamic graphs [32], important patterns may be missed [57]. Recently, several approaches have incorporated temporal dynamics to obtain time-aware graph neural networks. These methods are applied to different tasks such as static node classification [45], link prediction or continuous node property prediction [47].

A common theme between static and temporal GNNs is that the observed graphs are usually directly used for message passing. Recently, data augmentation techniques have been proposed to improve the generalizability of GNNs. Such data augmentation techniques have been considered for a variety of reasons, such as to reduce oversquashing [53], improve class homophily for node classification [33], foster diffusion [62], or include non-dyadic relationships [45]. Another motivation that has recently been highlighted in [64] is the presence of noise in empirically observed graphs. This motivates augmentation techniques for GNNs that ideally prune spuriously observed edges, while adding erroneously unobserved edges.

However, addressing noise in observed graphs arguably requires graph correction methods that account for a “random baseline” against which significant patterns can be distinguished from noise, rather than augmentation methods that are based on heuristics or that adjust the graph based on ground truth node classes. Moreover, the application of GNNs to temporal graphs introduces unique challenges for data augmentation, as we typically want to focus on temporal patterns that are due to the time-ordered sequence of events. To the best of our knowledge, no existing works have considered graph correction methods that combine a statistically principled inference of sequential patterns with temporal GNNs.

Addressing this research gap, in this work we propose HYPA-DBGNN, a novel two-step approach for temporal graph learning: In a first step we infer anomalous sequential patterns in time series data on graphs based on a statistical ensemble of temporal graphs, i.e. a null model of random graphs that preserves the frequency of time-stamped edges but randomizes the temporal ordering in which those edges occur. Building on the HYPA framework [31], our method leverages hypergeometric graph ensembles. This allows us to analytically calculate expected frequencies of node sequences on time-respecting paths, which is the basis to identify anomalous sequential patterns in temporal graphs. In a second step we apply neural message passing on an augmented higher-order De Bruijn graph, whose edges capture overrepresented sequential patterns in a temporal graph, thus introducing an inductive bias that emphasizes sequential patterns over mere edge frequencies. The contributions of our work are as follows:

(i) We propose a novel approach to augment message passing based on a statistical null model. This allows us to infer which temporal sequences in a time-stamped interaction sequence are over- or under-represented compared to a random baseline temporal graph in which the frequencies of edges are preserved while their temporal ordering is shuffled.
(ii) Building on this statistical inference approach, we propose HYPA-DBGNN, a time-aware temporal graph neural network architecture that specifically captures temporal patterns that deviate from a random baseline.
(iii) We demonstrate our approach in synthetic temporal graphs sampled from a model that generates heterogeneously distributed temporal sequences of events in such a way that node classes are associated with the over- or under-representation of temporal events compared to random temporal orderings rather than mere frequencies.
(iv) We demonstrate the practical relevance of our method by evaluating node classification in five empirical temporal graphs capturing time-stamped proximity events between humans. A comparison of HYPA-DBGNN with standard De Bruijn Graph Neural Networks without our HYPA-based inference reveals that our approach yields an improved accuracy in all five data sets. Moreover, a comparison to seven baseline techniques shows that our method yields the best performance in all empirical data.
(v) We finally show that the distribution of HYPA scores in the augmented message passing graph, which captures the degree to which frequencies of temporal sequences deviate from a random baseline, enables us to explain why HYPA-DBGNN yields larger performance improvements on some data sets compared to others.

Different from prior works, with our work we propose a statistically principled data augmentation for temporal graph neural networks that uses a statistical ensemble of temporal graphs with a given weighted topology. Apart from improving temporal GNNs, we further argue that the general approach of utilizing well-known statistical ensembles of graphs from network science for graph correction could help to improve the performance of GNNs in data affected by noise.

2 Related Work

Data augmentation for graphs has been explored from various directions with the goal of allowing machine learning models to better generalize and attend to signal over noise [66]. Many methods have utilized heuristic graph modification strategies like randomly removing nodes [59, 13], edges [46], or subgraphs [56, 59] to improve performance and generalizability. Other works have considered adding virtual nodes [43, 24] or rewiring the network topology, which also addresses oversquashing [53, 1], with graph transformers operating on a fully connected topology representing an extreme case [37, 58, 30]. Additionally, it has been shown that using graph diffusion convolutions instead of raw neighborhoods alleviates problems from noisy and arbitrarily defined edges in real-world graphs [16]. Network data augmentation has also been explored by going beyond pairwise connections, either through mediating node interactions via subgraphs [39, 3, 63, 9] or by utilizing higher-order graphs. Examples of higher-order approaches include simplicial networks [5], cellular complexes [4, 20], hypergraphs [23, 8, 17], and time-respecting node sequences [45]. Another area of research focused on learning the graph augmentations from the data. One approach is to perform graph augmentation as a preprocessing step, completely separate from the downstream task, where the graph structure is cleaned before being used as input to the GNN [27, 65]. Other works embed the augmentation strategy into an end-to-end differentiable pipeline, jointly learning the optimal graph representation and the downstream task [26, 35, 15, 12, 28].

As our work addresses temporal graph data, it is related to the field of temporal GNNs. Temporal GNNs have been developed for both discrete and continuous time settings [34]. Discrete-time approaches segment the temporal data into time windows [49, 21], thus aggregating interactions and losing information on time-respecting paths within those time windows. In contrast, continuous-time approaches produce time-evolving node embeddings, focusing on the temporal variability of network activity at different time points rather than on the patterns occurring across temporally-ordered interaction sequences [57, 47]. The work most similar to our perspective is DBGNN [45], which learns from sequential correlations in high-resolution timestamped data. Our approach diverges from DBGNN by considering a more nuanced notion of the relevance of time-respecting paths. Rather than relying on the raw frequency of interactions, HYPA-DBGNN uses a statistically grounded anomaly score. This score quantifies the over- and under-expression of time-respecting paths, which makes the model less susceptible to noise and bases the contribution of paths on their statistical significance.

3 Background

A graph $G=\left(V,E\right)$ is defined as a set of nodes $V$ representing the elements of the system, and a set of edges $E\subseteq V\times V$ representing their direct connections. However, it is often important to consider how nodes influence one another through a path, which is an ordered sequence $(v_{0},v_{1},\dots,v_{l})$ of nodes $v_{i}\in V$. In a path, all node transitions must correspond to edges in the graph, i.e., $e_{i}=(v_{i},v_{i+1})\in E\;\forall i\in[0,l-1]$. Paths are often inferred from edges based on a transitivity assumption. This assumption states that if there is an edge $(v_{0},v_{1})$ with transition probability $\alpha$, and an edge $(v_{1},v_{2})$ with transition probability $\beta$, then the path $(v_{0},v_{1},v_{2})$ will be observed with probability $\alpha\cdot\beta$. In other words, the transitions are considered to be independent. The transitivity assumption simplifies the modeling of a path by expressing its probability as the product of the individual edge transition probabilities. However, this assumption often fails in temporal networks $G^{t}=(V,E^{t})$, where $E^{t}\subseteq V\times V\times\mathbb{N}$ because edges carry timestamps. In temporal networks, the ordering of edges can play an important role in determining the likelihood of observing certain paths. A time-respecting path is defined as a sequence of edges $((v_{0},v_{1},t_{1}),\ldots,(v_{i-1},v_{i},t_{i}),\ldots,(v_{n-1},v_{n},t_{n}))$ that for all $i\in[2,n]$ respects two conditions: (i) transitions respect the order of time, $t_{i}>t_{i-1}$, and (ii) $t_{i}-t_{i-1}\leq\delta$, where $\delta$ is a parameter controlling the maximum time distance for considering interactions temporally adjacent. Therefore, different from what we would get by discarding time and using the transitivity assumption, the two edges $(u,v,t_{1})$ and $(v,w,t_{2})$ form a time-respecting path only if $t_{2}>t_{1}$.
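
The two conditions for a time-respecting path can be checked with a few lines of Python. The following is a minimal sketch, not part of the reference implementation; the function name `is_time_respecting` and the representation of edges as `(source, target, timestamp)` tuples are assumptions made for illustration.

```python
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (source, target, timestamp); illustrative representation


def is_time_respecting(path: List[Edge], delta: int) -> bool:
    """Check conditions (i) and (ii) for every pair of consecutive edges."""
    for (_, v1, t1), (u2, _, t2) in zip(path, path[1:]):
        if v1 != u2:         # consecutive edges must be chained at a shared node
            return False
        if not t2 > t1:      # (i) timestamps strictly increase along the path
            return False
        if t2 - t1 > delta:  # (ii) consecutive events are at most delta apart
            return False
    return True


in_order = is_time_respecting([("u", "v", 1), ("v", "w", 2)], delta=5)
wrong_order = is_time_respecting([("u", "v", 2), ("v", "w", 1)], delta=5)
too_far_apart = is_time_respecting([("u", "v", 1), ("v", "w", 9)], delta=5)
```

Only the first call yields a time-respecting path: the second violates the time ordering and the third exceeds the maximum time distance $\delta$.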
To capture time-respecting sequential patterns, higher-order De Bruijn graphs model the probabilities of path sequences explicitly. These models construct a representation that respects the topology of the original graph and the frequencies of observed paths of a given length $k$ . Specifically, a higher-order network of the k-th order is defined as an ordered pair $G^{(k)}=(V^{(k)},E^{(k)})$ , where $V^{(k)}\subseteq V^{k}$ are the higher-order vertices, and $E^{(k)}\subseteq V^{(k)}\times V^{(k)}$ are the higher-order edges. Each higher-order vertex $v=:\langle v_{0}v_{1}\ldots v_{k-1}\rangle\in V^{(k)}$ is an ordered tuple of $k$ vertices $v_{i}\in V$ from the original graph. The higher-order edges connect higher-order nodes that overlap in exactly $k-1$ vertices, similar to the construction of high-dimensional De Bruijn graphs [10]. The weights of the higher-order edges in $G^{(k)}$ represent the frequency of paths of length $k$ in the original graph. Specifically, the weight of the edge $(\langle v_{0}\ldots v_{k-1}\rangle,\langle v_{1}\ldots v_{k}\rangle)$ counts how often the path $\langle v_{0}\ldots v_{k}\rangle$ of length $k$ occurs. By explicitly modeling the probabilities of these higher-order path sequences, the higher-order network representation can capture patterns and dependencies that may be missed when relying on the transitivity assumption [51].
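
As an illustration of this construction, the sketch below counts the edge weights of a $k$-th order De Bruijn graph from a list of observed node sequences. It assumes that the time-respecting paths have already been extracted; the function name `de_bruijn_edge_weights` and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

HigherOrderNode = Tuple[str, ...]


def de_bruijn_edge_weights(
    paths: Iterable[Sequence[str]], k: int
) -> Dict[Tuple[HigherOrderNode, HigherOrderNode], int]:
    """Count how often each path of length k (k + 1 nodes) occurs."""
    weights: Counter = Counter()
    for path in paths:
        # Every window of k + 1 consecutive nodes is one path of length k.
        for i in range(len(path) - k):
            window = tuple(path[i:i + k + 1])
            source, target = window[:-1], window[1:]  # overlap in k - 1 nodes
            weights[(source, target)] += 1
    return dict(weights)


# Hypothetical toy sequences, loosely following the node labels of Figure 1.
paths = [("A", "X", "C"), ("A", "X", "C"), ("B", "X", "D"), ("B", "X", "C")]
first_order = de_bruijn_edge_weights(paths, k=1)
second_order = de_bruijn_edge_weights(paths, k=2)
```

In this toy example the first-order edge from `X` to `C` has weight 3, while the second-order graph separates this into two occurrences via `A` and one via `B`.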

Detection of Path Anomalies

Defining anomalies requires a reference base. In our case, the transitivity assumption provides the null model that serves as this baseline. Anomalies occur in sequences that deviate from this baseline, likely due to correlations and interdependencies not captured by the transitivity assumption. First, we discuss how the hypergeometric ensemble allows testing for anomalous edge frequencies based on node activity, i.e., their in- and out-degrees. Building on this, we then outline how this methodology is extended to test if the frequencies of paths of length $k$ are anomalous given those of paths of length $k-1$ .

Configuration models [38] provide randomization methods for graphs that shuffle edges while preserving vertex degrees. In a nutshell, they first disassemble the graph, leaving nodes with in- and out-stubs. Then, a new network is reassembled by connecting pairs of in- and out-stubs that are picked with equal probability. This procedure is algorithmically straightforward but can be computationally expensive. To address this, Casiraghi and Nanumyan [6] contributed a closed-form expression for the soft configuration model, which fixes the expected vertex degrees rather than the exact degree sequence. In their formulation, the sampling of edges is equated to sampling from an urn. The authors introduce a combinatorial matrix $\mathbf{\Xi}\in\mathbb{N}^{n\times n}$, where $\Xi_{ij}=d^{out}_{i}\cdot d^{in}_{j}$ encodes the product of the out-degree of node $i$ and the in-degree of node $j$ in the original graph $G$. The total number of possible edge placements is then $M=\sum_{ij}\Xi_{ij}$. A network is sampled from this ensemble by drawing $m=\sum_{i}d^{out}_{i}=\sum_{i}d^{in}_{i}$ edges without replacement from the $M$ possible edge placements. The probability of observing $A_{ij}$ edges between nodes $i$ and $j$ is then given by the hypergeometric distribution: $P(A_{ij})=\binom{M}{m}^{-1}\binom{\Xi_{ij}}{A_{ij}}\binom{M-\Xi_{ij}}{m-A_{ij}}.$ With this probability mass function, we can quantify how anomalous the frequency of an edge is. This closed-form expression and the sampling process that generates it provide a principled null model that preserves the expected degree sequence, which will be crucial for our subsequent analysis of anomalous path patterns in the network.
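
The quantities $\mathbf{\Xi}$, $M$ and $m$ and the resulting probability mass function can be computed directly from a weighted adjacency matrix. The sketch below uses only the Python standard library; the toy matrix `A` and the function name `edge_pmf` are assumptions made for illustration and are not taken from the reference implementation.

```python
from math import comb


def edge_pmf(a_ij: int, xi_ij: int, M: int, m: int) -> float:
    """Hypergeometric probability of observing a_ij edges between i and j."""
    return comb(xi_ij, a_ij) * comb(M - xi_ij, m - a_ij) / comb(M, m)


# Hypothetical weighted, directed toy graph: A[i][j] counts edges from i to j.
A = [[0, 2, 1],
     [1, 0, 0],
     [0, 1, 0]]
n = len(A)
d_out = [sum(A[i]) for i in range(n)]
d_in = [sum(A[i][j] for i in range(n)) for j in range(n)]

# Xi_ij = d_out_i * d_in_j, M = sum of all Xi entries, m = number of edges.
Xi = [[d_out[i] * d_in[j] for j in range(n)] for i in range(n)]
M = sum(sum(row) for row in Xi)
m = sum(d_out)

# Probability of observing exactly A[0][1] edges from node 0 to node 1.
p_observed = edge_pmf(A[0][1], Xi[0][1], M, m)
```

Summing `edge_pmf` over all possible counts from 0 to `m` gives 1, which is a quick sanity check for the urn formulation.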

Our concept of path anomalies, introduced by LaRock et al. [31], provides a statistical framework for identifying paths through a graph that are traversed with anomalous frequencies. The key idea is to define a null model of order $k-1$ that captures the expected frequencies of paths of length $k$, and then identify paths that deviate significantly from this null model. To construct the null model, one must establish a statistical ensemble of $k$-th order De Bruijn graphs. The starting point is the hypergeometric ensemble outlined in the previous paragraph, which preserves the total in- and out-degrees of nodes while shuffling the edge weights. For a De Bruijn graph of order $k$, the nodes’ degrees are determined by the frequencies of paths of length $k-1$, i.e., by the edge frequencies of the De Bruijn graph of order $k-1$. A hypergeometric ensemble of the De Bruijn graph presents one additional difficulty. Specifically, an edge between two $k$-th order nodes is valid only if their path representations overlap in $k-1$ first-order nodes. This implies that some of the $\mathbf{\Xi}$ matrix entries represent invalid paths. HYPA handles this by zeroing out impossible entries and redistributing their values through an optimization procedure, as detailed in the original work.

4 HYPA De Bruijn Graph Neural Network Architecture

We now introduce the HYPA-DBGNN architecture, which relies on statistically principled graph augmentation (a reference implementation, data sets and benchmarks are given at https://github.com/jvpichowski/HYPA-DBGNN). The temporal dynamics of the sequential patterns are encoded in first- and higher-order De Bruijn graphs. From these, we infer graph corrections that include anomaly statistics in the graph topology. We then present a multi-order augmented message passing scheme that relies on the inferred graphs with induced bias. Although we adapt the message passing procedure of Graph Convolutional Networks (GCN) from Kipf and Welling [29], our architecture is generalizable to other message passing schemes due to the selective additions.

Statistically Principled Graph Augmentation

As outlined before, the $k$-th-order De Bruijn graphs capture the observed frequencies of the $k$-th-order sequences through the edges between $k$-th-order nodes. This potentially biased representation provides the foundation for hypergeometric ensembles whose edge frequencies are induced by the $(k-1)$-th-order sub-sequences. The HYPA score [31], defined as $HYPA^{(k)}(u,v)=\Pr(X_{uv}\leq f(u,v))$, where $f(u,v)$ is the observed frequency of the edge $(u,v)$ and $X_{uv}$ is its frequency in a random realization, gives the probability that a random realization contains the edge at most as often as observed. A HYPA score close to one indicates an over-represented edge, whereas a HYPA score approaching zero indicates an edge that is observed less often than expected. Using the HYPA scores as adjacency matrix $A_{uv}^{(k)}=HYPA^{(k)}(u,v)$ leads to corrected graphs in which under-represented edges are reduced and over-represented edges are balanced, encoding the temporal pattern.
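
The HYPA score is the cumulative distribution function of the hypergeometric edge distribution, evaluated at the observed frequency. The sketch below computes it with the standard library and uses the scores as edge weights. It assumes that the entries of $\mathbf{\Xi}$ are already given; for higher orders, HYPA additionally corrects $\mathbf{\Xi}$ for invalid paths, which is not reproduced here. The function names and the toy values are hypothetical.

```python
from math import comb
from typing import Dict, Tuple

EdgeKey = Tuple[str, str]


def hypa_score(f_uv: int, xi_uv: int, M: int, m: int) -> float:
    """Pr(X_uv <= f_uv) for a hypergeometrically distributed edge count X_uv."""
    favourable = sum(
        comb(xi_uv, x) * comb(M - xi_uv, m - x) for x in range(f_uv + 1)
    )
    return favourable / comb(M, m)


def hypa_adjacency(
    freq: Dict[EdgeKey, int], xi: Dict[EdgeKey, int], M: int, m: int
) -> Dict[EdgeKey, float]:
    """Replace observed edge frequencies by their HYPA scores."""
    return {edge: hypa_score(f, xi[edge], M, m) for edge, f in freq.items()}


# Hypothetical observed frequencies and Xi entries for two edges.
freq = {("a", "b"): 2, ("a", "c"): 1}
xi = {("a", "b"): 9, ("a", "c"): 3}
adjacency = hypa_adjacency(freq, xi, M=25, m=5)
```

Note that the edge with the higher raw frequency does not necessarily receive the higher score: the score depends on how the observed frequency compares to the expectation encoded in $\mathbf{\Xi}$.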

Message Passing for Higher-Order De Bruijn Graphs with Induced Bias

For layer $l$ , we define the update rule of the message passing as

$$\vec{h}_{v}^{k,l}=\sigma\left(\mathbf{W}^{k,l}\sum_{\{u\in V^{(k)}:(u,v)\in E^{(k)}\}\cup\{v\}}\frac{HYPA^{(k)}(u,v)\cdot\vec{h}^{k,l-1}_{u}}{\sqrt{H(v)\cdot H(u)}}\right),\tag{1}$$

with the previous hidden representation $\vec{h}_{u}^{k,l-1}$ of node $u\in V^{(k)}$ , the inferred HYPA score $HYPA^{(k)}(u,v)$ of the given edge $(u,v)\in E^{(k)}$ (capturing the induced bias), the trainable weight matrices $\mathbf{W}^{k,l}\in\mathbb{R}^{H^{l}\times H^{l-1}}$ , the normalization factor based on the HYPA score sum of incoming edges $H(v):=\sum_{\{u\in V^{(k)}:(u,v)\in E^{(k)}\}\cup\{v\}}HYPA^{(k)}(u,v)$ , and the non-linear activation function $\sigma$ , here ReLU.
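
To make the update rule in Equation (1) concrete, the following sketch implements a single layer in plain Python. It is a simplified illustration rather than the reference implementation: the function name is hypothetical, and the weight of the self-loop, which the text does not specify, is assumed to be 1.0 as in a standard GCN.

```python
import math
from typing import Dict, Hashable, List, Tuple

Vector = List[float]


def hypa_message_passing(
    h: Dict[Hashable, Vector],
    hypa: Dict[Tuple[Hashable, Hashable], float],
    W: List[Vector],
    self_loop: float = 1.0,  # assumption: self-loop score is not given in the text
) -> Dict[Hashable, Vector]:
    """One layer of Equation (1) with ReLU as the activation function."""
    scores = dict(hypa)
    for v in h:
        scores.setdefault((v, v), self_loop)

    # H(v): sum of HYPA scores over incoming edges, including the self-loop.
    H = {v: 0.0 for v in h}
    for (_, v), s in scores.items():
        H[v] += s

    out = {}
    for v in h:
        agg = [0.0] * len(W[0])
        for (u, target), s in scores.items():
            if target == v:
                coeff = s / math.sqrt(H[v] * H[u])
                agg = [a + coeff * x for a, x in zip(agg, h[u])]
        # Linear transformation with W followed by ReLU.
        out[v] = [max(0.0, sum(w * a for w, a in zip(row, agg))) for row in W]
    return out


h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = hypa_message_passing(h0, {("a", "b"): 0.9}, identity)
```

With the identity as weight matrix, node `a` keeps its representation, while node `b` receives a mixture of its own features and those of `a`, weighted by the HYPA score of the edge between them.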

Depending on the order $k$, the message passing for different higher-order graphs is based on different higher-order node sets $V^{(k)}$ whose nodes $v$ have their own hidden representations $\vec{h}^{k,l}_{v}$ for every layer $l$. An initial feature encoding is only provided for the first order ($k=1$) as $\vec{h}^{1,0}_{v}$. To transfer the features to higher-order nodes and to merge the hidden representations, we introduce two bipartite mappings.

The initial first-order feature set $\vec{h}^{1,0}_{u}$ is mapped to the higher-order node representations $\vec{h}^{k,1}_{v}$ using the bipartite graph $G^{b_{0}}=\left(V^{(k)}\cup V,E^{b_{0}}\subseteq V\times V^{(k)}\right)$ with $e_{uv}\in E^{b_{0}}$ if $v=(v_{0},\dots,v_{k-1})\in V^{(k)}$ and $v_{0}=u$, in analogy to interpreting message passing layers as higher-order Markov chains. Multiple first-order representations are aggregated using the function $\mathcal{F}$ (in our case MEAN) and transformed with the learnable weight matrix $\textbf{W}^{b_{0}}\in\mathbb{R}^{H^{1,0}\times H^{k,0}}$ to the higher-order feature space.

$$\vec{h}^{k,1}_{v}=\sigma\left(\textbf{W}^{b_{0}}\mathcal{F}\left(\left\{\vec{h}^{1,0}_{u}\text{ : for }u\in V^{(1)}\text{ with }(u,v)\in E^{b_{0}}\right\}\right)\right)\tag{2}$$

The second mapping is defined as in Qarkaxhija et al. [45]. It is the counterpart to the first bipartite layer. Here, the higher-order node representations are summed with the first-order node representations (requiring matching representation dimensions $F^{g}=H^{l}$) if the last path entry $u_{k-1}$ of the higher-order node $u=(u_{0},\dots,u_{k-1})\in V^{(k)}$ equals the first-order node $v=u_{k-1}$. The bipartite graph is given as $G^{b}=\left(V^{(k)}\cup V,E^{b}\subseteq V^{(k)}\times V\right)$ and leads to a first-order node representation $\vec{h}^{b}_{v}$ with $v\in V$ and the learnable matrix $\textbf{W}^{b}\in\mathbb{R}^{F^{g}\times H^{l}}$.

$$\vec{h}^{b}_{v}=\sigma\left(\textbf{W}^{b}\mathcal{F}\left(\left\{\vec{h}^{k,l}_{u}+\vec{h}^{1,g}_{v}\text{ : for }u\in V^{(k)}\text{ with }(u,v)\in E^{b}\right\}\right)\right)\tag{3}$$
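
The edge sets of the two bipartite graphs follow directly from the tuple representation of higher-order nodes. The sketch below builds both edge sets and performs the MEAN aggregation of Equation (3). For brevity it omits the learnable matrix $\textbf{W}^{b}$ and the activation $\sigma$; the function names and toy features are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

HigherOrderNode = Tuple[str, ...]
Vector = List[float]


def bipartite_edges(higher_order_nodes: List[HigherOrderNode]):
    """E^{b_0}: first entry -> higher-order node; E^{b}: higher-order node -> last entry."""
    e_b0 = [(v[0], v) for v in higher_order_nodes]
    e_b = [(u, u[-1]) for u in higher_order_nodes]
    return e_b0, e_b


def merge_to_first_order(
    h_k: Dict[HigherOrderNode, Vector],
    h_1: Dict[str, Vector],
    e_b: List[Tuple[HigherOrderNode, str]],
) -> Dict[str, Vector]:
    """MEAN over the sums h_u + h_v for all higher-order nodes u mapped to v."""
    grouped = defaultdict(list)
    for u, v in e_b:
        grouped[v].append([a + b for a, b in zip(h_k[u], h_1[v])])
    return {
        v: [sum(column) / len(vectors) for column in zip(*vectors)]
        for v, vectors in grouped.items()
    }


nodes_2 = [("A", "X"), ("B", "X"), ("X", "C")]
e_b0, e_b = bipartite_edges(nodes_2)
h_k = {("A", "X"): [1.0, 0.0], ("B", "X"): [3.0, 0.0], ("X", "C"): [0.0, 2.0]}
h_1 = {"X": [1.0, 1.0], "C": [0.0, 0.0]}
merged = merge_to_first_order(h_k, h_1, e_b)
```

In this toy example both `("A", "X")` and `("B", "X")` end in `X`, so their representations are averaged into the first-order representation of `X`.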

An overview of the inference process and the proposed neural network architecture for the first- and second-order graph is shown in Figure 1. A one-hot encoding is used as first-order feature set and transferred to the higher-order nodes using the first bipartite layer. Multiple message passing steps are performed independently for the two given graph topologies. The number of message passing rounds and the dimensions of layers may vary in the two parts. After performing the message passing in parallel and merging the features with the second bipartite layer, a final classification layer is applied. We discuss the computational complexity in Appendix B; it is upper-bounded by the complexity of DBGNN because edges are removed in the graph correction.

Figure 1: Inference procedure leading to the dynamic graph used for neural message passing. (a) Example of sequence data adapted from LaRock et al. [31]. (b) First- (blue) and higher-order (orange) De Bruijn graphs encoding temporally ordered time-stamped edges are compared to a random graph ensemble null model with shuffled time-stamped $(k-1)$-th-order edges. (c) The graphs are corrected by introducing a statistically principled bias that revalues all edges ($w_{\langle AXC\rangle}\approx w_{\langle BXD\rangle}>w_{\langle BXC\rangle}$) and removes under-represented edges, i.e. edges that with high probability appear less often than expected ($\langle AXD\rangle$). (d) The multi-order graph neural network is trained respecting the inferred graphs.

5 Experimental Evaluation

We compare our architecture with graph representation learning methods (EVO [2], HONEM [48], DeepWalk [42] and Node2Vec [18]) and deep graph learning methods (GCN [29], LGNN [7] and DBGNN [45]). For the representation learning models Node2Vec and EVO we adhere to the original configurations, i.e. we use an embedding size of $d=128$ and a random walk length of $l=80$, repeated $r=10$ times. As context size we use $k=10$. For Node2Vec we select the return parameter ($p$) and the in-out parameter ($q$) from the set $\{0.25,0.5,1,2,4\}$. The deep learning models (GCN, LGNN, DBGNN, and our proposed model) consist of three layers. Following the approach of [45], we set the size of the last layer to $h_{2}=16$, while the sizes of the preceding layers are determined during model selection. The search range for $h_{0}$ and $h_{1}$ is $\{4,8,16,32\}$, with training running for a maximum of 5000 epochs as per [45]. The higher-order path length is fixed to $k=2$ for HYPA-DBGNN and DBGNN because it was shown to be optimal for the given data sets by Qarkaxhija et al. [45]. We use Stochastic Gradient Descent (SGD) as optimizer, with the learning rate set to $0.001$, which showed the best performance. We use dropout regularization with a dropout rate of $0.4$ to mitigate overfitting and we incorporate class weights in the loss function to address imbalanced training datasets.

The data sets used do not have an independent test set, nor are they large enough to define a robust dedicated test set. This makes it challenging to directly evaluate model performance on unknown data. To compare various Graph Neural Network (GNN) architectures, we adopt a conventional approach as documented in literature [11, 40, 25]. For the assessment of model generalizability, we employ a nested cross-validation strategy with $N=10$ repetitions. The data undergoes stratified partitioning into nine training and one testing fold, further divided into stratified training and validation subsets (80/20%) within each repetition. Subsequently, we select the best-performing model and epoch based on its validation set performance. Finally, we evaluate the chosen model’s performance on the test set, reporting the mean and standard deviation of the respective metric across all N repetitions. For comparability, we use the same folds and splits for all experiments. Besides the random splits, the random initialization of the model also contributes to the variability captured by the standard deviation. For reproducibility, we fix the random splits and reuse a common seed in every repetition for the random initialization of model weights and dropout candidates.
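
The splitting scheme described above can be sketched as follows. This is a simplified standard-library illustration under the assumption that nodes are identified by integer indices and that labels are sortable; the function names `stratified_folds` and `nested_splits` are hypothetical and do not refer to the reference implementation.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], n_folds: int, seed: int) -> List[List[int]]:
    """Deal the shuffled indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(members)
    return folds


def nested_splits(
    labels: Sequence[int], n_folds: int = 10, inner_folds: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, test) index lists, one per outer repetition."""
    folds = stratified_folds(labels, n_folds, seed)
    for i, test in enumerate(folds):
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # One of five stratified inner folds serves as validation set (80/20 split).
        inner = stratified_folds([labels[idx] for idx in rest], inner_folds, seed + i + 1)
        val = [rest[p] for p in inner[0]]
        train = [rest[p] for fold in inner[1:] for p in fold]
        yield train, val, test


labels = [0] * 30 + [1] * 20
splits = list(nested_splits(labels))
```

Fixing `seed` keeps the folds identical across models, which corresponds to reusing the same folds and splits for all experiments.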

5.1 Experimental Results for Synthetic Data Sets

We use synthetic data with two classes of nodes $C=\{A,B\}$ to demonstrate the type of patterns that only our model can learn. The characteristic properties of the configuration model and its derivation are detailed in Section A.1. Importantly, the data contain a heterogeneous distribution of sequences (e.g. $\langle v_{0},v_{1},v_{2}\rangle_{f}$) of time-stamped edges or events (here: $(v_{0},v_{1})_{t_{0}},(v_{1},v_{2})_{t_{1}},t_{0}<t_{1}$) between nodes. The learnable sequential pattern is an increased class-assortativity, i.e. edges are temporally ordered such that events between nodes of the same class preferentially follow each other (e.g. leading to $\langle A,A,A,B\rangle$). Hence, these higher-order sequences with nodes from the same group are over-expressed compared to what we would expect by shuffling the temporal order of the time-stamped edges between the nodes (e.g. $\langle A,B,A,A\rangle$).

The pattern is only discernible by higher-order models due to its restriction to higher-order sequences. For a homogeneous sequence distribution, the pattern of overrepresented sequences would be reflected in the mere frequencies. However, due to the initial heterogeneous distribution, overrepresented sequences can also have low frequencies (e.g. $\langle A,X,C\rangle$ in Figure 1). Thus, they are also unobservable by higher-order baselines such as DBGNN that only include mere frequencies. However, the comparison of the observed frequencies with a null model that preserves the frequency of time-stamped edges but randomizes the temporal order reveals the sequential pattern.

We use two synthetic data sets with the same distribution of time-stamped edges. Unweighted Sampling contains sequences with randomized temporal order of time-stamped edges, whereas Weighted Sampling contains sequences with increased class-assortativity. An unintended correlation between the obtained graph topology and the event classes cannot be excluded for either data set. Weighted Sampling additionally contains the preferential chaining pattern.

The results in Table 1 show the capabilities of the models in terms of accuracy in solving the respective binary node classification tasks. The different methods yield varying results for the synthetic data set without the intended pattern. All representation learning methods with a horizon of $l=80$, except EVO, perform better than the deep graph learning methods with a smaller horizon of $l=3$. Our approach performs as well as DBGNN, with which it shares architectural similarities such as two distinct message passing modules.

The Weighted Sampling data set highlights the ability of the methods to learn the intended pattern. On this second data set, GCN performs worse than on the first one, while all other baseline methods perform the same. In contrast, HYPA-DBGNN improves by 55 percentage points and reaches an accuracy of $100\%$. These observations indicate that some of the current baselines are able to learn an unintended pattern present in both data sets. However, they fail to learn the implanted pattern of increased class-assortativity, whereas HYPA-DBGNN is able to learn it.

Table 1: Comparison of HYPA-DBGNN with baselines for the synthetic data sets. The table presents the balanced accuracy and its standard deviation for the static node classification task on dynamic graphs as obtained through the outlined experiments. The Unweighted Sampling data set contains a heterogeneous sequence distribution of time-stamped edges with shuffled temporal order. The adapted distribution of sequences in Weighted Sampling encodes a sequential pattern such that time-stamped edges between nodes of the same class are overrepresented but not necessarily very frequent.

Representation Learning	EVO	HONEM	DeepWalk	Node2Vec
Unweighted Sampling	40.00 ± 31.62	80.00 ± 25.82	60.00 ± 21.08	60.00 ± 21.08
Weighted Sampling	40.00 ± 31.62	80.00 ± 25.82	60.00 ± 21.08	60.00 ± 21.08
Deep Graph Learning	GCN	LGNN	DBGNN	HYPA-DBGNN
Unweighted Sampling	50.00 ± 33.33	50.00 ± 0.00	45.00 ± 28.38	45.00 ± 15.81
Weighted Sampling	45.00 ± 28.38	50.00 ± 0.00	45.00 ± 15.81	100.00 ± 0.00

5.2 Experimental Results for Empirical Data Sets

Our experiments leverage the five empirical time series datasets on dynamic graphs from [45]. This work also provides the optimal order of the higher-order model and the $\delta$ value (the maximum time difference for edges to be considered part of a causal walk) for generating the time respecting paths within each dataset. The data sets are Highschool2011 and Highschool2012 [14], Hospital [54], StudentSMS [50], and Workplace2016 [19]. This section addresses the question of how our architecture compares to the described baselines with respect to the named empirical data sets. The mean balanced accuracy and its standard deviation is reported in Table 2.

We reproduce the superior results of DBGNN compared to other baselines for all data sets except Workplace2016 for which LGNN performs better than shown in Qarkaxhija et al. [45]. The obtained standard deviations are also comparable to the work of [45].

However, HYPA-DBGNN outperforms all baselines, including DBGNN. In the following, gains are given relative to the balanced accuracy of DBGNN. For Highschool2011 and Highschool2012, the gain is smallest with $2.77\%$ and $2.27\%$, respectively. For StudentSMS and Workplace2016, the gain is about twice as large at $5.09\%$ and $4.58\%$, respectively. It is noteworthy that the baseline results for Workplace2016 are already at least $20\%$ better than for the other data sets, so the gain of $4.58\%$ is harder to achieve and brings the balanced accuracy close to the optimum. A remarkable result is the gain of $45.50\%$ for Hospital. Here, the baselines are the weakest compared to the other data sets, while with our approach only Workplace2016 is solved with a higher balanced accuracy.

The inclusion of path anomalies is beneficial for all empirical data sets considered, but the gain depends on the particular data set. Here, the results for Hospital and Workplace2016 stand out.
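The gains quoted above can be recomputed from the DBGNN and HYPA-DBGNN rows of Table 2. The following minimal sketch assumes that each gain is the relative improvement of the mean balanced accuracy over DBGNN, which is what the reported numbers suggest; the helper name `relative_gain` is purely illustrative.

```python
# Mean balanced accuracies (in %) copied from the DBGNN and HYPA-DBGNN rows of Table 2.
dbgnn = {
    "Highschool2011": 61.54,
    "Highschool2012": 64.93,
    "Hospital": 52.50,
    "StudentSMS": 57.72,
    "Workplace2016": 84.42,
}
hypa_dbgnn = {
    "Highschool2011": 63.25,
    "Highschool2012": 66.41,
    "Hospital": 76.39,
    "StudentSMS": 60.66,
    "Workplace2016": 88.29,
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent (illustrative helper)."""
    return 100.0 * (new - old) / old


gains = {name: relative_gain(hypa_dbgnn[name], dbgnn[name]) for name in dbgnn}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f} %")
```

Under this assumption the recomputed values agree with the quoted gains up to rounding in the last digit.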

Table 2: Comparison of HYPA-DBGNN with node representation learning and deep graph learning baselines for dynamic graphs. The table presents the balanced accuracy and its standard deviation for the models on empirical static node classification tasks for dynamic graphs, obtained through the outlined experiments. The best results are marked. Results with different metrics are given in Appendix F.

Model	Highschool2011	Highschool2012	Hospital	StudentSMS	Workplace2016
EVO	43.68 ± 10.91	50.05 ± 7.30	25.83 ± 8.29	55.05 ± 6.39	26.50 ± 12.08
HONEM	59.00 ± 10.61	50.49 ± 9.31	39.44 ± 17.57	53.81 ± 7.28	83.17 ± 11.14
DeepWalk	54.64 ± 17.70	49.65 ± 12.97	24.58 ± 10.92	52.78 ± 7.83	20.54 ± 9.51
Node2Vec	54.64 ± 17.70	49.65 ± 12.97	24.58 ± 10.92	52.31 ± 7.70	20.54 ± 9.51
GCN	55.00 ± 13.37	59.35 ± 11.13	43.47 ± 9.03	54.50 ± 6.40	73.33 ± 12.60
LGNN	57.72 ± 9.85	51.43 ± 17.94	44.03 ± 9.03	52.71 ± 6.63	84.83 ± 14.77
DBGNN	61.54 ± 11.13	64.93 ± 15.26	52.50 ± 19.27	57.72 ± 5.29	84.42 ± 15.59
HYPA-DBGNN	63.25 ± 16.18	66.41 ± 10.24	76.39 ± 17.12	60.66 ± 6.11	88.29 ± 10.51

5.3 Similarities in Temporal Sequences Between Empirical and Synthetic Data

The synthetic data set encodes a pattern of increased class-assortativity that is learned by HYPA-DBGNN. Figure 2 shows the deviation from the expected edge frequencies in terms of HYPA scores for the used data sets regarding the incident nodes, i.e. for each node the distribution of the average HYPA score of incident edges is plotted.
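The per-node aggregation described above can be sketched as follows. This is only an illustration for first-order edges: the input format (a dictionary from edges to HYPA scores), the function names, and the grouping by class are assumptions made for this example and are not taken from the original implementation.

```python
from collections import defaultdict


def mean_incident_score(edge_scores):
    """Average score of all edges incident to each node.

    `edge_scores` maps an edge (u, v) to its score (hypothetical input format).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (u, v), score in edge_scores.items():
        # A self-loop is counted once for its node.
        for node in ((u,) if u == v else (u, v)):
            totals[node] += score
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


def scores_by_class(node_means, node_class):
    """Group the per-node averages by class, e.g. to plot one distribution per class."""
    grouped = defaultdict(list)
    for node, mean in node_means.items():
        grouped[node_class[node]].append(mean)
    return dict(grouped)


# Toy example with made-up scores.
toy_scores = {("a", "b"): 0.9, ("b", "c"): 0.1}
toy_classes = {"a": 0, "b": 0, "c": 1}
toy_means = mean_incident_score(toy_scores)
toy_grouped = scores_by_class(toy_means, toy_classes)
print(toy_grouped)
```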

The second-order plot shows the increased class-assortativity for nodes of class 0 in the Weighted Sampling data set. The incident second-order edges have on average a larger HYPA score and thus are more often overrepresented compared to edges incident to nodes of class 1. Owing to the statistically principled inferred graph, HYPA-DBGNN is able to learn this pattern.

Hospital and Workplace2016 also exhibit such under- and overrepresented sequential patterns in both graphs that are related to distinct node classes. In Hospital, second-order edges incident to nodes of class 0 and 1 are overrepresented particularly often. However, the first-order edges incident to nodes of class 0 and 1 differ in their statistics. This observed connection between node classes and the sequential patterns containing the respective nodes supports the superior performance of HYPA-DBGNN for Hospital and Workplace2016.

We also study the impact of statistical information and its different formulations in an ablation study in Appendix D, which supports the relevance of statistical information.

6 Conclusion

In this work, we propose HYPA-DBGNN, a novel deep graph learning architecture that accounts for time-respecting paths in temporal graph data with high temporal resolution. Different from existing graph learning methods that employ neural message passing along time-respecting paths, we introduce a two-step approach which first infers anomalous sequential patterns based on an analytically tractable null model for time-respecting paths that preserves both the topology and the frequency, but not the temporal ordering, of time-stamped edges. In a second step, we apply neural message passing on an augmented higher-order De Bruijn graph, whose edges capture time-respecting paths that are overrepresented compared to the expectation from that random baseline. An experimental evaluation of our approach in a synthetic model and five empirical data sets on temporal graphs reveals that our proposed method considerably improves node classification compared to seven baseline methods in all studied data sets, with performance gains ranging from 2.27 % to 45.5 %. An investigation of HYPA scores – which capture the degree to which time-respecting path statistics deviate from what is expected in a null model – as well as an ablation study show that the correlation between node classes and the magnitude of the deviations from the random expectation is particularly pronounced for those empirical temporal graphs where we also observe the largest performance gains for our method. This finding highlights that the innovative combination of statistical inference and neural message passing, which is the key contribution of our work, leads to considerable advantages for temporal graph learning.

Despite these contributions, our work raises a number of open questions that we did not address within the scope of this work. First, in order to isolate the influence of sequential patterns in temporal graphs, here we solely focused on the sequence of time-stamped edges, thus neglecting additional node attributes and edge features. Future studies building on our work could thus additionally consider richer node and edge information, which is likely to further improve the performance of our model. Moreover, the framework of hypergeometric statistical ensembles allows the inclusion of non-homogeneous “edge propensities” based, e.g., on the homophily of nodes with similar attributes. This could possibly be used to generate domain-specific null models leading to a graph learning architecture that includes a non-trivial inductive bias, which we did not explore in this work. Bridging the gap between the application of statistical graph ensembles in network science and deep graph learning, we finally argue that our work opens broader perspectives for the integration of statistical graph inference, graph augmentation, and neural message passing. In particular, applying our method to the inference of (first-order) edges in static graphs could be a promising approach to address the issue that empirical graphs are rarely unspoiled reflections of reality, but are often subject to measurement errors and noise. The need to combine graph inference techniques with neural message passing [36, 41, 61] has recently been identified as a major challenge for deep graph learning, and our work can be seen as a step in this direction.

Acknowledgement

Jan von Pichowski, Lisi Qarkaxhija, and Ingo Scholtes acknowledge funding from the German Federal Ministry of Education and Research (BMBF) via the Project "Software Campus 3.0", Grant No. (FKZ) 01IS24030, which is running from 01.04.2024 to 31.03.2026. Ingo Scholtes acknowledges funding through the Swiss National Science Foundation (SNF), Grant No. 176938.

References

Barbero et al. [2023] F. Barbero, A. Velingker, A. Saberi, M. Bronstein, and F. Di Giovanni. Locality-aware graph-rewiring in gnns. arXiv preprint arXiv:2310.01668, 2023.
Belth et al. [2020] C. Belth, F. Kamran, D. Tjandra, and D. Koutra. When to remember where you came from: node representation learning in higher-order networks. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’19, page 222–225, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450368681. doi: 10.1145/3341161.3342911. URL https://doi.org/10.1145/3341161.3342911.
Bevilacqua et al. [2021] B. Bevilacqua, F. Frasca, D. Lim, B. Srinivasan, C. Cai, G. Balamurugan, M. M. Bronstein, and H. Maron. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910, 2021.
Bodnar et al. [2021a] C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and lehman go cellular: Cw networks. Advances in neural information processing systems, 34:2625–2640, 2021a.
Bodnar et al. [2021b] C. Bodnar, F. Frasca, Y. Wang, N. Otter, G. F. Montufar, P. Lio, and M. Bronstein. Weisfeiler and lehman go topological: Message passing simplicial networks. In International Conference on Machine Learning, pages 1026–1037. PMLR, 2021b.
Casiraghi and Nanumyan [2021] G. Casiraghi and V. Nanumyan. Configuration models as an urn problem. Scientific Reports, 11(1), June 2021. ISSN 2045-2322. doi: 10.1038/s41598-021-92519-y. URL http://dx.doi.org/10.1038/s41598-021-92519-y.
Chen et al. [2020] Z. Chen, X. Li, and J. Bruna. Supervised community detection with line graph neural networks, 2020.
Chien et al. [2021] E. Chien, C. Pan, J. Peng, and O. Milenkovic. You are allset: A multiset function framework for hypergraph neural networks. arXiv preprint arXiv:2106.13264, 2021.
Cotta et al. [2021] L. Cotta, C. Morris, and B. Ribeiro. Reconstruction for powerful graph representations. Advances in Neural Information Processing Systems, 34:1713–1726, 2021.
De Bruijn [1946] N. G. De Bruijn. A combinatorial problem. Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie van Wetenschappen te Amsterdam, 49(7):758–764, 1946.
Errica et al. [2019] F. Errica, M. Podda, D. Bacciu, and A. Micheli. A fair comparison of graph neural networks for graph classification. CoRR, abs/1912.09893, 2019. URL http://arxiv.org/abs/1912.09893.
Fatemi et al. [2021] B. Fatemi, L. El Asri, and S. M. Kazemi. Slaps: Self-supervision improves structure learning for graph neural networks. Advances in Neural Information Processing Systems, 34:22667–22681, 2021.
Feng et al. [2020] W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems, 33:22092–22103, 2020.
Fournet and Barrat [2014] J. Fournet and A. Barrat. Contact patterns among high school students. PLoS ONE, 9(9):e107878, Sept. 2014. ISSN 1932-6203. doi: 10.1371/journal.pone.0107878. URL http://dx.doi.org/10.1371/journal.pone.0107878.
Franceschi et al. [2019] L. Franceschi, M. Niepert, M. Pontil, and X. He. Learning discrete structures for graph neural networks. In International conference on machine learning, pages 1972–1982. PMLR, 2019.
Gasteiger et al. [2019] J. Gasteiger, S. Weißenberger, and S. Günnemann. Diffusion improves graph learning. Advances in neural information processing systems, 32, 2019.
Georgiev et al. [2022] D. Georgiev, M. Brockschmidt, and M. Allamanis. Heat: Hyperedge attention networks. arXiv preprint arXiv:2201.12113, 2022.
Grover and Leskovec [2016] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
Génois et al. [2015] M. Génois, C. L. Vestergaard, J. Fournet, A. Panisson, I. Bonmarin, and A. Barrat. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science, 3(3):326–347, Mar. 2015. ISSN 2050-1250. doi: 10.1017/nws.2015.10. URL http://dx.doi.org/10.1017/nws.2015.10.
Hajij et al. [2022] M. Hajij, G. Zamzmi, T. Papamarkou, N. Miolane, A. Guzmán-Sáenz, K. N. Ramamurthy, T. Birdal, T. K. Dey, S. Mukherjee, S. N. Samaga, et al. Topological deep learning: Going beyond graph data. arXiv preprint arXiv:2206.00606, 2022.
Hajiramezanali et al. [2019] E. Hajiramezanali, A. Hasanzadeh, K. Narayanan, N. Duffield, M. Zhou, and X. Qian. Variational graph recurrent neural networks. Advances in neural information processing systems, 32, 2019.
Hamilton et al. [2018] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs, 2018.
Huang and Yang [2021] J. Huang and J. Yang. Unignn: a unified framework for graph and hypergraph neural networks. arXiv preprint arXiv:2105.00956, 2021.
Hwang et al. [2021] E. Hwang, V. Thost, S. S. Dasgupta, and T. Ma. Revisiting virtual nodes in graph neural networks for link prediction. 2021.
James et al. [2013] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013. URL https://faculty.marshall.usc.edu/gareth-james/ISL/.
Jiang et al. [2019] B. Jiang, Z. Zhang, D. Lin, J. Tang, and B. Luo. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11313–11320, 2019.
Jin et al. [2020] W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 66–74, 2020.
Kazi et al. [2022] A. Kazi, L. Cosmo, S.-A. Ahmadi, N. Navab, and M. M. Bronstein. Differentiable graph module (dgm) for graph convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1606–1617, 2022.
Kipf and Welling [2017] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks, 2017.
Kreuzer et al. [2021] D. Kreuzer, D. Beaini, W. Hamilton, V. Létourneau, and P. Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
LaRock et al. [2020] T. LaRock, V. Nanumyan, I. Scholtes, G. Casiraghi, T. Eliassi-Rad, and F. Schweitzer. HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks, pages 460–468. Society for Industrial and Applied Mathematics, Jan. 2020. doi: 10.1137/1.9781611976236.52. URL http://dx.doi.org/10.1137/1.9781611976236.52.
Liben-Nowell and Kleinberg [2007] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007. ISSN 1532-2890. doi: 10.1002/asi.20591. URL http://dx.doi.org/10.1002/asi.20591.
Liu et al. [2022] Y. Liu, Z. Zhang, Y. Liu, and Y. Zhu. Gatsmote: Improving imbalanced node classification on graphs via attention and homophily. Mathematics, 10(11), 2022. ISSN 2227-7390. doi: 10.3390/math10111799. URL https://www.mdpi.com/2227-7390/10/11/1799.
Longa et al. [2023] A. Longa, V. Lachi, G. Santin, M. Bianchini, B. Lepri, P. Lio, F. Scarselli, and A. Passerini. Graph neural networks for temporal graphs: State of the art, open challenges, and opportunities. arXiv preprint arXiv:2302.01018, 2023.
Lu et al. [2024] J. Lu, Y. Xu, H. Wang, Y. Bai, and Y. Fu. Latent graph inference with limited supervision. Advances in Neural Information Processing Systems, 36, 2024.
Ma et al. [2019] J. Ma, W. Tang, J. Zhu, and Q. Mei. A flexible generative framework for graph-based semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019.
Mialon et al. [2021] G. Mialon, D. Chen, M. Selosse, and J. Mairal. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021.
Molloy and Reed [1995] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Struct. Alg., 6(2-3):161–180, Mar. 1995. ISSN 1098-2418. doi: 10.1002/rsa.3240060204.
Monti et al. [2018] F. Monti, K. Otness, and M. M. Bronstein. Motifnet: a motif-based graph convolutional network for directed graphs. In 2018 IEEE data science workshop (DSW), pages 225–228. IEEE, 2018.
Morris et al. [2020] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. Tudataset: A collection of benchmark datasets for learning with graphs. CoRR, abs/2007.08663, 2020. URL https://arxiv.org/abs/2007.08663.
Pal et al. [2020] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, Y. Xu, and M. Coates. Non parametric graph learning for bayesian graph neural networks. In Conference on uncertainty in artificial intelligence, pages 1318–1327. PMLR, 2020.
Perozzi et al. [2014] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’14. ACM, Aug. 2014. doi: 10.1145/2623330.2623732. URL http://dx.doi.org/10.1145/2623330.2623732.
Pham et al. [2017] T. Pham, T. Tran, H. Dam, and S. Venkatesh. Graph classification via deep learning with virtual nodes. arXiv preprint arXiv:1708.04357, 2017.
Phan et al. [2023] H. T. Phan, N. T. Nguyen, and D. Hwang. Fake news detection: A survey of graph neural network methods. Appl. Soft Comput., 139(C), may 2023. ISSN 1568-4946. doi: 10.1016/j.asoc.2023.110235. URL https://doi.org/10.1016/j.asoc.2023.110235.
Qarkaxhija et al. [2022] L. Qarkaxhija, V. Perri, and I. Scholtes. De bruijn goes neural: Causality-aware graph neural networks for time series data on dynamic graphs, 2022.
Rong et al. [2019] Y. Rong, W. Huang, T. Xu, and J. Huang. Dropedge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903, 2019.
Rossi et al. [2020] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020.
Saebi et al. [2020] M. Saebi, G. L. Ciampaglia, L. M. Kaplan, and N. V. Chawla. Honem: Learning embedding for higher order networks. Big Data, 8(4):255–269, Aug. 2020. ISSN 2167-647X. doi: 10.1089/big.2019.0169. URL http://dx.doi.org/10.1089/big.2019.0169.
Sankar et al. [2020] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th international conference on web search and data mining, pages 519–527, 2020.
Sapiezynski et al. [2019] P. Sapiezynski, A. Stopczynski, D. Lassen, and S. Lehmann. Interaction data from the copenhagen networks study. Scientific data, 6, Dec. 2019. ISSN 2052-4463. doi: 10.1038/s41597-019-0325-x.
Scholtes [2017] I. Scholtes. When is a network a network? multi-order graphical model selection in pathways and temporal networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1037–1046, 2017.
Stokes et al. [2020] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J. Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay, and J. J. Collins. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702.e13, 2020. ISSN 0092-8674. doi: 10.1016/j.cell.2020.01.021. URL https://www.sciencedirect.com/science/article/pii/S0092867420301021.
Topping et al. [2021] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522, 2021.
Vanhems et al. [2013] P. Vanhems, A. Barrat, C. Cattuto, J.-F. Pinton, N. Khanafer, C. Régis, B.-a. Kim, B. Comte, and N. Voirin. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLoS ONE, 8(9):e73970, Sept. 2013. ISSN 1932-6203. doi: 10.1371/journal.pone.0073970. URL http://dx.doi.org/10.1371/journal.pone.0073970.
Veličković et al. [2018] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks, 2018.
Wang et al. [2020] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi. Graphcrop: Subgraph cropping for graph classification. arXiv preprint arXiv:2009.10564, 2020.
Xu et al. [2020] D. Xu, C. Ruan, E. Körpeoglu, S. Kumar, and K. Achan. Inductive representation learning on temporal graphs. CoRR, abs/2002.07962, 2020. URL https://arxiv.org/abs/2002.07962.
Ying et al. [2021] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877–28888, 2021.
You et al. [2020] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
Zhang et al. [2021] X.-M. Zhang, L. Liang, L. Liu, and M.-J. Tang. Graph neural networks and their current applications in bioinformatics. Frontiers in Genetics, 12, 2021. ISSN 1664-8021. doi: 10.3389/fgene.2021.690049. URL https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.690049.
Zhang et al. [2019] Y. Zhang, S. Pal, M. Coates, and D. Ustebay. Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 5829–5836, 2019.
Zhao et al. [2021a] J. Zhao, Y. Dong, M. Ding, E. Kharlamov, and J. Tang. Adaptive diffusion in graph neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021a. URL https://openreview.net/forum?id=0Kb33DHJ1g.
Zhao et al. [2021b] L. Zhao, W. Jin, L. Akoglu, and N. Shah. From stars to subgraphs: Uplifting any gnn with local structure awareness. arXiv preprint arXiv:2110.03753, 2021b.
Zhao et al. [2021c] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah. Data augmentation for graph neural networks. In Proceedings of the aaai conference on artificial intelligence, volume 35, pages 11015–11023, 2021c.
Zhao et al. [2021d] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah. Data augmentation for graph neural networks. In Proceedings of the aaai conference on artificial intelligence, volume 35, pages 11015–11023, 2021d.
Zhao et al. [2022] T. Zhao, W. Jin, Y. Liu, Y. Wang, G. Liu, S. Günnemann, N. Shah, and M. Jiang. Graph data augmentation for graph machine learning: A survey. arXiv preprint arXiv:2202.08871, 2022.

Appendix A Data

In this section, we describe how the synthetic data sets are created, their characteristics, and the properties of the empirical data sets used.

A.1 Synthetic Data Creation Procedure

We use two synthetic data sets that are created with the following procedure. Figure 3 gives an overview of the procedure.

The algorithm consists of two main parts aimed at constructing the first-order and second-order topology of the network, respectively. Initially, the algorithm receives as input parameters the set of nodes, a node-to-class mapping, a bias parameter, and the desired number of paths of length $k$ (k-th order edges) to generate.

In the first part, we assign the node degrees, and consequently the values of the $\Xi$ matrix as $\Xi=k_{in}\cdot k_{out}$. To do this, we give each node a random weight sampled from a continuous uniform distribution $\mathcal{U}[0,1]$. Next, for each node, we sample a number of (unweighted) edge stubs from a multinomial distribution. The number of categories in the multinomial distribution equals the number of nodes, and the probability for each category, i.e., for a stub to attach to the corresponding node, is proportional to the previously assigned node weight. The number of stubs we sample equals the desired number of paths of length $k$ given as input. Once we have this, we randomly connect the in- and out-stubs, thus obtaining the multi-set of multi-edges and the first-order topology. Notice that the multi-edges created in this step also yield the higher-order nodes, and that the multi-edge frequencies correspond to their weighted in- and out-degrees.
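The first part of the procedure can be sketched with NumPy as follows. This is a minimal illustration, not the original implementation: the function name `sample_first_order`, the use of two independent multinomial draws for out- and in-stubs, the orientation $\Xi_{ij}=k^{out}_{i}k^{in}_{j}$, and the fact that self-loops are allowed are all assumptions made for this example.

```python
import numpy as np


def sample_first_order(num_nodes, num_paths, seed=0):
    """Sketch of the first-order stub sampling (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Random node weights from U[0, 1], normalised to multinomial probabilities.
    weights = rng.uniform(0.0, 1.0, size=num_nodes)
    probs = weights / weights.sum()

    # Assumption: out- and in-stubs are drawn separately with the same probabilities.
    k_out = rng.multinomial(num_paths, probs)
    k_in = rng.multinomial(num_paths, probs)

    # Expand degrees into stub lists and match them uniformly at random.
    out_stubs = np.repeat(np.arange(num_nodes), k_out)
    in_stubs = rng.permutation(np.repeat(np.arange(num_nodes), k_in))

    # Multi-edge frequencies; np.add.at accumulates repeated index pairs correctly.
    counts = np.zeros((num_nodes, num_nodes), dtype=np.int64)
    np.add.at(counts, (out_stubs, in_stubs), 1)

    # Assumption: Xi[i, j] = k_out[i] * k_in[j].
    xi = np.outer(k_out, k_in)
    return counts, xi, k_out, k_in


counts, xi, k_out, k_in = sample_first_order(num_nodes=5, num_paths=100, seed=1)
print(counts)
```

By construction, the row and column sums of `counts` reproduce the sampled out- and in-degrees, so the multi-edge frequencies are consistent with the degrees that define $\Xi$.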

In the second part, an iterative process creates higher-order edges. First, an out-stub $(\langle v_{0}v_{1}\dots v_{k-1}\rangle,\,\cdot\,)$ is sampled proportional to its weighted out-degree. Subsequently, a set P of potential in-stubs $(\,\cdot\,,\langle v_{1}\dots v_{k-1}v_{k}\rangle)$ is identified, ensuring valid connections between higher-order nodes by applying the de Bruijn condition that requires the last $k-1$ elements of $(\langle v_{0}v_{1}\dots v_{k-1}\rangle,\,\cdot\,)$ to match the first $k-1$ elements of $(\,\cdot\,,\langle v_{1}\dots v_{k-1}v_{k}\rangle)$ . The sampling process for successor in-stubs from P is biased based on the classes of the first-order nodes $v_{0}$ , $v_{1}\dots v_{k-1}$ , and $v_{k}$ . Specifically, counts are artificially inflated by the bias parameter for in-stubs where all $k$ nodes belong to the same class, encoding the desired pattern of preferential attachment. The selected out-stub $(\langle v_{0}v_{1}\dots v_{k-1}\rangle,\,\cdot\,)$ and in-stub $(\,\cdot\,,\langle v_{1}\dots v_{k-1}v_{k}\rangle)$ form a higher-order edge $(\langle v_{0}v_{1}\dots v_{k-1}\rangle,\langle v_{1}\dots v_{k-1}v_{k}\rangle)$ in the final network. This iterative process continues until all stubs are connected, resulting in paths of length $k$ that predominantly connect nodes within the same class, with the degree of class-assortativity controlled by the bias parameter.
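For $k=2$, the iterative matching of out-stubs and in-stubs can be sketched as below. Several details are assumptions made only for this illustration: the bias is applied as a multiplicative factor `1 + bias` on the remaining in-stub count, an out-stub without any remaining compatible in-stub is discarded, and the function name `sample_second_order` is hypothetical. The loop is written for readability and is far too slow for data sets of the size used in the experiments.

```python
import random
from collections import Counter


def sample_second_order(edge_counts, node_class, bias, seed=0):
    """Sketch of biased second-order edge sampling.

    `edge_counts` maps a first-order edge (v0, v1), i.e. a second-order node,
    to its multi-edge frequency; `node_class` maps nodes to class labels.
    """
    rng = random.Random(seed)
    out_left = Counter(edge_counts)  # remaining out-stubs of <v0 v1>
    in_left = Counter(edge_counts)   # remaining in-stubs of <v1 v2>
    higher_order_edges = Counter()

    while sum(out_left.values()) > 0:
        # Sample an out-stub proportional to the remaining weighted out-degree.
        sources = [s for s, c in out_left.items() if c > 0]
        src = rng.choices(sources, weights=[out_left[s] for s in sources])[0]
        out_left[src] -= 1
        v0, v1 = src

        # De Bruijn condition: the successor must start with v1.
        candidates = [t for t, c in in_left.items() if c > 0 and t[0] == v1]
        if not candidates:
            continue  # assumption: unmatched out-stubs are dropped

        # Assumption: counts are inflated by (1 + bias) if all nodes share a class.
        stub_weights = []
        for t in candidates:
            same = node_class[v0] == node_class[v1] == node_class[t[1]]
            stub_weights.append(in_left[t] * (1.0 + bias if same else 1.0))

        dst = rng.choices(candidates, weights=stub_weights)[0]
        in_left[dst] -= 1
        higher_order_edges[(src, dst)] += 1

    return higher_order_edges


toy_edges = {(0, 1): 3, (1, 0): 2, (1, 2): 2, (2, 1): 1}
toy_class = {0: 0, 1: 0, 2: 1}
toy_result = sample_second_order(toy_edges, toy_class, bias=0.5, seed=42)
print(toy_result)
```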

A.2 Synthetic Data Characteristics

We use two synthetic data sets with $n=2^{22}$ paths. The paths induce first- and second-order graphs with heterogeneous edge statistics. Figure 4 presents the edge statistics for the synthetic data sets. At the given resolution, the graph for the synthetic data set with the implanted pattern looks identical to the one without the pattern. This is because both are constructed from the same expected first-order statistics, which define the $\Xi$-matrix for the second-order statistics, and because the bias parameter used during sampling is sufficiently small. However, the resulting edge frequencies vary between the two graphs due to the random sampling procedure.

Figure 5 presents the absolute difference of the first- and second-order edge frequencies between the two data sets. Notably, all edges whose incident nodes are predominantly connected are sampled more often due to the bias parameter. This class-assortativity needs to be learned by the machine learning model.

The comparison in Figure 6 of the frequencies with the expected frequencies given by the $\Xi$-matrix confirms the differences between the two synthetic data sets and highlights the encoded class-assortativity in the data set with the biased sampling.

A.3 Properties of Empirical Data

Table 3: Overview of time series data and ground truth node classes used in the experiments. $\delta$ describes the maximum time difference for edges to be considered part of a causal walk.

Data Set	Ref.	$|V|$	$|E|$	$|V^{(2)}|$	$|E^{(2)}|$	Classes (Sizes)	$\delta$
Highschool2011	[14]	126	3355	3042	17141	2 (85/41)	4
Highschool2012	[14]	180	4399	3965	20614	2 (132/48)	4
Hospital	[54]	75	2052	2028	15500	4 (29/27/11/8)	4
StudentSMS	[50]	429	1160	733	846	2 (314/115)	40
Workplace2016	[19]	92	1491	1431	7121	5 (34/26/15/13/4)	4

Appendix B Comments on Computational Complexity

There are two distinct steps to consider when reasoning about the complexity of our approach. First, there is the preprocessing step that creates the augmented graphs, i.e., the computation of the HYPA scores and the removal of the under-represented paths. Second, the graph neural network is trained on those graphs. For both steps, the complexity is determined by the number of edges in the higher-order De Bruijn graph. In the preprocessing, we calculate the HYPA score for higher-order edges.

The worst-case for the number of higher-order edges is given by the number of different sequences of length $k$ , i.e., $|V|^{k}$ for a network with $|V|$ nodes. However, two arguments show that we can expect much lower complexity in real-world data. First of all, real-world networks are usually sparse, which implies that most sequences cannot occur as they would otherwise violate the network topology.

LaRock et al. [31] use this argument and prove that the bound on the complexity of their algorithm can be tightened to $\Delta G^{(k)}\leq|V|^{2}\lambda_{1}^{k}$, where $|V|$ denotes the number of nodes in the first-order graph $G$ and $\lambda_{1}$ is the leading eigenvalue of the binary adjacency matrix of $G$. They conclude that the HYPA score calculation scales linearly with the number of paths $N$ in the given data set for sparse real-world graphs, a moderate order $k$, and a sufficiently large $N$. Qarkaxhija et al. [45] also use the argument of sparsity to further limit the complexity of the De Bruijn graph. They note that the number of walks of length $k$ becoming higher-order edges in the higher-order De Bruijn graph is also bounded by $\sum_{ij}A^{k}_{ij}\leq|V|^{k}$, where $A^{k}$ is the k-th power of the binary adjacency matrix $A$ of $G$.

Furthermore, higher-order networks are even sparser than what we would expect based on the first-order topology. This is because the number of different time-respecting paths occurring on a network is generally much lower than the number of possible paths. Qarkaxhija et al. [45] demonstrate this in their appendix by plotting the number of realized walks at each length and showing that in empirical graphs only a small fraction of walks is realized due to the restriction to time-respecting paths. By studying the complexity of the empirical data sets used, they argue that De Bruijn graphs are applicable to real-world tasks.

We consider a path data set $S$ with $N$ entries. The number of edges in the k-th-order De Bruijn graph is denoted as $\Delta G^{(k)}$ . LaRock et al. [31] state that the asymptotic runtime of HYPA is $O(N+\Delta G^{(k)})$ . A trivial upper-bound for $\Delta G^{(k)}$ is the fully connected case with $|V|^{k+1}$ . This trivial case is also considered by [45] when they argue that the complexity of message passing on the De Bruijn graph is bounded.
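To give a feeling for how far a sparse topology lies below these bounds, the following sketch counts the walks of length $k$ via $\sum_{ij}A^{k}_{ij}$ and compares the result with the spectral bound $|V|^{2}\lambda_{1}^{k}$ and the trivial bound $|V|^{k+1}$. The ring graph and the function names are chosen only for illustration; the example uses a symmetric adjacency matrix, for which the walk count is guaranteed to stay below the spectral bound.

```python
import numpy as np


def walk_count(adj, k):
    """Number of walks of length k, i.e. the sum of all entries of A^k."""
    return int(np.linalg.matrix_power(adj, k).sum())


def spectral_bound(adj, k):
    """|V|^2 * lambda_1^k with lambda_1 the leading eigenvalue (in modulus) of A."""
    lam1 = np.max(np.abs(np.linalg.eigvals(adj)))
    return adj.shape[0] ** 2 * lam1 ** k


def trivial_bound(num_nodes, k):
    """Fully connected case: |V|^(k+1) node sequences of length k + 1."""
    return num_nodes ** (k + 1)


# Undirected ring with 6 nodes as a toy sparse topology.
n = 6
ring = np.zeros((n, n), dtype=int)
for i in range(n):
    ring[i, (i + 1) % n] = 1
    ring[(i + 1) % n, i] = 1

for k in (2, 3, 4):
    print(k, walk_count(ring, k), spectral_bound(ring, k), trivial_bound(n, k))
```

In this toy example the gap between the realized number of walks and the trivial bound widens quickly with $k$, which is the effect the sparsity argument relies on.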

Appendix C Variants of HYPA-DBGNN

In this section, we present other variations of our main HYPA-DBGNN architecture.

C.1 Base Architecture without Anomalies (HYPA-DBGNN^-)

Replacing the HYPA scores with the absolute edge frequencies in the message passing procedure leads to the original message passing layers proposed by Kipf and Welling [29]. The overall structure including the bipartite layers is kept. The comparison of this model (HYPA-DBGNN^-) with HYPA-DBGNN reinforces the understanding of the significance of HYPA scores.

C.2 Edge Embedded HYPA Scores (HYPA-DBGNN^E)

For HYPA-DBGNN, the HYPA scores are used in a graph model selection step to enhance the message passing. In contrast, for HYPA-DBGNN^E the HYPA scores are understood as additional edge attributes whose significance is learned by an adapted graph convolution operation that embeds the edge attributes into the incident node attributes during message passing in the first graph neural network layers. The augmented propagation rule is given as

$$\vec{h}_{v_{i}}^{k,1}=\sigma\left(\sum_{j}\frac{1}{c_{ij}}\left(\vec{h}_{v_{j}}^{k,0}W^{k,1}+\vec{h}_{e_{ij}^{k}}W^{k,e}\right)\right), \tag{4}$$

with the first hidden representation $\vec{h}_{v_{j}}^{k,0}$ of node $v_{j}\in V^{(k)}$, the inferred HYPA scores in $\vec{h}_{e_{ij}^{k}}$ for the $k$-th-order edge $e^{k}_{ij}\in E^{(k)}$, the trainable weight matrices $W^{k,1}\in\mathbb{R}^{H^{1}\times H^{0}}$ for the nodes and $W^{k,e}\in\mathbb{R}^{H^{1}\times 1}$ for the edges, and the normalization factor $c_{ij}$ as defined by Kipf and Welling [29].
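A forward pass of the propagation rule in Eq. (4) can be sketched in NumPy as follows. This is not the original implementation: the edge-list layout, the choice of ReLU for $\sigma$, and the normalization $c_{ij}=\sqrt{d_{i}d_{j}}$ computed from in-degrees (clipped to at least one) are assumptions for illustration. The weights are stored with the shapes stated in the text and applied transposed, so that node features can be kept as row vectors.

```python
import numpy as np


def edge_embedded_layer(h_nodes, edge_index, edge_scores, w_node, w_edge):
    """Sketch of Eq. (4).

    h_nodes:     (n, H0) node features
    edge_index:  (2, m) integer array, row 0 = source j, row 1 = target i
    edge_scores: (m,) one HYPA score per edge
    w_node:      (H1, H0) node weight matrix
    w_edge:      (H1, 1) edge weight matrix
    """
    n = h_nodes.shape[0]
    src, dst = edge_index

    # Assumption: c_ij = sqrt(d_i * d_j) with in-degrees clipped to at least 1.
    deg = np.maximum(np.bincount(dst, minlength=n), 1).astype(float)
    c = np.sqrt(deg[src] * deg[dst])

    # Message of edge (j -> i): h_j W^{k,1} + h_e W^{k,e}.
    messages = h_nodes[src] @ w_node.T + edge_scores[:, None] @ w_edge.T

    # Sum the normalised messages at the target nodes.
    out = np.zeros((n, w_node.shape[0]))
    np.add.at(out, dst, messages / c[:, None])

    # Assumption: sigma is a ReLU.
    return np.maximum(out, 0.0)


# Toy input: edges 0 -> 1 and 2 -> 1 with made-up scores.
h = np.ones((3, 2))
edges = np.array([[0, 2], [1, 1]])
scores = np.array([0.5, 1.0])
layer_out = edge_embedded_layer(h, edges, scores, np.ones((4, 2)), np.ones((4, 1)))
print(layer_out)
```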

C.3 Z-Score as Replacement for HYPA Scores (HYPA-DBGNN^Z)

The HYPA scores are based on the CDF. As a replacement for the CDF, HYPA-DBGNN^Z uses a transformed Z-score instead of the HYPA score. The underlying soft configuration model provides the required expected value and variance with

$$\mathbb{E}[X_{ij}]=m\frac{\Xi_{ij}}{M} \tag{5}$$

and

$$\mathrm{Var}[X_{ij}]=m\frac{M-m}{M-1}\frac{\Xi_{ij}}{M} \tag{6}$$

needed to define the Z-score as

$$z(A_{ij})=\frac{A_{ij}-\mathbb{E}[X_{ij}]}{\sqrt{\mathrm{Var}[X_{ij}]}}. \tag{7}$$

In contrast to the HYPA score, the Z-score is unbounded and can be negative. Edges with a negative Z-score are excluded because they are under-represented. Likewise, in HYPA-DBGNN under-represented edges are removed in most cases, too, because their HYPA scores are approximately zero. Additionally, edges with a Z-score smaller than one are removed, following the same argument: they do not make an unexpectedly large contribution to the graph and are larger than 0 only due to noisy fluctuations in the frequencies. The resulting restricted Z-score is logarithmically transformed because of the large spread observed in empirical data, leading to the final replacement for the HYPA score:

\displaystyle z^{\prime}(e_{ij})=\begin{cases}0&\text{ if }z(e_{ij})<1,\\ \log(z(e_{ij}))&\text{ otherwise}\end{cases}

(8)
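Eqs. (5) to (8) can be traced with a short pure-Python sketch. The function names `z_score` and `transformed_z_score` and the toy values are hypothetical and not taken from the reference implementation. The variance is computed exactly as stated in Eq. (6), and the sketch assumes $0<m<M$ and $\Xi_{ij}>0$ so that the variance is positive.

```python
import math

def z_score(a_ij, xi_ij, m, M):
    """Z-score of an observed edge weight a_ij, following Eqs. (5)-(7)."""
    expected = m * xi_ij / M                        # Eq. (5)
    variance = m * (M - m) / (M - 1) * xi_ij / M    # Eq. (6), as stated in the text
    return (a_ij - expected) / math.sqrt(variance)  # Eq. (7)

def transformed_z_score(z):
    """Restricted, log-transformed Z-score of Eq. (8)."""
    return 0.0 if z < 1.0 else math.log(z)

# Toy values (illustrative only): M = 100, m = 20, Xi_ij = 10.
over = z_score(a_ij=6, xi_ij=10, m=20, M=100)   # observed above the expectation of 2
under = z_score(a_ij=1, xi_ij=10, m=20, M=100)  # observed below the expectation of 2

score_over = transformed_z_score(over)    # positive, edge is kept
score_under = transformed_z_score(under)  # 0.0, edge is treated as removed
```

With these toy values the expected weight is 2, so an observed weight of 6 yields a Z-score above one and a positive transformed score, while an observed weight of 1 yields a negative Z-score that is mapped to zero.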

Appendix D Ablation Study: Impact of Statistical Information

We conduct an ablation study in which we compare our architectures HYPA-DBGNN, HYPA-DBGNN^E and HYPA-DBGNN^Z to the base architecture HYPA-DBGNN^-, which does not use statistical information. The results in Table 4 show how the addition of statistical information affects the prediction capability of the architectures.

Comparing HYPA-DBGNN to HYPA-DBGNN^-, we see that the statistical information plays an important role for all data sets. Most importantly, the comparison shows that the improvements for Hospital are indeed related to the additional information.

HYPA-DBGNN^E, with edge-encoded statistical features, performs better than the uninformed baseline but is weaker than HYPA-DBGNN on four of the five data sets, most clearly on Hospital. Even when the edge encoder is able to learn the significance of the HYPA scores, the structural graph correction applied in HYPA-DBGNN is still missing. HYPA-DBGNN^Z performs weakly on data sets where we do not see direct patterns in the analysis but works well for Hospital. Why the Z-score is more susceptible to data sets with weak or no patterns remains to be explored.

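Table 4: Ablation study for HYPA-DBGNN, reporting balanced accuracy. HYPA-DBGNN achieves the best result on every data set except Workplace2016, where HYPA-DBGNN^E is best.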

Model	Highschool2011	Highschool2012	Hospital	StudentSMS	Workplace2016
HYPA-DBGNN	63.25 ± 16.18	66.41 ± 10.24	76.39 ± 17.12	60.66 ± 6.11	88.29 ± 10.51
HYPA-DBGNN^E	61.54 ± 13.62	64.94 ± 17.71	59.03 ± 12.72	60.46 ± 9.42	88.50 ± 13.57
HYPA-DBGNN^Z	53.97 ± 17.59	59.63 ± 15.74	69.31 ± 11.74	53.45 ± 7.50	88.42 ± 10.88
HYPA-DBGNN^-	57.67 ± 17.16	64.49 ± 15.27	55.83 ± 19.27	56.23 ± 10.41	86.46 ± 12.65
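The comparisons drawn from Table 4 can be checked with a small sketch that subtracts the baseline mean from each variant's mean. The names `TABLE4` and `gain_over_baseline` are hypothetical. Only the means are copied from the table, so the differences are descriptive and ignore the large standard deviations.

```python
# Balanced accuracy means copied from Table 4 (standard deviations omitted).
DATASETS = ["Highschool2011", "Highschool2012", "Hospital", "StudentSMS", "Workplace2016"]
TABLE4 = {
    "HYPA-DBGNN":   [63.25, 66.41, 76.39, 60.66, 88.29],
    "HYPA-DBGNN^E": [61.54, 64.94, 59.03, 60.46, 88.50],
    "HYPA-DBGNN^Z": [53.97, 59.63, 69.31, 53.45, 88.42],
    "HYPA-DBGNN^-": [57.67, 64.49, 55.83, 56.23, 86.46],
}

def gain_over_baseline(model, baseline="HYPA-DBGNN^-"):
    """Difference in mean balanced accuracy between a model and the baseline, per data set."""
    return {
        name: round(a - b, 2)
        for name, a, b in zip(DATASETS, TABLE4[model], TABLE4[baseline])
    }

for model in ("HYPA-DBGNN", "HYPA-DBGNN^E", "HYPA-DBGNN^Z"):
    print(model, gain_over_baseline(model))
```

The gains of HYPA-DBGNN and HYPA-DBGNN^E are positive on every data set, whereas HYPA-DBGNN^Z falls below the baseline on Highschool2011, Highschool2012 and StudentSMS.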

Appendix E Experiment Resources and Reproducibility

We performed the experiments on a single PC with an NVIDIA GeForce RTX 3070 with 8 GB of memory. On average, a single experiment repetition takes approximately 5 minutes, depending on the method and the data set. We run 4 experiments in parallel. We test the 11 methods (8 in the main study, 3 in the ablation study) with a parameter search over at most 25 variants on 7 data sets (5 empirical, 2 synthetic). All in all, the estimated time for the experiments is approximately 400 hours, excluding pre-studies. While this is only a rough estimate, it reflects the order of magnitude of the time needed to run all experiments.

To reproduce the experiments, we provide a reference implementation at https://github.com/jvpichowski/HYPA-DBGNN together with the synthetic and empirical data sets, their splits and their licenses. The baseline implementations are reused from the DBGNN reference paper [45], whose authors also parse and provide the empirical data sets used here.

We include a self-contained benchmark that compares HYPA-DBGNN to other methods, including strong candidates such as GCN and DBGNN, following the described evaluation procedure. The benchmark is kept as concise as possible so that the reader can focus on the main contributions. It can be used to reproduce the presented results.

Appendix F Additional Results

Table 5: Comparison of our architectures (HYPA-DBGNN, HYPA-DBGNN^-, HYPA-DBGNN^E, HYPA-DBGNN^Z) with different machine learning models. The balanced accuracy is given in Table 1, Table 2 and Table 4. The results are obtained as described in Section 5. The best results are marked.

Data Set	Model	F1-score-macro	Precision-macro	Recall-macro
Highschool2011	EVO	39.51 ± 11.50	39.38 ± 19.64	43.68 ± 10.91
	HONEM	57.54 ± 11.52	58.19 ± 13.09	59.00 ± 10.61
	DeepWalk	53.70 ± 18.55	53.47 ± 19.61	54.64 ± 17.70
	Node2Vec	53.70 ± 18.55	53.47 ± 19.61	54.64 ± 17.70
	GCN	48.55 ± 15.49	49.45 ± 18.52	55.00 ± 13.37
	LGNN	52.66 ± 14.71	53.57 ± 15.97	57.72 ± 9.85
	DBGNN	57.08 ± 11.35	61.78 ± 10.75	61.54 ± 11.13
	HYPA-DBGNN	59.60 ± 15.04	62.55 ± 14.38	63.25 ± 16.18
	HYPA-DBGNN^-	55.92 ± 17.41	56.85 ± 16.26	57.67 ± 17.16
	HYPA-DBGNN^E	57.30 ± 15.77	63.29 ± 14.85	61.54 ± 13.62
	HYPA-DBGNN^Z	49.63 ± 17.56	52.23 ± 19.82	53.97 ± 17.59
Highschool2012	EVO	46.83 ± 9.44	47.97 ± 18.15	50.05 ± 7.30
	HONEM	50.58 ± 9.49	53.89 ± 15.27	50.49 ± 9.31
	DeepWalk	48.79 ± 13.02	49.75 ± 13.77	49.65 ± 12.97
	Node2Vec	48.79 ± 13.02	49.75 ± 13.77	49.65 ± 12.97
	GCN	54.53 ± 10.82	56.94 ± 12.00	59.35 ± 11.13
	LGNN	45.32 ± 16.88	51.43 ± 14.63	51.43 ± 17.94
	DBGNN	60.22 ± 13.73	63.18 ± 12.57	64.93 ± 15.26
	HYPA-DBGNN	60.58 ± 12.12	66.23 ± 13.01	66.41 ± 10.24
	HYPA-DBGNN^-	61.26 ± 16.13	64.37 ± 15.44	64.49 ± 15.27
	HYPA-DBGNN^E	61.53 ± 17.30	64.22 ± 15.56	64.94 ± 17.71
	HYPA-DBGNN^Z	56.00 ± 15.24	58.46 ± 14.32	59.63 ± 15.74
Hospital	EVO	20.05 ± 6.64	19.12 ± 9.20	25.00 ± 7.86
	HONEM	34.88 ± 18.22	36.88 ± 23.53	37.50 ± 17.35
	DeepWalk	20.00 ± 9.53	18.76 ± 9.68	23.89 ± 10.91
	Node2Vec	20.00 ± 9.53	18.76 ± 9.68	23.89 ± 10.91
	GCN	37.38 ± 8.67	33.83 ± 8.00	43.47 ± 9.03
	LGNN	35.81 ± 8.96	32.75 ± 10.64	44.03 ± 9.03
	DBGNN	47.87 ± 20.02	48.21 ± 21.79	51.67 ± 20.34
	HYPA-DBGNN	71.80 ± 19.18	71.50 ± 20.95	74.31 ± 17.45
	HYPA-DBGNN^-	51.91 ± 20.77	50.83 ± 22.33	55.00 ± 20.49
	HYPA-DBGNN^E	52.08 ± 13.10	52.25 ± 13.41	59.03 ± 12.72
	HYPA-DBGNN^Z	65.66 ± 13.39	66.79 ± 16.20	69.31 ± 11.74
StudentSMS	EVO	54.62 ± 7.73	55.63 ± 9.53	55.05 ± 6.39
	HONEM	52.46 ± 9.71	55.65 ± 14.29	53.81 ± 7.28
	DeepWalk	52.08 ± 7.19	53.18 ± 7.61	52.78 ± 7.83
	Node2Vec	51.87 ± 7.39	52.13 ± 6.90	52.31 ± 7.70
	GCN	53.85 ± 6.39	54.39 ± 6.27	54.50 ± 6.40
	LGNN	46.79 ± 5.27	52.70 ± 6.07	52.71 ± 6.63
	DBGNN	56.87 ± 5.05	58.55 ± 5.58	57.72 ± 5.29
	HYPA-DBGNN	60.47 ± 6.68	61.40 ± 7.00	60.66 ± 6.11
	HYPA-DBGNN^-	54.58 ± 9.12	55.66 ± 8.88	56.23 ± 10.41
	HYPA-DBGNN^E	59.31 ± 9.08	59.97 ± 9.24	60.46 ± 9.42
	HYPA-DBGNN^Z	52.60 ± 6.74	54.24 ± 9.03	53.45 ± 7.50
Workplace2016	EVO	22.74 ± 12.34	21.84 ± 14.18	26.50 ± 12.08
	HONEM	77.75 ± 11.70	79.53 ± 13.50	79.46 ± 10.32
	DeepWalk	17.23 ± 8.77	16.30 ± 9.42	20.54 ± 9.51
	Node2Vec	17.23 ± 8.77	16.30 ± 9.42	20.54 ± 9.51
	GCN	68.56 ± 14.78	66.21 ± 16.88	73.33 ± 12.60
	LGNN	82.96 ± 15.65	84.32 ± 15.04	84.83 ± 14.77
	DBGNN	81.16 ± 19.16	81.33 ± 20.14	84.42 ± 15.59
	HYPA-DBGNN	85.82 ± 12.23	85.42 ± 13.75	88.29 ± 10.51
	HYPA-DBGNN^-	82.75 ± 14.26	83.25 ± 15.21	84.71 ± 13.66
	HYPA-DBGNN^E	86.47 ± 16.28	86.00 ± 17.36	88.50 ± 13.57
	HYPA-DBGNN^Z	87.67 ± 11.99	88.83 ± 13.10	88.42 ± 10.88