MuGSI: Distilling GNNs with Multi-Granularity Structural Information for Graph Classification

Tianjun Yao (ORCID 0009-0006-0553-2809), Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Jiaqi Sun (ORCID 0000-0001-5776-9564), Carnegie Mellon University, Pittsburgh, PA, USA
Defu Cao (ORCID 0000-0003-0240-3818), University of Southern California, Los Angeles, California, USA
Kun Zhang (ORCID 0000-0002-0738-9958), Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Carnegie Mellon University, Pittsburgh, PA, USA
Guangyi Chen (ORCID 0000-0001-7542-5378), Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Carnegie Mellon University, Pittsburgh, PA, USA
(2024)
ABSTRACT.

Recent works have introduced GNN-to-MLP knowledge distillation (KD) frameworks to combine the superior performance of GNNs with the fast inference speed of MLPs. However, existing KD frameworks are primarily designed for node classification within single graphs, leaving their applicability to graph classification largely unexplored. Two main challenges arise when extending KD from node classification to graph classification: (1) the inherent sparsity of learning signals, because soft labels are generated at the graph level; (2) the limited expressiveness of student MLPs, especially on datasets with limited input feature spaces. To overcome these challenges, we introduce MuGSI, a novel KD framework that employs Multi-granularity Structural Information for graph classification. Specifically, we propose a multi-granularity distillation loss in MuGSI to tackle the first challenge. This loss function is composed of three distinct components: graph-level distillation, subgraph-level distillation, and node-level distillation. Each component targets a specific granularity of the graph structure, ensuring a comprehensive transfer of structural knowledge from the teacher model to the student model. To tackle the second challenge, MuGSI incorporates a node feature augmentation component, thereby enhancing the expressiveness of the student MLPs and making them more capable learners. We perform extensive experiments across a variety of datasets and different teacher/student model architectures. The experimental results demonstrate the effectiveness, efficiency, and robustness of MuGSI. Code is publicly available at: https://github.com/tianyao-aka/MuGSI.

Graph neural networks, Knowledge distillation, Graph classification
Published in: Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore. Copyright: rights retained. DOI: 10.1145/3589334.3645542. ISBN: 979-8-4007-0171-9/24/05.
CCS Concepts: Computing methodologies → Machine learning approaches; Computing methodologies → Neural networks.

1. Introduction

In recent years, Graph Neural Networks (GNNs) have emerged as a powerful tool for graph-structured data and have consistently achieved superior performance in graph-related tasks in a variety of domains, such as bioinformatics (Gasteiger et al., 2021), social network analysis (Fan et al., 2019) and personalized recommendation (He et al., 2020). GNNs are therefore highly relevant to the topic of Graph Algorithms and Modelling for the Web.

To facilitate the deployment of latency-sensitive applications, several works (Zhang et al., 2022; Tian et al., 2023; Wu et al., 2023a, b) employ Knowledge Distillation (KD) (Hinton et al., 2015) to transfer the learned knowledge from a well-trained teacher GNN model to a student MLP model, combining the GNN’s superior performance with the MLP’s fast inference speed. However, existing GNN-to-MLP KD methods mainly focus on node classification, and their application to graph classification is largely overlooked. This gap is significant because KD for graph classification presents unique challenges that are fundamentally distinct from those in node classification: (1) Sparse learning signals. For node classification, dense learning signals can be generated through node-level gradient updates using soft labels, especially for large-scale graphs that consist of thousands or even millions of nodes. Conversely, graph classification inherently provides sparse learning signals, as soft labels are obtained at the level of entire graphs, making the KD process for graph classification more challenging; (2) Limited expressive power of MLPs. Previous work (Zhang et al., 2022; Chen et al., 2021) has established that a key factor in the success of KD for node classification is the small gap in the number of equivalence classes generated by GNNs and MLPs, which results from the enormous input feature space of real-world node classification datasets (more details can be found in Appendix D of (Zhang et al., 2022)). However, this condition is often not met in graph classification tasks due to the limited input feature space, which severely limits the expressive power and learning capability of student MLPs. The empirical results in Table 1 align with this analysis: due to the outlined challenges, a GNN-to-MLP KD framework that is effective for node classification yields only slight gains for graph classification. Here, we adopt $\mathrm{GLNN}_{\mathrm{MLP}}$ as the KD framework; our implementation is similar to that of GLNN (Zhang et al., 2022), except that a graph pooling function is used to obtain a graph-level representation.

| Model | PROTEINS | BZR | DD | IMDB-B |
|---|---|---|---|---|
| GIN | 79.25±3.22 | 93.09±1.89 | 77.67±2.86 | 79.60±3.02 |
| MLP | 72.61±2.98 | 79.26±1.50 | 73.59±2.90 | 77.11±2.76 |
| $\mathrm{GLNN}_{\mathrm{MLP}}$ | 72.96±2.54 | 79.51±1.94 | 74.49±2.94 | 77.58±3.27 |

Table 1. Experiment results for the soft logits-based KD method. Here the student is an MLP and the teacher is GIN (Xu et al., 2019). Details about the experiment setting can be found in Section 5.1.

Present Work. In this work, we introduce a novel Knowledge Distillation framework titled MuGSI (Multi-Granularity Structural Information for Graph distillation) to address the aforementioned challenges, namely sparse learning signals and the limited expressive power of MLPs. (1) To tackle the first challenge, we propose a multi-granularity distillation loss that aligns multiple distributions across various scales of graph structure between the teacher model and the student model (as discussed in Appendix A.4). Our intuition is that both local and global structural information play a critical role in graph classification: GNNs first encode a rooted subtree for each node to capture local substructures, and a graph pooling function then produces a whole-graph representation that captures global structure. The proposed multi-granularity distillation loss in MuGSI is composed of three distinct components: graph-level distillation, subgraph-level distillation, and node-level distillation. Each component targets a specific granularity of the graph structure, ensuring a comprehensive transfer of structural knowledge from the teacher model to the student model. By leveraging this multi-granularity approach, we can provide dense learning signals during the KD process and facilitate the effective transfer of structural knowledge. (2) To tackle the second challenge, MuGSI incorporates a node feature augmentation component, thereby enlarging the input feature space and enhancing the expressiveness of the student MLPs to make them more capable learners. We further utilize a specific type of Graph-Augmented MLP (GA-MLP) as a more expressive student. Notably, the time complexity of the GA-MLP is almost identical to that of a traditional MLP.

Motivation. Our work also reveals the multifaceted advantages of employing KD for graph classification, addressing key challenges in computational efficiency, robustness, and resource constraints: (1) A recent line of work aims to improve model expressiveness (Bouritsas et al., 2022; Dwivedi et al., 2021; Morris et al., 2019; Maron et al., 2018, 2019; Bodnar et al., 2021; Maron et al., 2020; Kondor et al., 2018; Vignac et al., 2020; Thiede et al., 2021; Bevilacqua et al., 2021; Zhang and Li, 2021; Zhao et al., 2021; You et al., 2021), but these models are usually costly in computation time and memory. An effective KD framework can mitigate these issues by training a lightweight student model that retains, or even surpasses, the performance of a more complex teacher model. (2) Graphs often change dynamically, leading to distribution shifts that can adversely affect model performance at test time. Our experiments validate that an effective KD framework can serve as a potent technique to address test-time distribution shifts. (3) In dynamic environments, student MLP-type models enable incremental computation, thus significantly improving inference speed, which facilitates inference on CPU machines and in environments with limited computational resources.

Our contributions can be summarized as follows:

  • We identify an under-explored problem: the GNN-to-MLP distillation for graph classification. Furthermore, we offer an analysis explaining why existing GNN-to-MLP KD frameworks are suboptimal for graph classification tasks.

  • We propose MuGSI, the first GNN-to-MLP KD framework for graph classification to the best of our knowledge, which facilitates efficient structural knowledge distillation at multiple granularities.

  • We perform extensive experiments across a variety of datasets, where the results validate MuGSI’s effectiveness, efficiency, and robustness. Additionally, MuGSI effectively addresses test-time distribution shifts and enables efficient inference in dynamic settings, with the student GA-MLP model being 17.18x faster than the teacher GIN model.

2. PRELIMINARY

2.1. Notations and Problem Definition

We use $\{\}$ to denote sets. The index set is denoted as $[n] := \{1, \cdots, n\}$. Throughout this paper, we consider simple undirected graphs $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{v_1, \ldots, v_n\}$ is the node set and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set. For a node $u$, denote its neighbors as $\mathcal{N}(u) := \{v \in \mathcal{V} : \{u, v\} \in \mathcal{E}\}$. $G_u^K$ is the node-induced $K$-hop ego-network where the central node is $u$.

In the context of graph classification tasks, the input is typically represented as a set of graphs, where each graph $G_i$ is characterized by its node set $\mathcal{V}_i$, edge set $\mathcal{E}_i$, a node feature matrix $\mathbf{X}_i \in \mathbb{R}^{N_i \times D}$, and an adjacency matrix $\mathbf{A}_i \in \mathbb{R}^{N_i \times N_i}$. Here, $N_i$ is the total number of nodes in graph $G_i$, and $D$ is the dimensionality of the node features. The node feature matrix $\mathbf{X}_i$ represents the attributes of nodes in graph $G_i$. Each row $\mathbf{x}_{i,v}$ corresponds to the $D$-dimensional feature vector of a node $v \in \mathcal{V}_i$. The adjacency matrix $\mathbf{A}_i$ describes the structure of the graph, where $\mathbf{A}_i[u,v] = 1$ if an edge $(u,v)$ exists in $\mathcal{E}_i$, and $\mathbf{A}_i[u,v] = 0$ otherwise. $G_{i,u}^K$ is the node-induced $K$-hop ego-network in graph $G_i$ where the central node is $u$, and $\mathbf{X}_i^{[u]}$ denotes the feature matrix for the involved nodes in $G_{i,u}^K$. The prediction targets for graph classification tasks are represented as $\mathbf{Y} \in \mathbb{R}^{N \times K}$, where $N$ is the number of graphs in the dataset, and $K$ is the number of classes. Each row $\mathbf{y}_i$ in $\mathbf{Y}$ is a one-hot vector representing the true class of graph $G_i$.

The entire dataset $\mathcal{D} = \{G_i, \mathbf{y}_i\}_{i=1}^{N}$ is divided into a training and validation set $\mathcal{D}_L = \{G_i, \mathbf{y}_i\}_{i=1}^{N_L}$ and a test set $\mathcal{D}_U = \{G_i\}_{i=N_L+1}^{N}$, where $N_L$ is the number of graphs in the training/validation set. In the training/validation phase, our goal is to learn a mapping function $\Phi : G_i \rightarrow \mathbf{y}_i, \forall i \in \{1, \ldots, N_L\}$, using the labeled set $\mathcal{D}_L$. Once learned, the function $\Phi$ is expected to predict the true class labels of the unlabeled graphs in the test set $\mathcal{D}_U$.
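The node-induced $K$-hop ego-network from the notation above can be made concrete with a breadth-first search. The following is a minimal sketch, not the authors' implementation; the adjacency-dictionary input format and the function name `k_hop_ego_network` are assumptions made for illustration.

```python
from collections import deque

def k_hop_ego_network(adj, u, K):
    """Return (nodes, edges) of the node-induced K-hop ego-network around u.

    adj: dict mapping each node to the set of its neighbors
         (a simple undirected graph; this input format is an assumption).
    """
    dist = {u: 0}                 # hop distance from the central node u
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == K:          # do not expand beyond K hops
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    nodes = set(dist)
    # Node-induced: keep every edge of the graph whose endpoints are both kept.
    edges = {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}
    return nodes, edges

# Path graph 0-1-2-3-4; the 1-hop ego-network of node 2 is the path 1-2-3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
nodes, edges = k_hop_ego_network(adj, 2, 1)
```

Edges are stored as `frozenset` pairs so that `{u, v}` and `{v, u}` are the same edge, matching the undirected setting.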

2.2. Graph Neural Networks

In this paper, we focus on message-passing GNNs, where the representation $\mathbf{h}_v^{(l)}$ of each node $v$ in a graph $G$ is iteratively updated by aggregating information from its neighbors $\mathcal{N}(v)$. For the $l$-th layer, the updated representation is obtained via an AGGREGATE operation followed by an UPDATE operation:

(1) $\mathbf{m}_v^{(l)} = \text{AGGREGATE}^{(l)}\left(\left\{\mathbf{h}_u^{(l-1)} : u \in \mathcal{N}(v)\right\}\right)$
(2) $\mathbf{h}_v^{(l)} = \text{UPDATE}^{(l)}\left(\mathbf{h}_v^{(l-1)}, \mathbf{m}_v^{(l)}\right),$

where $\mathbf{h}_v^{(0)} = \mathbf{x}_v$ is the initial node feature of node $v$ in graph $G$. For graph classification tasks, GNNs employ a READOUT function to aggregate the final layer node features $\left\{\mathbf{h}_v^{(L)} : v \in \mathcal{V}\right\}$ into a graph-level representation $\mathbf{h}_G$:

(3) $\mathbf{h}_G = \operatorname{READOUT}\left(\left\{\mathbf{h}_v^{(L)} : v \in \mathcal{V}\right\}\right).$

This graph-level representation is used for graph classification.
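As an illustration of Eqs. 1 to 3, the sketch below runs one message-passing layer followed by a readout. It assumes sum aggregation, an UPDATE that simply adds the message to the node's own feature (no learnable weights), and sum pooling as READOUT; these are choices made for the example only, and the function names are hypothetical.

```python
def message_passing_layer(adj, h):
    """One layer: m_v = sum of neighbor features (AGGREGATE),
    h_v' = h_v + m_v (a weight-free UPDATE, assumed for illustration)."""
    new_h = {}
    for v, h_v in h.items():
        m_v = [0.0] * len(h_v)
        for u in adj[v]:
            m_v = [a + b for a, b in zip(m_v, h[u])]
        new_h[v] = [a + b for a, b in zip(h_v, m_v)]
    return new_h

def readout(h):
    """Sum pooling over all node representations (assumed READOUT)."""
    dim = len(next(iter(h.values())))
    h_G = [0.0] * dim
    for h_v in h.values():
        h_G = [a + b for a, b in zip(h_G, h_v)]
    return h_G

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
x = {v: [1.0] for v in adj}            # constant initial features h^(0)
h1 = message_passing_layer(adj, x)     # each h_v becomes 1 + degree(v)
h_G = readout(h1)                      # graph-level representation
```

With constant input features, one layer already encodes node degrees, which shows how structural information enters the node representations even when the raw features are uninformative.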

2.3. Graph Augmented Multi-Layer Perceptrons

GA-MLP (Graph-Augmented Multi-Layer Perceptron) models (Chen et al., 2021) are a class of graph neural networks designed to capture graph structure while improving computational efficiency. These models operate in two primary steps: augmenting node features with linear operators based on the graph topology, and applying a node-wise learnable function. Formally, given a set of linear operators $\Omega = \left\{\omega_1(\mathbf{A}), \ldots, \omega_k(\mathbf{A})\right\} \subseteq \mathbb{R}^{|V| \times |V|}$, derived from the adjacency matrix $\mathbf{A}$, a GA-MLP first computes augmented features

(4) $\tilde{\mathbf{X}}_k = \omega_k(\mathbf{A}) \cdot \varphi(\mathbf{X}) \in \mathbb{R}^{n \times \tilde{d}},$

where $\varphi : \mathbb{R}^{d} \rightarrow \mathbb{R}^{\tilde{d}}$ is a learnable feature transformation, often realized by an MLP. The model then concatenates these features to get $\tilde{\mathbf{X}}$ and applies a learnable node-wise function $\rho$ to compute the final representation

(5) $Z = \rho(\tilde{\mathbf{X}}) \in \mathbb{R}^{n \times d^{\prime}}.$

A simplified version of GA-MLP takes $\varphi$ as the identity function, allowing pre-computation of the matrix products and thus improving computational efficiency. Various GA-MLP-type models, including SGC (Wu et al., 2019), GFN (Chen et al., 2020), gfNN (NT and Maehara, 2019), and SIGNs (Frasca et al., 2020), have been proposed, showing competitive performance on diverse datasets.
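The pre-computation in the simplified GA-MLP can be sketched as follows. The example assumes $\varphi$ is the identity and picks the operator family $\omega_k(\mathbf{A}) = \mathbf{A}^k$ for $k = 0, \ldots, \text{num\_hops}$; this operator choice and the helper names are assumptions for illustration, and the node-wise function $\rho$ (for example an MLP) would be applied to the result afterwards.

```python
def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def precompute_augmented_features(A, X, num_hops):
    """Concatenate [X, A X, A^2 X, ..., A^num_hops X] along the feature axis.

    Assumes phi is the identity and omega_k(A) = A^k, so nothing here is
    learnable and the result can be computed once before training.
    """
    blocks = [X]
    current = X
    for _ in range(num_hops):
        current = matmul(A, current)   # next power of A applied to X
        blocks.append(current)
    X_tilde = []
    for i in range(len(X)):
        row = []
        for blk in blocks:
            row.extend(blk[i])         # concatenate the per-operator features
        X_tilde.append(row)
    return X_tilde

# Path graph 0-1-2 with a constant one-dimensional feature per node.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1], [1], [1]]
X_tilde = precompute_augmented_features(A, X, num_hops=2)
```

Because the augmented matrix is fixed, inference reduces to a node-wise function over rows of `X_tilde`, which is the source of the efficiency described above.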

3. RELATED WORK

Recently, Knowledge Distillation has proven to be effective for graph learning. Some previous works (Lassance et al., 2019; Zhang et al., 2023; Ren et al., 2022; Joshi et al., 2022; Wu et al., 2022; Zhang et al., 2020; Yan et al., 2020; Feng et al., 2022b; Guo et al., 2023) have explored the distillation of knowledge from large teacher GNNs to smaller student GNNs. To further reduce inference time and enable real-time applications, some recent works explore GNN-to-MLP knowledge distillation. GLNN (Zhang et al., 2022) adopts a soft logits-based KD method, which achieves predictive performance comparable to teacher GNN models while significantly reducing inference time. KRD (Wu et al., 2023a) explores the reliability of different knowledge points in GNNs and the diversity of roles they play in the distillation process, and leverages reliable knowledge points to provide additional supervision signals. NOSMOG (Tian et al., 2023) introduces three key components: the incorporation of position features, representational similarity distillation, and adversarial feature augmentation, which together enhance the predictive performance of the student MLP compared to the vanilla soft logits-based KD method. FF-G2M (Wu et al., 2023b) leverages both low-frequency and high-frequency components extracted from a single graph for full-frequency knowledge distillation. However, these methods are primarily designed for node classification and mainly operate on a single graph, and it is not straightforward to adapt them to graph classification. In this work, we propose the first GNN-to-MLP KD framework for graph classification to bridge this gap.

4. PROPOSED FRAMEWORK

In this section, we present the details of MuGSI, which consists of three key components: graph-level distillation, subgraph-level distillation, and node-level distillation. The overall framework of MuGSI is illustrated in Figure 1.

Figure 1. The figure illustrates the KD process with multi-granularity distillation loss. First a teacher GNN model is pre-trained, then an MLP-type student model is trained using the distilled multi-granularity structural knowledge from the teacher model: (a) whole-graph distillation loss $\mathcal{L}_{\mathcal{G}}$; (b) inter-cluster distillation loss $\mathcal{L}_{\mathcal{C}}$; (c) path-consistency loss $\mathcal{L}_{\mathcal{P}}$. Note that the soft logits distillation loss $\mathcal{L}_{SL}$ and the ground-truth cross-entropy loss $\mathcal{L}_{GT}$ are not shown in the figure.

4.1. Graph-Level Distillation

Currently, most GNN-to-MLP KD frameworks build upon response-based knowledge that relies on the output of the last layer, i.e., soft logits. However, this approach ignores intermediate-level supervision from the teacher model, which is important for representation learning because deep neural networks are good at learning multiple levels of feature representation with increasing abstraction (Bengio et al., 2014). Hence we resort to intermediate layers, i.e., feature maps, as additional supervision signals, which serve as a good extension of the soft logits-based KD approach. In MuGSI, we employ the graph-level representation $h_G^T$ as a direct supervision signal from the teacher model for the student to emulate, because $h_G^T$ may encapsulate latent information that is concealed in soft logits. The whole-graph distillation loss can be formulated as follows:

(6) $\mathcal{L}_{\mathcal{G}} = \mathbb{E}_{G_i \sim \mathcal{D}_L} \left\lVert \frac{h_{G_i}^T}{\lVert h_{G_i}^T \rVert_2} - \frac{h_{G_i}^S}{\lVert h_{G_i}^S \rVert_2} \right\rVert_2^2,$

where $h_{G_i}^T$ denotes the graph-level representation in the teacher model, and $h_{G_i}^S$ refers to the corresponding representation in the student model. The L2 norm, denoted by $\|\cdot\|_2$, measures the dissimilarity between these representations, thus driving the student to align with the teacher’s graph-level representation.
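A minimal sketch of Eq. 6 over a batch of graphs is given below. It assumes that the teacher and student graph-level representations have the same dimension and are non-zero, and it replaces the expectation by a batch mean; the function names are hypothetical.

```python
import math

def l2_normalize(h):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h]

def whole_graph_distillation_loss(teacher_reprs, student_reprs):
    """Mean squared L2 distance between normalized teacher and student
    graph-level representations (the expectation in Eq. 6 is approximated
    by the mean over the given batch)."""
    total = 0.0
    for h_t, h_s in zip(teacher_reprs, student_reprs):
        t, s = l2_normalize(h_t), l2_normalize(h_s)
        total += sum((a - b) ** 2 for a, b in zip(t, s))
    return total / len(teacher_reprs)

# Same direction, different scale: normalization removes the scale.
loss_aligned = whole_graph_distillation_loss([[3.0, 4.0]], [[6.0, 8.0]])
# Orthogonal directions: squared distance between unit vectors is 2.
loss_orthogonal = whole_graph_distillation_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Since both vectors are normalized, the per-graph term equals $2 - 2\cos\theta$, where $\theta$ is the angle between the teacher and student representations, so the loss depends only on direction and not on magnitude.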

4.2. Subgraph-Level Distillation

While whole-graph level representations provide meaningful learning signals for MuGSI, a more nuanced understanding of the structural information can be attained through subgraph-level distillation. Previous work in Computer Vision has found that the attention maps of hidden activations across image patches tend to have spatial correlations with predicted objects on the image level, and these correlations also tend to be higher in networks with higher accuracy (Zagoruyko and Komodakis, 2017).

In the context of graph-structured data, the concept of an image "patch" can be naturally analogized to a subgraph. This raises an important question: what type of subgraph should be selected as the underlying structure for a given graph? In this work, we elect to use clusters as the defining subgraphs, recognizing their essential role in understanding complex graph structures. For example, clusters in IMDB-BINARY (Yanardag and Vishwanathan, 2015) may correspond to groups of actors who frequently co-star in the same films. In REDDIT-BINARY (Yanardag and Vishwanathan, 2015), clustering nodes (users) can reveal community structures or groups of users that interact more frequently with each other. This could reflect shared opinions, interests, or other social dynamics within that specific thread. In other scenarios, such as bioinformatics, clusters do not necessarily have as straightforward an interpretation as they might in social networks, but they could still be used to identify structural motifs or common substructures within a molecule, depending on the features used. This suggests that clusters as graph "patches" can provide valuable information for graph classification.

In MuGSI, we maximize the inter-cluster similarity by leveraging the kernel matrix $\mathbf{K}$, which embodies pairwise interactions among clusters and allows for describing the geometry of the corresponding feature spaces (Hinton and Roweis, 2002). Specifically, given two clusters $i$ and $j$, let $\mathcal{C}_i$ and $\mathcal{C}_j$ denote the node sets belonging to cluster $i$ and $j$ respectively. We can then calculate the subgraph-level representations $\mathbf{h}_{\mathcal{C}_i}$ and $\mathbf{h}_{\mathcal{C}_j}$ similarly to Eq. 3, i.e., $\mathbf{h}_{\mathcal{C}_i} = \operatorname{READOUT}\left(\left\{\mathbf{h}_v^{(L)} : v \in \mathcal{C}_i\right\}\right)$. A kernel matrix $\mathbf{K} \in \mathbb{R}^{N_{\mathcal{C}} \times N_{\mathcal{C}}}$ is obtained, where $N_{\mathcal{C}}$ denotes the number of clusters in the given graph $G$, and each element is $k_{ij} = k\bigl(\mathbf{h}_{\mathcal{C}_i}, \mathbf{h}_{\mathcal{C}_j}\bigr)$. Here $k(\cdot,\cdot)$ is a kernel function that projects the sample vectors into a higher or infinite dimensional feature space. We use Cosine Similarity as the kernel function, i.e.,

(7) $$k_{ij} = k\bigl(\mathbf{h}_{\mathcal{C}_i}, \mathbf{h}_{\mathcal{C}_j}\bigr) = \frac{\bigl\langle \mathbf{h}_{\mathcal{C}_i}, \mathbf{h}_{\mathcal{C}_j} \bigr\rangle}{\left\|\mathbf{h}_{\mathcal{C}_i}\right\|_2 \cdot \left\|\mathbf{h}_{\mathcal{C}_j}\right\|_2}.$$

We then define the inter-cluster distillation loss $\mathcal{L}_{\mathcal{C}}$ as follows:

(8) $$\mathcal{L}_{\mathcal{C}} = \left\|\mathbf{K}_S - \mathbf{K}_T\right\|_F^2.$$

Here $\mathbf{K}_S$ and $\mathbf{K}_T$ are the kernel matrices obtained from the student and teacher models, respectively, i.e., $\mathbf{K}_S[i,j] = k\left(\mathbf{h}_{\mathcal{C}_i}^S, \mathbf{h}_{\mathcal{C}_j}^S\right)$ and $\mathbf{K}_T[i,j] = k\left(\mathbf{h}_{\mathcal{C}_i}^T, \mathbf{h}_{\mathcal{C}_j}^T\right)$.
Here $\mathbf{h}_{\mathcal{C}_i}^S$ and $\mathbf{h}_{\mathcal{C}_i}^T$ are the cluster-level representations of cluster $i$ obtained from the student and teacher models, respectively.
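To make Eqs. 7 and 8 concrete, the following is a minimal NumPy sketch of the inter-cluster distillation loss. It is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, mean pooling stands in for the unspecified READOUT, the cluster assignment is taken as given, and every cluster is assumed to be non-empty.

```python
import numpy as np

def cluster_readout(H, assignment, num_clusters):
    """Mean-pool node embeddings H (n x d) into cluster embeddings (num_clusters x d).

    Mean pooling is an assumption; any permutation-invariant READOUT could be used.
    """
    out = np.zeros((num_clusters, H.shape[1]))
    for c in range(num_clusters):
        out[c] = H[assignment == c].mean(axis=0)
    return out

def cosine_kernel(Hc, eps=1e-12):
    """Pairwise cosine similarity between the rows of Hc (Eq. 7)."""
    norms = np.linalg.norm(Hc, axis=1, keepdims=True)
    Hn = Hc / np.maximum(norms, eps)
    return Hn @ Hn.T

def inter_cluster_loss(H_student, H_teacher, assignment, num_clusters):
    """Squared Frobenius distance between student and teacher kernel matrices (Eq. 8)."""
    K_s = cosine_kernel(cluster_readout(H_student, assignment, num_clusters))
    K_t = cosine_kernel(cluster_readout(H_teacher, assignment, num_clusters))
    return float(np.sum((K_s - K_t) ** 2))

# Toy graph: 6 nodes in 3 clusters; teacher and student use different widths.
rng = np.random.default_rng(0)
assignment = np.array([0, 0, 1, 1, 2, 2])
H_teacher = rng.normal(size=(6, 8))
H_student = rng.normal(size=(6, 4))
loss = inter_cluster_loss(H_student, H_teacher, assignment, 3)
```

Because both kernel matrices have shape $N_{\mathcal{C}} \times N_{\mathcal{C}}$, the student and teacher embeddings do not need to share a hidden dimension in this sketch.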

4.3. Node-Level Distillation

Graph neural network’s success in graph classification is closely related to the Weisfeiler-Lehman (1-WL) algorithm. By iteratively aggregating neighboring node features to a center node, both 1-WL and GNN obtain a node representation that encodes a rooted subtree around the center node. These rooted subtree representations are then pooled into a single representation to represent the whole graph (Zhang and Li, 2021).
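As an illustration of the aggregation described above, the following is a minimal sketch of 1-WL colour refinement on a single graph. The function names and the dictionary-based graph format are assumptions made for this example; colours are relabelled per graph, so comparing two graphs would require a shared relabelling.

```python
def wl_refine(adj, labels, iterations):
    """Run 1-WL colour refinement.

    adj: dict mapping each node to a list of its neighbours.
    labels: dict mapping each node to an initial integer label.
    """
    colors = dict(labels)
    for _ in range(iterations):
        # A node's signature is its own colour plus the multiset of neighbour colours.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Relabel each distinct signature with a compact integer colour.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

def wl_histogram(adj, labels, iterations):
    """Pool the refined node colours into a graph-level colour histogram."""
    hist = {}
    for c in wl_refine(adj, labels, iterations).values():
        hist[c] = hist.get(c, 0) + 1
    return hist

# Path graph 0 - 1 - 2 with uniform initial labels.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: 0, 1: 0, 2: 0}
colors = wl_refine(adj, labels, 2)
```

After refinement, the two end nodes share a colour while the middle node receives a different one, reflecting their different rooted subtrees.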

Hence, to obtain a discriminative representation for the whole graph, it is necessary to learn a "good" representation for each node $v$ that captures its local substructure. Let $\mathbf{H}_T = f_T(\mathbf{X}, \mathbf{A})$ and $\mathbf{H}_S = f_S(\mathbf{X}, (\mathbf{A}))$ denote the node representations for a given graph $G$ obtained from the teacher model and the student model, respectively, where $(\mathbf{A})$ means that $\mathbf{A}$ is optional depending on the choice of $f_S$, and let $\mathbf{h}_v^T$ and $\mathbf{h}_v^S$ denote the representations of node $v$ from the teacher and student models. If we assume $f_T(\cdot)$ is more expressive than $f_S(\cdot)$, then $\mathbf{h}_v^T$ should reflect the local substructure of node $v$ more accurately than $\mathbf{h}_v^S$.
As the local substructure of every node $v \in \mathcal{V}$ is essential for graph classification, we propose a novel node-level component in MuGSI to transfer the local structural knowledge from the teacher to the student model. This is done by maximizing the agreement between the teacher and student models on the similarity of nodes in a local neighborhood. Specifically, for each node $v$, let $\mathcal{P}_v^K$ be a collection of $K$-step random-walk paths starting from node $v$, and denote a single path drawn from $\mathcal{P}_v^K$ by $p_v := \left(p_v^1, \cdots, p_v^K\right)$. To measure the similarity between node $v$ and its neighboring nodes along the random-walk path $p_v$, we define the following conditional probability $p(u \mid v)$ for the teacher model:

(9) $$p(u \mid v) = \frac{e^{h_u^{T} h_v}}{\sum_{w \in p_v} e^{h_w^{T} h_v}}, \quad u \in p_v,$$

and $q(u \mid v)$ is similarly defined for the student model. The path consistency distillation loss is defined as follows:

(10) $$\mathcal{L}_{\mathcal{P}} = \mathbb{E}_{v \sim \mathcal{V}} \, \mathbb{E}_{p_v \sim \mathcal{P}_v^K} \, \mathcal{D}_{KL}\bigl(p(u \mid v), q(u \mid v)\bigr).$$
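Eqs. 9 and 10 can be sketched as follows. This is a hedged NumPy illustration, not the released code: the function names, the adjacency-list format, uniform random walks, and the Monte Carlo estimate with `walks_per_node` samples are all assumptions. A walk may revisit nodes, in which case the softmax here runs over path positions, and every node is assumed to have at least one neighbour.

```python
import numpy as np

def random_walk(adj, start, K, rng):
    """Sample a K-step uniform random walk from `start`; returns the K visited nodes."""
    path, cur = [], start
    for _ in range(K):
        cur = int(rng.choice(adj[cur]))
        path.append(cur)
    return path

def path_distribution(H, v, path):
    """Softmax over inner products between node v and the nodes on the path (Eq. 9)."""
    logits = H[path] @ H[v]
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def path_consistency_loss(H_teacher, H_student, adj, K, walks_per_node, rng, eps=1e-12):
    """Monte Carlo estimate of Eq. 10: mean KL(teacher || student) over sampled paths."""
    total, count = 0.0, 0
    for v in range(len(adj)):
        for _ in range(walks_per_node):
            path = random_walk(adj, v, K, rng)
            p = path_distribution(H_teacher, v, path)
            q = path_distribution(H_student, v, path)
            total += float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
            count += 1
    return total / count

# Toy graph with 4 nodes given as adjacency lists.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
rng = np.random.default_rng(0)
H_t = rng.normal(size=(4, 8))
H_s = rng.normal(size=(4, 4))
loss = path_consistency_loss(H_t, H_s, adj, K=3, walks_per_node=2, rng=rng)
```

Each sampled path only requires storing $K$ probabilities per model, which is the source of the memory advantage over a full $|\mathcal{V}| \times |\mathcal{V}|$ similarity matrix.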

Node feature augmentation. In addition to the node-level distillation of local substructures, the expressiveness bottleneck of the student model still needs to be addressed. As discussed in Section 1, the input feature space is typically very small for graph classification datasets, which severely limits the expressive power and learning capability of the student model. In MuGSI we therefore enhance the node features with structure-aware features. Specifically, we use Laplacian eigenvectors (Belkin and Niyogi, 2003) as node positional encodings, which have been shown to be effective across various message-passing GNNs (Dwivedi et al., 2022). To further address this issue, we propose using a 1-hop GA-MLP as a more expressive student model. Notably, an MLP is essentially a 0-hop GA-MLP. Although the expressive power of a GA-MLP is still exponentially lower than that of a GNN in terms of the number of equivalence classes (Chen et al., 2021), the student model can achieve results comparable or superior to those of the teacher GNN when combined with MuGSI for knowledge transfer.
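The feature augmentation step can be sketched as below. This is an illustration under assumptions: the symmetric normalized Laplacian is used (the text does not state which Laplacian is meant), the graph is assumed to be connected so that only the first eigenvector is trivial, and the function name and the number of eigenvectors `k` are hypothetical.

```python
import numpy as np

def laplacian_pe(A, k):
    """Return k eigenvectors of the symmetric normalized Laplacian as positional encodings.

    The eigenvector of the smallest eigenvalue is skipped, assuming a connected graph.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return eigvecs[:, 1:k + 1]

# 4-cycle with a single constant input feature per node.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.ones((4, 1))
pe = laplacian_pe(A, k=2)
X_aug = np.concatenate([X, pe], axis=1)  # augmented node features fed to the student
```

Note that eigenvectors are only defined up to sign (and up to rotation within a repeated eigenvalue), so the encodings are not unique for a given graph.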

Overall Framework. The final objective $\mathcal{L}$ of the proposed framework MuGSI is defined as a weighted combination of the ground-truth cross-entropy loss $\mathcal{L}_{GT}$, the soft-logits distillation loss $\mathcal{L}_{SL}$, and the multi-granularity distillation losses $\mathcal{L}_{\mathcal{G}}$, $\mathcal{L}_{\mathcal{C}}$ and $\mathcal{L}_{\mathcal{P}}$:

(11) $$\mathcal{L} = \mathcal{L}_{GT} + \mathcal{L}_{SL} + \lambda \mathcal{L}_{\mathcal{G}} + \mu \mathcal{L}_{\mathcal{C}} + \eta \mathcal{L}_{\mathcal{P}},$$

where $\lambda$, $\mu$ and $\eta$ are trade-off weights for balancing $\mathcal{L}_{\mathcal{G}}$, $\mathcal{L}_{\mathcal{C}}$ and $\mathcal{L}_{\mathcal{P}}$, respectively. For $\mathcal{L}_{SL}$, the weight is set to $1.0$ without any hyper-parameter tuning for MuGSI.
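As a small worked example of Eq. 11, the sketch below combines the five loss terms with the trade-off weights. The function name, the loss values, and the weight values are all made up for illustration and are not taken from the experiments.

```python
def total_loss(l_gt, l_sl, l_graph, l_cluster, l_path, lam, mu, eta):
    """Weighted combination of Eq. 11; the weights of l_gt and l_sl are fixed to 1.0."""
    return l_gt + l_sl + lam * l_graph + mu * l_cluster + eta * l_path

# Hypothetical per-batch loss values and weights.
total = total_loss(l_gt=0.5, l_sl=0.2, l_graph=0.1, l_cluster=0.3, l_path=0.4,
                   lam=1.0, mu=0.5, eta=0.1)
# 0.5 + 0.2 + 1.0 * 0.1 + 0.5 * 0.3 + 0.1 * 0.4 = 0.99
```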

5. NUMERICAL EXPERIMENTS

In this section, we extensively evaluate the effectiveness, efficiency, and robustness of the proposed framework MuGSI by investigating the following research questions.

RQ1: How does MuGSI perform for student MLPs?
RQ2: How does MuGSI perform for more expressive student architecture?
RQ3: How does MuGSI perform for different teachers?
RQ4: How robust and efficient is MuGSI in dynamic environments?
RQ5: How does each component perform in MuGSI?
RQ6: How do different hyper-parameters affect the performance of MuGSI?

Method PROTEINS BZR DD NCI1 IMDB-B REDDIT-B CIFAR10 MolHIV
GIN 79.25±3.22 93.09±1.89 77.67±2.86 82.43±1.12 79.60±3.02 91.35±1.58 55.57 76.43
MLP 72.61±2.98 79.26±1.50 73.59±2.90 59.56±1.46 77.11±2.76 80.81±2.36 51.57±0.19 65.31±1.49
MLP+LaPE 75.92±2.63 81.73±2.21 79.45±2.79 66.05±2.01 76.60±2.61 84.70±2.44 48.34±0.08 64.72±1.07
GLNN$_{\text{MLP}}$ 72.96±2.54 79.51±1.94 74.49±2.94 59.95±2.33 77.58±3.27 80.21±2.60 51.61±0.26 68.38±1.01
GLNN$_{\text{MLP+LaPE}}$ 76.74±4.50 82.47±2.38 79.36±3.03 66.93±1.32 77.59±3.27 84.86±3.97 49.34±0.34 67.56±0.52
NOSMOG$_{\text{MLP}}$ 76.71±3.79 84.41±4.46 79.96±3.04 68.29±2.07 77.02±4.43 84.61±2.78 48.49±0.31 64.56±1.76
MuGSI$_{\text{MLP}^{*}}$ (ours) 77.10±3.59 85.68±2.26 80.33±2.76 67.71±2.43 78.06±3.02 87.91±1.37 51.89±0.21 71.92±0.71
$\Delta_{\text{MLP}}$ 4.49(6.18%) 6.42(8.10%) 6.74(9.16%) 8.15(13.68%) 0.95(1.23%) 7.10(8.79%) 0.32(0.62%) 6.61(10.12%)
$\Delta_{\text{GLNN}_{\text{MLP}^{*}}}$ 0.36(0.47%) 3.21(3.89%) 0.97(1.22%) 0.78(1.17%) 0.58(0.75%) 3.05(3.59%) 0.28(0.54%) 3.54(5.18%)
$\Delta_{\text{NOSMOG}_{\text{MLP}}}$ 0.39(0.50%) 1.27(1.48%) 0.37(0.46%) -0.58(-0.85%) 1.04(1.33%) 3.30(3.75%) 3.39(6.55%) 7.36(10.23%)
$\Delta_{\text{GIN}}$ -2.15(-2.71%) -7.41(-7.96%) 2.66(3.42%) -14.72(-17.86%) -1.54(-1.93%) -3.44(-3.77%) -3.68(-6.62%) -4.51(-5.90%)
GA-MLP 75.74±2.68 90.62±3.81 75.71±1.73 75.93±1.98 79.95±3.02 88.45±2.36 54.81±0.19 71.55±1.08
GA-MLP+LaPE 75.47±2.58 87.42±3.67 78.85±2.49 71.61±2.02 78.45±3.26 89.62±2.86 51.91±0.18 71.78±1.48
GLNN$_{\text{GA-MLP}}$ 75.76±3.41 91.97±3.92 76.73±2.52 75.45±2.28 80.20±3.19 88.07±1.83 54.88±0.26 73.74±0.92
GLNN$_{\text{GA-MLP+LaPE}}$ 76.28±2.61 87.39±2.79 79.86±2.31 71.94±2.14 79.40±3.92 89.55±2.21 52.85±0.27 72.91±0.86
NOSMOG$_{\text{GA-MLP}}$ 78.35±2.74 88.78±2.32 80.41±3.57 74.84±2.92 79.10±3.72 89.09±1.64 51.36±0.46 73.67±1.31
MuGSI$_{\text{GA-MLP}^{*}}$ (ours) 78.26±4.78 93.10±2.81 81.57±2.24 76.86±2.33 80.31±3.36 90.91±2.05 55.63±0.31 76.38±0.95
$\Delta_{\text{GA-MLP}}$ 2.52(3.33%) 2.48(2.74%) 5.86(7.74%) 0.93(1.22%) 0.36(0.45%) 2.46(2.78%) 0.82(1.50%) 4.83(6.75%)
$\Delta_{\text{GLNN}_{\text{GA-MLP}^{*}}}$ 1.98(2.60%) 1.13(1.23%) 1.71(2.14%) 1.41(1.87%) 0.11(0.14%) 1.36(1.52%) 0.75(1.37%) 2.64(3.58%)
$\Delta_{\text{NOSMOG}_{\text{GA-MLP}}}$ -0.08(-0.11%) 4.32(4.64%) 1.16(1.42%) 2.02(2.63%) 1.21(1.50%) 1.82(2.01%) 4.27(7.68%) 2.71(3.68%)
$\Delta_{\text{GIN}}$ -0.99(-1.25%) 0.01(0.01%) 3.90(5.02%) -5.57(-6.76%) 0.71(0.89%) -0.43(-0.48%) 0.06(0.11%) -0.05(-0.07%)
Table 2. Experiment results where the teacher model is GIN and the student models are MLP, MLP+LaPE, GA-MLP and GA-MLP+LaPE. Both the absolute and the relative improvements are reported. As shown in the table, MuGSI outperforms the other baseline methods on almost all datasets across the different MLP-type student architectures. With GA-MLP as the student model, MuGSI performs comparably to the teacher GIN model on 7/8 datasets.

5.1. Experiment Settings

Datasets. We use 6 small real-world datasets and 2 large real-world datasets to evaluate our proposed framework. Among the 6 small real-world datasets from TUDataset (Morris et al., 2020), PROTEINS (Dobson and Doig, 2003), NCI1 (Wale and Karypis, 2006), BZR (doi:10.1021/ci034143r) and DD (Shervashidze et al., 2011; Dobson and Doig, 2003) are bioinformatics datasets, while REDDIT-BINARY and IMDB-BINARY are social network datasets. As no node features are provided for the social network datasets, we use one-hot encodings of node degrees as their node features. For the 2 large real-world datasets, we use CIFAR10 from Benchmarking GNNs (Dwivedi et al., 2022) and MolHIV from Open Graph Benchmark (Hu et al., 2021). See Appendix A.6 for the dataset statistics.

Model Architectures. As a model-agnostic framework, MuGSI can be combined with any teacher GNN architecture. In this work, we adopt three GNN teacher model architectures: GIN (Xu et al., 2019), GCN (Kipf and Welling, 2016) and KPGIN (Feng et al., 2022a). For student model architectures, MLP and GA-MLP are both adopted to thoroughly evaluate MuGSI's performance with students of different expressiveness levels. For GA-MLP, a simplified version is used with 1-hop neighborhood aggregation, i.e., $\Omega = \left\{\mathbf{I}, \mathbf{A}\mathbf{D}^{-1}\right\}$ with $\phi$ being the identity function, using the notation from Section 2.3. This simplified version allows pre-computation, so the time complexity of this GA-MLP architecture is close to that of a standard MLP.
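The pre-computation for the simplified 1-hop GA-MLP can be sketched as follows. This is a minimal illustration assuming a dense, undirected adjacency matrix and concatenation of the operator outputs; the function name is hypothetical and the released code may organise this step differently.

```python
import numpy as np

def ga_mlp_features(X, A):
    """Pre-compute the 1-hop GA-MLP input [I X, (A D^{-1}) X] with identity phi."""
    deg = A.sum(axis=0)  # column sums; equal to node degrees for an undirected graph
    d_inv = np.where(deg > 0, 1.0 / np.maximum(deg, 1e-12), 0.0)
    propagated = (A * d_inv[None, :]) @ X  # scale column j of A by 1 / deg_j, then aggregate
    return np.concatenate([X, propagated], axis=1)

# Path graph 0 - 1 - 2 with one scalar feature per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])
features = ga_mlp_features(X, A)
# features == [[1, 1], [2, 4], [3, 1]]
```

Since the propagated features depend only on the input graph, they can be computed once before training, after which the student is an ordinary MLP over the widened feature vector.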

Baselines. We consider several baseline methods to facilitate a comprehensive evaluation of our proposed framework. MLP: We use a plain MLP as the basis for comparison with more advanced methods. GLNN$_{\text{MLP}}$: This method distills student MLPs using soft labels, similarly to GLNN (Zhang et al., 2022), except that a graph pooling function is used to obtain a graph-level representation. MLP+LaPE: Here, we augment the node features of the MLP with Laplacian eigenvector positional encodings (LaPE) to increase the expressiveness of the student MLP. GLNN$_{\text{MLP+LaPE}}$: This method combines the MLP with both LaPE and soft-logits-based KD, serving as a more advanced variant of GLNN. We use the same experimental setting for GA-MLP; specifically, the baseline methods are GA-MLP, GLNN$_{\text{GA-MLP}}$, GA-MLP+LaPE, and GLNN$_{\text{GA-MLP+LaPE}}$. NOSMOG: We also adopt NOSMOG (Tian et al., 2023) as another strong baseline. As DeepWalk (Perozzi et al., 2014) generates node embeddings in a transductive manner, which is not suitable for graph classification, we replace this component in NOSMOG with LaPE. We use NOSMOG$_{\text{MLP}}$ and NOSMOG$_{\text{GA-MLP}}$ to denote NOSMOG applied to the student MLP and GA-MLP, respectively. Note that LaPE is therefore an inherent component of NOSMOG, injecting structural features into the student models. Finally, we denote by MLP$^{*}$ the better-performing model between MLP and MLP+LaPE, and similarly by GA-MLP$^{*}$ the better-performing model between GA-MLP and GA-MLP+LaPE.

Evaluation Protocol. For the 6 real-world datasets from TUDataset, we use the standard stratified splits (Xu et al., 2019) and perform 10-fold cross-validation with 90% training and 10% testing, reporting the mean of the best test results. The teacher GNN model for each fold is saved based on its best test result and is hence consistent with the reported test results of the student models. For CIFAR10, we use the standard split of 45,000 training, 5,000 validation, and 10,000 test graphs, and report the test classification accuracy at the best validation accuracy. For MolHIV, we follow the scaffold split (Ramsundar et al., 2019; Hu et al., 2020) with a train/validation/test ratio of 80%:10%:10%, and report the test ROC-AUC at the best validation ROC-AUC.
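For readers unfamiliar with stratified cross-validation, the following is a generic NumPy sketch of a stratified 10-fold split. It is not the exact split of Xu et al. (2019); the function name, the seed, and the toy label vector are assumptions made for illustration.

```python
import numpy as np

def stratified_kfold(labels, n_splits, seed):
    """Yield (train_idx, test_idx) pairs whose folds preserve the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(labels):
        # Shuffle the graphs of this class and deal them to the folds round-robin.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, j in enumerate(idx):
            folds[i % n_splits].append(int(j))
    for k in range(n_splits):
        test = np.array(sorted(folds[k]))
        train = np.array(sorted(j for m, f in enumerate(folds) if m != k for j in f))
        yield train, test

# Toy dataset: 100 graphs with a 60/40 class balance.
labels = np.array([0] * 60 + [1] * 40)
splits = list(stratified_kfold(labels, n_splits=10, seed=0))
```

Each of the 10 test folds in this toy example holds 10% of the graphs, with 6 graphs of class 0 and 4 of class 1.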

Teacher Student PROTEINS IMDB-BINARY DD BZR
- GA-MLP 75.74±2.68 79.95±3.02 75.71±1.73 90.62±3.81
- GA-MLP+LaPE 75.47±2.58 78.45±3.26 78.85±2.49 89.62±2.86
GCN - 76.28±2.71 79.27±4.16 76.31±1.44 89.88±3.38
GCN GLNN$_{\text{GA-MLP}}$ 75.57±2.73 80.01±4.05 75.40±3.09 92.08±2.52
GCN GLNN$_{\text{GA-MLP+LaPE}}$ 77.09±2.83 79.60±3.13 79.62±2.12 88.68±3.66
GCN MuGSI$_{\text{GA-MLP}^{*}}$ 77.69±2.67 81.09±3.91 80.52±2.29 91.94±3.07
GCN $\Delta_{\text{GA-MLP}}$ 1.95(2.57%) 1.14(1.43%) 4.81(6.35%) 1.32(1.46%)
GCN $\Delta_{\text{GLNN}_{\text{GA-MLP}^{*}}}$ 0.59(0.77%) 1.07(1.35%) 0.89(1.13%) -0.14(-0.15%)
GCN $\Delta_{\text{GCN}}$ 1.41(1.85%) 1.82(2.30%) 4.21(5.52%) 2.06(2.29%)
KPGIN - 78.56±3.17 80.30±4.37 81.07±2.83 93.11±2.51
KPGIN GLNN$_{\text{GA-MLP}}$ 76.01±2.56 80.50±4.01 76.14±3.29 91.72±2.31
KPGIN GLNN$_{\text{GA-MLP+LaPE}}$ 76.37±3.84 79.80±2.84 81.23±3.57 89.17±3.99
KPGIN MuGSI$_{\text{GA-MLP}^{*}}$ 77.13±2.53 81.04±3.82 82.64±3.31 92.89±3.54
KPGIN $\Delta_{\text{GA-MLP}}$ 1.39(1.84%) 1.09(1.36%) 6.93(9.15%) 2.27(2.50%)
KPGIN $\Delta_{\text{GLNN}_{\text{GA-MLP}^{*}}}$ 0.75(1.00%) 0.54(0.67%) 1.40(1.74%) 1.17(1.28%)
KPGIN $\Delta_{\text{KPGIN}}$ -1.43(-1.82%) 0.74(0.92%) 1.57(1.94%) -0.22(-0.24%)
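Table 3. Experiment results with different teacher GNN model architectures, where the student model is GA-MLP. A dash in the Teacher column denotes a student trained without a teacher, and a dash in the Student column denotes the teacher model itself.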

5.2. How Does MuGSI Perform for Student MLPs? (RQ1)

We first evaluate MuGSI with MLP student models and compare it with the MLP-related baseline methods. The results are reported in Table 2, from which we make several observations: (1) Incorporating Laplacian eigenvectors into vanilla MLP models improves their classification performance on several datasets, with notable relative improvements of 4.56% on PROTEINS, 7.96% on DD, and 4.81% on REDDIT-BINARY. (2) While soft logits (Hinton et al., 2015) have shown significant benefits for node classification (Zhang et al., 2022), their impact on graph classification is negligible for most datasets. This finding aligns with our analysis. (3) Our proposed MuGSI$_{\text{MLP}^{*}}$ consistently outperforms GLNN$_{\text{MLP}}$ and GLNN$_{\text{MLP+LaPE}}$ across the datasets. Notably, MuGSI outperforms GLNN$_{\text{MLP}^{*}}$ by 3.89% on BZR, 3.59% on REDDIT-BINARY, and 5.18% on MolHIV in relative terms, highlighting the effectiveness of our framework for graph classification tasks. (4) NOSMOG also performs well on several datasets thanks to its representational similarity distillation component, which aligns the distribution over local substructures between the teacher GNNs and the student MLPs. This component shares the same functional form as the inter-cluster distillation loss $\mathcal{L}_{\mathcal{C}}$; however, it operates at the node level and incurs a high space complexity of $\mathcal{O}(|\mathcal{V}|^{2})$.
In contrast, the path-consistency distillation loss $\mathcal{L}_{\mathcal{P}}$ is more memory efficient ($\mathcal{O}(1)$ space complexity, since the random-walk path length is a fixed constant). Furthermore, the multi-granularity structural distillation introduced in MuGSI generally outperforms NOSMOG, which relies solely on node-level structural distillation. (5) However, we also notice that the improvements are slight on several datasets. We hypothesize that this could be attributed to the limited expressive power of the student model architecture and the constraints imposed by the small size of the input feature space. These results lead us to consider an intriguing question: what might be achieved with a more expressive student model?

5.3. How Does MuGSI Perform for More Expressive Student Architecture? (RQ2)

We adopt a 1-hop GA-MLP as the student model in this experiment. The results are reported in Table 2, from which we make several observations: (1) MuGSI$_{\text{GA-MLP}^{*}}$ achieves the best performance among the student models on 7/8 datasets, demonstrating its effectiveness. (2) For several datasets, the enhanced expressiveness of the student model yields a larger performance gain over the corresponding GLNN baseline: the relative gain grows from 0.47% with MLP students to 2.60% with GA-MLP students on PROTEINS, and from 1.22% to 2.14% on DD, suggesting that a more expressive learner can sometimes be a "smarter" learner. (3) For several datasets such as DD and IMDB-BINARY, GA-MLP on its own, without the aid of knowledge distillation, already achieves performance comparable or even superior to that of the teacher GIN model. Nevertheless, MuGSI further enhances the student model's performance. (4) With GA-MLP as the student model, MuGSI$_{\text{GA-MLP}^{*}}$ performs on par with the teacher model on 7/8 datasets and surpasses it on 4/8 datasets. This shows the effectiveness of our proposed knowledge distillation framework.

Datasets MLP* w/ GraphKD w/ ClusterKD w/ NodeKD MuGSI$_{\text{MLP}^{*}}$ $\Delta_{GraphKD}$ $\Delta_{ClusterKD}$ $\Delta_{NodeKD}$ $\Delta_{MuGSI}$
PROTEINS 75.92±2.63 76.28±4.31 76.73±3.71 76.49±3.33 77.10±3.59 0.36(0.47%) 0.81(1.07%) 0.57(0.75%) 1.18(1.55%)
BZR 81.73±2.21 84.20±3.48 84.68±3.24 83.71±3.46 85.68±2.26 2.47(3.02%) 2.95(3.61%) 1.98(2.42%) 3.95(4.83%)
DD 79.45±2.79 79.87±2.95 80.13±3.25 79.96±2.53 80.33±2.76 0.42(0.53%) 0.68(0.86%) 0.51(0.64%) 0.88(1.11%)
REDDIT-BINARY 84.70±2.44 85.85±2.31 86.73±2.28 86.11±1.97 87.91±1.37 1.15(1.36%) 2.03(2.4%) 1.41(1.66%) 3.21(3.79%)
Table 4. Ablation study for the independent components in MuGSI$_{\text{MLP}^{*}}$, in which the teacher model is GIN. As shown in the table, each independent component in MuGSI makes a positive contribution to the KD process.

5.4. How Does MuGSI Perform for Different Teachers? (RQ3)

As the expressive power of GIN is upper bounded by 1-WL (Xu et al., 2019; Morris et al., 2019), a growing body of work has recently been proposed to enhance the expressivity of message-passing GNNs. To explore how teacher model architectures with different levels of expressiveness affect the knowledge distillation process, we adopt two further teacher model architectures: GCN (Kipf and Welling, 2016) and KPGIN (Feng et al., 2022a). The expressive power of GCN is also upper bounded by 1-WL, while KPGIN is a $K$-hop message-passing GNN model with peripheral subgraph information, which is strictly more powerful than 1-WL and is upper bounded by 3-WL. We adopt GA-MLP and GA-MLP+LaPE as the student models.

As shown in Table 3, we observe the following. (1) With either GCN or KPGIN as the teacher model, MuGSI$_{GA-MLP^{*}}$ outperforms GLNN$_{GA-MLP^{*}}$ on most datasets, which demonstrates that MuGSI is effective as a model-agnostic KD framework. (2) The performance of vanilla GA-MLP is on par with, or even superior to, that of GCN, e.g., on DD and IMDB-BINARY. Nevertheless, distilling knowledge from GCN into GA-MLP using MuGSI still benefits the student significantly. For instance, on DD the accuracy of GA-MLP improves from 78.85% to 80.52% with MuGSI although GCN achieves only 76.31%; similarly, on IMDB-BINARY the accuracy of GA-MLP improves from 79.95% to 81.09% with MuGSI, while GCN achieves 79.27%. Furthermore, MuGSI$_{GA-MLP^{*}}$ outperforms GCN on all 4 datasets by a large margin. (3) With the more powerful teacher model KPGIN, MuGSI$_{GA-MLP^{*}}$ also consistently outperforms GLNN$_{GA-MLP^{*}}$. Notably, even though the expressiveness of KPGIN goes beyond 1-WL (bounded by 3-WL), a 1-hop student GA-MLP trained with MuGSI achieves comparable or superior performance.

Figure 2. Average prediction error and entropy of GIN and MuGSI$_{GA-MLP^{*}}$ when sequentially inserting 10 nodes back into the graphs. As demonstrated, MuGSI$_{GA-MLP^{*}}$ is more robust and less susceptible to topological changes.

5.5. How Robust and Efficient is MuGSI in Dynamic Environments? (RQ4)

In practical production environments, graphs are often dynamic, with nodes being inserted or removed over time. Taking REDDIT-BINARY as an example, each node in a graph corresponds to a user engaged in a discussion thread, and edges represent interactions between these users. As nodes are added or removed, distribution shift may arise. In this section, we examine how the teacher GIN model and the student GA-MLP perform under this scenario. We use the first fold of the REDDIT-BINARY dataset for our experiments. For each graph $G$ in the test set, we first randomly remove 10 nodes and then insert them back sequentially, recovering the original graph $G$. This process is repeated 20 times for each graph in the test set. As we only remove a small fraction of nodes (2%-3% at most) from each graph, it is reasonable to assume that the graph's label remains unchanged. We calculate two metrics: (1) Average prediction error. For each perturbed graph with $k$ inserted nodes, where $k \in [0, 10]$, we assess whether its predicted label matches that of the original graph. The error is binary: 0 for a match and 1 otherwise, and we average the error across all perturbations for each $k$. (2) Average entropy. Instead of recording a binary variable, we compute the Shannon entropy of the predicted label distribution for each perturbed graph with $k$ insertions. For incorrect predictions, we set the entropy to its maximum value (i.e., 1.0 for binary classification). This metric quantifies the confidence of the model predictions.
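The two metrics described above can be sketched as follows for a single test graph and a single value of $k$. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, entropy is measured in bits (so that the binary maximum is 1.0), the decision threshold is 0.5, and an "incorrect" prediction is taken to mean one that disagrees with the label predicted for the original graph.

```python
import math
from typing import List, Tuple


def binary_entropy(p_pos: float) -> float:
    """Shannon entropy, in bits, of the two-class distribution [1 - p_pos, p_pos]."""
    if p_pos <= 0.0 or p_pos >= 1.0:
        return 0.0
    return -(p_pos * math.log2(p_pos) + (1.0 - p_pos) * math.log2(1.0 - p_pos))


def perturbation_metrics(reference_label: int,
                         perturbed_probs: List[float]) -> Tuple[float, float]:
    """Average prediction error and average entropy over perturbed copies of a graph.

    reference_label: label predicted for the unperturbed graph (0 or 1).
    perturbed_probs: predicted probability of class 1 for each perturbed copy
                     that has the same number k of re-inserted nodes.
    """
    errors, entropies = [], []
    for p in perturbed_probs:
        predicted = 1 if p >= 0.5 else 0  # assumed decision threshold
        mismatch = predicted != reference_label
        errors.append(1.0 if mismatch else 0.0)
        # Assumption: mismatching predictions receive the maximum entropy (1 bit).
        entropies.append(1.0 if mismatch else binary_entropy(p))
    count = len(perturbed_probs)
    return sum(errors) / count, sum(entropies) / count


avg_error, avg_entropy = perturbation_metrics(1, [1.0, 0.5, 0.0])
print(avg_error, avg_entropy)
```

Averaging these per-graph values over all test graphs and over the 20 repetitions would give one point per $k$ on each curve.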

As depicted in Figure 2, the average prediction error of MuGSI$_{GA-MLP^{*}}$ is significantly lower than that of GIN, despite comparable accuracies on unperturbed test graphs (90.1% for MuGSI$_{GA-MLP^{*}}$ vs. 89.86% for GIN). Specifically, GIN's accuracy drops by 7.76% upon the removal of 10 nodes, while the accuracy of MuGSI$_{GA-MLP^{*}}$ decreases by only 2.77%, which demonstrates the robustness of the student model. We hypothesize that the robustness of MuGSI$_{GA-MLP^{*}}$ arises from the structural information retained in the model parameters during the knowledge distillation process, which is orthogonal to topological changes. Additionally, the receptive field of GIN (5 hops in this case) is much larger than that of a 1-hop GA-MLP, so GIN is more susceptible to topological changes. Despite its higher average prediction error, GIN's predictions exhibit greater confidence (i.e., lower entropy) than those of MuGSI$_{GA-MLP^{*}}$.

Regarding efficiency, as illustrated in Figure 3, GA-MLP is substantially faster than GIN. This is because GIN must take the entire input $\mathbf{A}$ and $\mathbf{X}$ to recompute its prediction after every insertion, whereas for GA-MLP with $sum$ pooling as the readout function, we can first obtain a static representation of the graph with 10 nodes removed; then, for each node inserted back, we only need to compute its representation, add it to the static representation, and apply a linear transformation. The static representation can thus be updated with one additional operation. This procedure significantly reduces computational overhead, allowing GA-MLP to achieve an average incremental inference time of 0.59ms on a CPU machine, which is 17.18x faster than GIN on a CPU machine and 4.98x faster than GIN on a CUDA machine. This efficiency makes the student model deployable in resource-constrained environments.
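The incremental update described above can be illustrated with a small, dependency-free sketch. It is not the released implementation: `toy_encoder` is a hypothetical stand-in for the per-node MLP, the class and function names are illustrative, and the final linear classification layer is omitted. Following the description above, each node's input features are taken as given when the node is inserted.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def add_vectors(u: Sequence[float], v: Sequence[float]) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [a + b for a, b in zip(u, v)]


def full_sum_readout(encode: Callable[[Sequence[float]], Vector],
                     nodes: Sequence[Sequence[float]], dim: int) -> Vector:
    """Recomputes the sum-pooled representation from scratch (the costly path)."""
    pooled = [0.0] * dim
    for features in nodes:
        pooled = add_vectors(pooled, encode(features))
    return pooled


class IncrementalSumReadout:
    """Caches the pooled representation and updates it per inserted node."""

    def __init__(self, encode: Callable[[Sequence[float]], Vector],
                 static_pooled: Sequence[float]) -> None:
        self.encode = encode
        self.pooled = list(static_pooled)

    def insert(self, features: Sequence[float]) -> Vector:
        # One encoder call plus one vector addition per new node.
        self.pooled = add_vectors(self.pooled, self.encode(features))
        return self.pooled


def toy_encoder(features: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the per-node MLP: doubles every feature."""
    return [2.0 * x for x in features]


remaining = [[1.0, 2.0], [3.0, 4.0]]   # nodes left after the removal step
reinserted = [[5.0, 6.0], [7.0, 8.0]]  # nodes inserted back one by one

readout = IncrementalSumReadout(toy_encoder, full_sum_readout(toy_encoder, remaining, 2))
for features in reinserted:
    readout.insert(features)

# The cached result agrees with a full recomputation over all nodes.
assert readout.pooled == full_sum_readout(toy_encoder, remaining + reinserted, 2)
```

Because sum pooling is additive over nodes, the cost of each insertion does not depend on the size of the graph, whereas a message-passing model has to rerun over the whole updated graph.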

5.6. How Does Each Component Perform in MuGSI? (RQ5)

As MuGSI consists of three components for multi-scale structural knowledge distillation, we explore how each individual component affects the KD process on several datasets. To ensure a fair comparison, MLP is adopted as the baseline, since MLP is the student model used for knowledge distillation. The three components are named GraphKD, ClusterKD, and NodeKD, as shown in Table 4. We can see that: (1) each individual component makes a positive contribution to the KD process; (2) on all 4 datasets, ClusterKD brings the largest performance gain among the individual components; (3) MuGSI, which distills structural knowledge jointly across all three granularities, outperforms every individual component, showcasing the effectiveness of distilling multi-granularity structural information.
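As a worked check of how the $\Delta$ columns of Table 4 relate to the accuracy columns, the snippet below recomputes the absolute gain over the MLP* baseline and the relative gain in percent for the PROTEINS row. The helper name `gain` is illustrative; the numbers are copied from Table 4.

```python
from typing import Tuple


def gain(baseline: float, accuracy: float) -> Tuple[float, float]:
    """Absolute gain (accuracy points) and relative gain (%) over the baseline."""
    delta = accuracy - baseline
    return round(delta, 2), round(100.0 * delta / baseline, 2)


# PROTEINS row of Table 4: the baseline MLP* reaches 75.92.
baseline = 75.92
variants = [("GraphKD", 76.28), ("ClusterKD", 76.73), ("NodeKD", 76.49), ("MuGSI", 77.10)]
for name, accuracy in variants:
    print(name, gain(baseline, accuracy))
```

For this row the recomputed values match the table: 0.36 (0.47%), 0.81 (1.07%), 0.57 (0.75%), and 1.18 (1.55%).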

Figure 3. Average inference time of GIN and MuGSI$_{GA-MLP^{*}}$ when sequentially inserting 10 nodes back into the graphs.

5.7. How Do Different Hyper-parameters Affect the Performance of MuGSI? (RQ6)

We first provide a sensitivity analysis for $\lambda$, $\mu$, and $\eta$, which control the strength of each distillation component in MuGSI. We perform a grid search over $\lambda \in \{1.0, 1e-1, 1e-2\}$, $\mu \in \{1.0, 1e-1, 1e-2\}$, and $\eta \in \{1e-4, 1e-5\}$, leading to 18 models with different hyper-parameter combinations. We index these models from 0 to 17, as illustrated in Figure 4. As we can see, the correlation between MuGSI$_{MLP^{*}}$ and MuGSI$_{GA-MLP^{*}}$ on REDDIT-BINARY is much higher than that on BZR, possibly because REDDIT-BINARY is a much larger dataset than BZR (2000 samples vs. 405 samples). Furthermore, MLP and GA-MLP both utilize Laplacian eigenvectors on REDDIT-BINARY, whereas on BZR the MLP student is MLP+LaPE and the GA-MLP student is plain GA-MLP. This may lead to different inductive biases for MuGSI$_{MLP^{*}}$ and MuGSI$_{GA-MLP^{*}}$ on BZR, and hence to a lower correlation between the two student models across hyper-parameter combinations.
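The grid above can be enumerated in a few lines. This is only a sketch: the dictionary keys are illustrative names, and the order in which combinations are indexed is an assumption, since the text does not state how indices 0 to 17 map to specific combinations.

```python
from itertools import product

# Search values as stated in the text.
LAMBDAS = (1.0, 1e-1, 1e-2)
MUS = (1.0, 1e-1, 1e-2)
ETAS = (1e-4, 1e-5)

# Assumed ordering: lambda varies slowest, eta fastest.
grid = [
    {"lambda": lam, "mu": mu, "eta": eta}
    for lam, mu, eta in product(LAMBDAS, MUS, ETAS)
]

for index, config in enumerate(grid):
    print(index, config)
```

The grid has 3 x 3 x 2 = 18 entries, which matches the 18 models reported above.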

The random-walk path length is another key hyper-parameter in the path consistency loss $\mathcal{L}_{\mathcal{P}}$. We conduct an ablation study on the impact of various random-walk path lengths (Table 5), from which several observations can be made: (1) The optimal random-walk path length appears to depend on the topological structure of the graphs in each dataset, so the most effective path length is not universal but dataset-specific. (2) Longer random-walk paths generally yield sub-optimal results. One possible explanation is that longer paths introduce additional noise when the knowledge distillation process tries to capture local substructures.

Figure 4. Mean best accuracy for different hyper-parameter combinations for MuGSI$_{MLP^{*}}$ and MuGSI$_{GA-MLP^{*}}$.
Path length | PROTEINS | DD | BZR | IMDB-BINARY
$\mathcal{L}_{4}$ | 77.54±2.92 | 81.15±1.93 | 92.63±3.68 | 81.03±3.26
$\mathcal{L}_{8}$ | 78.26±4.78 | 81.57±2.24 | 93.10±2.81 | 80.31±3.36
$\mathcal{L}_{12}$ | 77.89±3.45 | 80.98±2.08 | 92.80±3.64 | 80.81±3.25
$\mathcal{L}_{16}$ | 78.05±3.84 | 80.31±2.33 | 92.66±2.46 | 80.20±3.96
Table 5. Analysis of the effect of random walk path lengths on MuGSI$_{GA-MLP^{*}}$ with GIN as the teacher model. Here, $\mathcal{L}_{i}$ denotes a random walk path of length $i$.

6. CONCLUSION

In this paper, we identify an under-explored problem, GNN-to-MLP distillation for graph classification, and analyze why existing GNN-to-MLP KD frameworks are suboptimal for it. We then introduce MuGSI, the first GNN-to-MLP knowledge distillation framework for graph classification. MuGSI proposes a novel multi-granularity distillation loss to generate dense learning feedback and facilitate comprehensive knowledge transfer from the teacher model to the student model. MuGSI is model-agnostic, demonstrating comparable performance across a variety of teacher model architectures with a 1-hop GA-MLP as the student model. Moreover, MuGSI is robust and efficient in dynamic environments, which makes it a potent technique for tackling test-time distribution shift and enables fast inference in environments with limited computational resources.

References

  • Belkin and Niyogi (2003) Mikhail Belkin and Partha Niyogi. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comput. 15, 6 (jun 2003), 1373–1396. https://doi.org/10.1162/089976603321780317
  • Bengio et al. (2014) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2014. Representation Learning: A Review and New Perspectives. arXiv:1206.5538 [cs.LG]
  • Bevilacqua et al. (2021) Beatrice Bevilacqua, Fabrizio Frasca, Derek Lim, Balasubramaniam Srinivasan, Chen Cai, Gopinath Balamurugan, Michael M Bronstein, and Haggai Maron. 2021. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910 (2021).
  • Blondel et al. (2008) Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, 10 (oct 2008), P10008. https://doi.org/10.1088/1742-5468/2008/10/p10008
  • Bodnar et al. (2021) Cristian Bodnar, Fabrizio Frasca, Nina Otter, Yuguang Wang, Pietro Lio, Guido F Montufar, and Michael Bronstein. 2021. Weisfeiler and Lehman go cellular: CW networks. Advances in Neural Information Processing Systems 34 (2021), 2625–2640.
  • Bouritsas et al. (2022) Giorgos Bouritsas, Fabrizio Frasca, Stefanos Zafeiriou, and Michael M Bronstein. 2022. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2022), 657–668.
  • Chen et al. (2021) Lei Chen, Zhengdao Chen, and Joan Bruna. 2021. On Graph Neural Networks versus Graph-Augmented MLPs. In International Conference on Learning Representations. https://openreview.net/forum?id=tiqI7w64JG2
  • Chen et al. (2020) Ting Chen, Song Bian, and Yizhou Sun. 2020. Are Powerful Graph Neural Nets Necessary? A Dissection on Graph Classification. arXiv:1905.04579 [cs.LG]
  • Dobson and Doig (2003) Paul D. Dobson and Andrew J. Doig. 2003. Distinguishing Enzyme Structures from Non-enzymes Without Alignments. Journal of Molecular Biology 330, 4 (2003), 771–783. https://doi.org/10.1016/S0022-2836(03)00628-4
  • Dwivedi et al. (2022) Vijay Prakash Dwivedi, Chaitanya K. Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2022. Benchmarking Graph Neural Networks. arXiv:2003.00982 [cs.LG]
  • Dwivedi et al. (2021) Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2021. Graph neural networks with learnable structural and positional representations. arXiv preprint arXiv:2110.07875 (2021).
  • Fan et al. (2019) Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph Neural Networks for Social Recommendation. arXiv:1902.07243 [cs.IR]
  • Feng et al. (2022a) Jiarui Feng, Yixin Chen, Fuhai Li, Anindya Sarkar, and Muhan Zhang. 2022a. How Powerful are K-hop Message Passing Graph Neural Networks. arXiv preprint arXiv:2205.13328 (2022).
  • Feng et al. (2022b) Kaituo Feng, Changsheng Li, Ye Yuan, and Guoren Wang. 2022b. FreeKD. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM. https://doi.org/10.1145/3534678.3539320
  • Frasca et al. (2020) Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, and Federico Monti. 2020. SIGN: Scalable Inception Graph Neural Networks. arXiv:2004.11198 [cs.LG]
  • Gasteiger et al. (2021) Johannes Gasteiger, Florian Becker, and Stephan Günnemann. 2021. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems 34 (2021), 6790–6802.
  • Gong et al. (2013) Boqing Gong, Kristen Grauman, and Fei Sha. 2013. Connecting the Dots with Landmarks: Discriminatively Learning Domain-Invariant Features for Unsupervised Domain Adaptation. In Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 28), Sanjoy Dasgupta and David McAllester (Eds.). PMLR, Atlanta, Georgia, USA, 222–230. https://proceedings.mlr.press/v28/gong13.html
  • Gretton et al. (2012) Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A Kernel Two-Sample Test. J. Mach. Learn. Res. 13 (mar 2012), 723–773.
  • Gretton et al. (2008) Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. 2008. Covariate Shift by Kernel Mean Matching. In Dataset Shift in Machine Learning. The MIT Press. https://doi.org/10.7551/mitpress/9780262170055.003.0008
  • Guo et al. (2023) Zhichun Guo, Chunhui Zhang, Yujie Fan, Yijun Tian, Chuxu Zhang, and Nitesh Chawla. 2023. Boosting Graph Neural Networks via Adaptive Knowledge Distillation. arXiv:2210.05920 [cs.LG]
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 639–648.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
  • Hinton and Roweis (2002) Geoffrey E Hinton and Sam Roweis. 2002. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer (Eds.), Vol. 15. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf
  • Hu et al. (2021) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2021. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv:2005.00687 [cs.LG]
  • Hu et al. (2020) Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. arXiv:1905.12265 [cs.LG]
  • Huang et al. (2006) Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. 2006. Correcting Sample Selection Bias by Unlabeled Data. In Advances in Neural Information Processing Systems, B. Schölkopf, J. Platt, and T. Hoffman (Eds.), Vol. 19. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2006/file/a2186aa7c086b46ad4e8bf81e2a3a19b-Paper.pdf
  • Joshi et al. (2022) Chaitanya K. Joshi, Fayao Liu, Xu Xun, Jie Lin, and Chuan Sheng Foo. 2022. On Representation Knowledge Distillation for Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems (2022), 1–12. https://doi.org/10.1109/tnnls.2022.3223018
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kondor et al. (2018) Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. 2018. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144 (2018).
  • Lassance et al. (2019) Carlos Lassance, Myriam Bontonou, Ghouthi Boukli Hacene, Vincent Gripon, Jian Tang, and Antonio Ortega. 2019. Deep geometric knowledge distillation with graphs. arXiv:1911.03080 [cs.LG]
  • Li et al. (2019) Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. arXiv:1904.12787 [cs.LG]
  • Maron et al. (2019) Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. 2019. Provably powerful graph networks. Advances in neural information processing systems 32 (2019).
  • Maron et al. (2018) Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. 2018. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902 (2018).
  • Maron et al. (2020) Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. 2020. On learning sets of symmetric elements. In International Conference on Machine Learning. PMLR, 6734–6744.
  • Morris et al. (2020) Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. 2020. TUDataset: A collection of benchmark datasets for learning with graphs. arXiv:2007.08663 [cs.LG]
  • Morris et al. (2019) Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. 2019. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 4602–4609.
  • NT and Maehara (2019) Hoang NT and Takanori Maehara. 2019. Revisiting Graph Neural Networks: All We Have is Low-Pass Filters. arXiv:1905.09550 [stat.ML]
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. https://doi.org/10.1145/2623330.2623732
  • Ramsundar et al. (2019) B. Ramsundar, P. Eastman, P. Walters, and V. Pande. 2019. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More. O’Reilly Media. https://books.google.com/books?id=tYFKuwEACAAJ
  • Ren et al. (2022) Yating Ren, Junzhong Ji, Lingfeng Niu, and Minglong Lei. 2022. Multi-task Self-distillation for Graph-based Semi-Supervised Learning. arXiv:2112.01174 [cs.LG]
  • Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 12 (nov 2011), 2539–2561.
  • Thiede et al. (2021) Erik Thiede, Wenda Zhou, and Risi Kondor. 2021. Autobahn: Automorphism-based graph neural nets. Advances in Neural Information Processing Systems 34 (2021), 29922–29934.
  • Tian et al. (2023) Yijun Tian, Chuxu Zhang, Zhichun Guo, Xiangliang Zhang, and Nitesh Chawla. 2023. Learning MLPs on Graphs: A Unified View of Effectiveness, Robustness, and Efficiency. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=Cs3r5KLdoj
  • Vignac et al. (2020) Clement Vignac, Andreas Loukas, and Pascal Frossard. 2020. Building powerful and equivariant graph neural networks with structural message-passing. Advances in Neural Information Processing Systems 33 (2020), 14143–14155.
  • Wale and Karypis (2006) Nikil Wale and George Karypis. 2006. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification. In Sixth International Conference on Data Mining (ICDM’06). 678–689. https://doi.org/10.1109/ICDM.2006.39
  • Wu et al. (2019) Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. 2019. Simplifying Graph Convolutional Networks. arXiv:1902.07153 [cs.LG]
  • Wu et al. (2023b) Lirong Wu, Haitao Lin, Yufei Huang, Tianyu Fan, and Stan Z. Li. 2023b. Extracting Low-/High- Frequency Knowledge from Graph Neural Networks and Injecting it into MLPs: An Effective GNN-to-MLP Distillation Framework. arXiv:2305.10758 [cs.LG]
  • Wu et al. (2022) Lirong Wu, Haitao Lin, Yufei Huang, and Stan Z. Li. 2022. Knowledge Distillation Improves Graph Structure Augmentation for Graph Neural Networks. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=7yHte3tH8Xh
  • Wu et al. (2023a) Lirong Wu, Haitao Lin, Yufei Huang, and Stan Z. Li. 2023a. Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs. arXiv:2306.05628 [cs.LG]
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks? arXiv:1810.00826 [cs.LG]
  • Yan et al. (2020) Bencheng Yan, Chaokun Wang, Gaoyang Guo, and Yunkai Lou. 2020. TinyGNN: Learning Efficient Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD ’20). Association for Computing Machinery, New York, NY, USA, 1848–1856. https://doi.org/10.1145/3394486.3403236
  • Yanardag and Vishwanathan (2015) Pinar Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD ’15). Association for Computing Machinery, New York, NY, USA, 1365–1374. https://doi.org/10.1145/2783258.2783417
  • Yao et al. (2023) Tianjun Yao, Yingxu Wang, Kun Zhang, and Shangsong Liang. 2023. Improving the Expressiveness of K-Hop Message-Passing GNNs by Injecting Contextualized Substructure Information (KDD ’23). Association for Computing Machinery, New York, NY, USA, 12 pages. https://doi.org/10.1145/3580305.3599390
  • You et al. (2021) Jiaxuan You, Jonathan M Gomes-Selman, Rex Ying, and Jure Leskovec. 2021. Identity-aware graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10737–10745.
  • Zagoruyko and Komodakis (2017) Sergey Zagoruyko and Nikos Komodakis. 2017. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. arXiv:1612.03928 [cs.CV]
  • Zhang et al. (2023) Hanlin Zhang, Shuai Lin, Weiyang Liu, Pan Zhou, Jian Tang, Xiaodan Liang, and Eric P. Xing. 2023. Iterative Graph Self-Distillation. arXiv:2010.12609 [cs.LG]
  • Zhang et al. (2015) Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, and Juanzi Li. 2015. Panther: Fast Top-k Similarity Search in Large Networks. arXiv:1504.02577 [cs.SI]
  • Zhang and Li (2021) Muhan Zhang and Pan Li. 2021. Nested Graph Neural Networks. arXiv:2110.13197 [cs.LG]
  • Zhang et al. (2022) Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. 2022. Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation. arXiv:2110.08727 [cs.LG]
  • Zhang et al. (2020) Wentao Zhang, Xupeng Miao, Yingxia Shao, Jiawei Jiang, Lei Chen, Olivier Ruas, and Bin Cui. 2020. Reliable Data Distillation on Graph Convolutional Network. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1399–1414. https://doi.org/10.1145/3318464.3389706
  • Zhao et al. (2021) Lingxiao Zhao, Wei Jin, Leman Akoglu, and Neil Shah. 2021. From stars to subgraphs: Uplifting any GNN with local structure awareness. arXiv preprint arXiv:2110.03753 (2021).

Appendix A APPENDIX

A.1. More Experiment Setting and Implementation Details

Teacher GNN Models. The hyper-parameter search space for GCN and GIN is: number of layers $L=\{2,3,5\}$, dropout rate $p=\{0,0.5\}$, and hidden size $H=\{32,64\}$. The search space for KPGIN is: number of layers $L=\{2,3,4\}$, dropout rate $p=\{0,0.5\}$, number of hops $K=\{3,4\}$, and combine function $F=\{attention, geometric\}$. The kernel is the shortest path kernel and the hidden size is 64 for all the datasets. For TUDataset, the GNN model is selected based on the mean best validation accuracy. For CIFAR10, the teacher model is obtained according to the best accuracy on the test dataset, and for MolHIV, the teacher GNN model is obtained according to the best ROCAUC value on the test dataset.

Student MLP Models. The hyper-parameter search space for MLP and GA-MLP is: number of layers $L=\{3,4\}$. Dropout is not used for TUDataset and CIFAR10; for MolHIV, dropout is set to $0.5$. The hidden size of student models is set to 64 uniformly.

Model Training. For TUDataset, all models are trained for 350 epochs with an initial learning rate of $8e-3$, which is decayed by a factor of 0.6 with a patience of 30 epochs. For CIFAR10, all models are trained for 120 epochs with an initial learning rate of $8e-3$, which is decayed by a factor of 0.6 with a patience of 15 epochs. For MolHIV, all models are trained for 100 epochs with an initial learning rate of $1e-3$, which is decayed by a factor of 0.75 with a patience of 15 epochs. The batch size is $32$ for TUDataset and MolHIV, and $128$ for CIFAR10. All student models are trained 3 times and we report the average results with 1 standard deviation. We use the Adam optimizer (Kingma and Ba, 2017) across all the experiments.
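To make the decay rule concrete, the sketch below implements a reduce-on-plateau schedule in plain Python using the TUDataset values stated above. Several points are assumptions rather than details given in the text: that the schedule monitors a validation metric where higher is better, that the counter resets after each decay, and the class and dictionary names themselves.

```python
class PlateauDecay:
    """Multiplies the learning rate by `factor` after more than `patience`
    consecutive epochs without improvement of the monitored metric."""

    def __init__(self, initial_lr: float, factor: float, patience: int) -> None:
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, validation_metric: float) -> float:
        """Registers one epoch and returns the learning rate for the next one."""
        if validation_metric > self.best:
            self.best = validation_metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


# Values stated for TUDataset; the dictionary layout is illustrative.
TUDATASET = {"epochs": 350, "initial_lr": 8e-3, "factor": 0.6,
             "patience": 30, "batch_size": 32}

scheduler = PlateauDecay(TUDATASET["initial_lr"], TUDATASET["factor"],
                         TUDATASET["patience"])
```

In a PyTorch training loop, a built-in plateau scheduler would typically play this role; the class above only shows the arithmetic of the decay.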

KD Framework. For GLNN, the strength of $\mathcal{L}_{SL}$ is searched over $\{1.0, 1e-1, 1e-2, 1e-3\}$. For NOSMOG, the strength of $\mathcal{L}_{SL}$ is fixed to $1.0$, and the strength of the representational similarity loss is searched over $\{1.0, 1e-1, 1e-2, 1e-3\}$; the adversarial feature augmentation is not used because, in graph classification datasets, node features are typically represented as one-hot vectors with limited dimensions. For MuGSI, the strength of $\mathcal{L}_{SL}$ is fixed to $1.0$, and $\lambda$, $\mu$, $\eta$ in Eq. 11 are searched over $\{1.0, 1e-1, 1e-2\}$, $\{1.0, 1e-1, 1e-2\}$, and $\{1e-4, 1e-5\}$, respectively.

Other Implementations. To sample a random walk path for the path consistency loss $\mathcal{L}_{\mathcal{P}}$, we use generate_random_paths from networkx. The path length is fixed to 8 uniformly. The clustering algorithm for $\mathcal{L}_{\mathcal{C}}$ is the Louvain method (Blondel et al., 2008), for which we use the python-louvain package. The graph pooling function (readout function) is attention-based aggregation (Li et al., 2019) or summation. Moreover, the graph pooling and cluster pooling share the same pooling function (if attention-based aggregation is utilized) to improve generalizability.
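For readers who want to see what sampling a fixed-length random walk involves, the following dependency-free sketch draws a uniform random walk of 8 steps on a toy graph. It is a simplified stand-in written for illustration, not the networkx routine used in the paper: the function name, the adjacency-list format, and the uniform neighbor choice are all assumptions.

```python
import random
from typing import Dict, Hashable, List


def sample_random_walk(adjacency: Dict[Hashable, List[Hashable]],
                       start: Hashable,
                       path_length: int,
                       rng: random.Random) -> List[Hashable]:
    """Returns a walk with up to `path_length` steps, i.e. up to
    `path_length + 1` nodes including the start node."""
    path = [start]
    current = start
    for _ in range(path_length):
        neighbors = adjacency[current]
        if not neighbors:
            break  # dead end: an isolated node has nowhere to go
        current = rng.choice(neighbors)  # uniform choice among neighbors
        path.append(current)
    return path


# Toy 4-cycle: 0 - 1 - 2 - 3 - 0
toy_graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walk = sample_random_walk(toy_graph, start=0, path_length=8, rng=random.Random(0))
print(walk)
```

Passing an explicit `random.Random` instance keeps the sampled paths reproducible across runs, which helps when comparing path lengths as in Table 5.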

A.2. Complexity Analysis

Time Complexity. The inference-time complexity of the student MLP and GA-MLP models is identical to that of vanilla student models trained without knowledge distillation; however, the preprocessing and training stages incur some extra computational cost. Specifically, in the preprocessing stage, computing $\mathbf{A}\mathbf{D}^{-1}\mathbf{X}$ for the 1-hop neighborhood aggregation of GA-MLP takes $\mathcal{O}\left(|V|\bar{d}D\right)$ when $\mathbf{A}$ is a sparse matrix, where $D$ is the number of feature dimensions and $\bar{d}$ is the average node degree. Computing the clustering assignment with the Louvain method takes $\mathcal{O}\left(|V|\log|V|\right)$. The preprocessing is performed only once, hence the cost is affordable. During the training stage, in addition to the training cost of the teacher GNN model and the student MLP model, the extra computational cost comes from random walk path sampling for node-level distillation. As generate_random_paths from networkx follows the implementation of Zhang et al. (2015), its time complexity is $\mathcal{O}(RT\log\bar{d})$, which can be simplified to $\mathcal{O}(cRT)$, where $c$ is a small constant, $T$ is the number of steps in each path, $R$ is the number of random walk paths to sample, and $\bar{d}$ is the average node degree. Since $T$ and $R$ are typically small integers, the extra computational cost during the training stage is also affordable.
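The $\mathcal{O}\left(|V|\bar{d}D\right)$ preprocessing cost can be read off a direct edge-list implementation of $\mathbf{A}\mathbf{D}^{-1}\mathbf{X}$. The sketch below is illustrative only: it assumes an undirected, unweighted graph given as an edge list, and the function name is hypothetical.

```python
import numpy as np


def one_hop_aggregate(edges, num_nodes, X):
    """Compute A D^{-1} X from an undirected edge list.

    Row i of the result is sum_j A[i, j] * X[j] / deg(j). Every edge is
    visited twice and each visit costs O(D), so the total cost is
    O(|E| * D), i.e. O(|V| * avg_degree * D).
    """
    deg = np.zeros(num_nodes)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # D^{-1} X; the maximum guards against division by zero for isolated nodes.
    scaled = X / np.maximum(deg, 1)[:, None]
    out = np.zeros_like(scaled)
    for u, v in edges:
        out[u] += scaled[v]
        out[v] += scaled[u]
    return out


# Path graph 0 - 1 - 2 with 2-dimensional features.
edges = [(0, 1), (1, 2)]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_tilde = one_hop_aggregate(edges, num_nodes=3, X=X)
```

In practice a sparse matrix product gives the same result with the same asymptotic cost; the explicit loops are used here only to expose the operation count.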

Space Complexity. Given a graph with $n$ nodes and $m$ edges, MuGSI's space complexity is only $\mathcal{O}(n)$, which is lower than that of GIN (Xu et al., 2019) and GCN (Kipf and Welling, 2016), both of which are $\mathcal{O}(m)$. It is also significantly lower than that of more powerful GNNs, such as KPGIN (Feng et al., 2022a), which has a worst-case space complexity of $\mathcal{O}\left(n^{2}\right)$, and GIN-AK+ (Zhao et al., 2021), which has a space complexity of $\mathcal{O}(hm)$, where $h$ is the height of the extracted rooted subgraphs. Furthermore, MuGSI's space complexity is much lower than that of NOSMOG (Tian et al., 2023), which incurs a space complexity of $\mathcal{O}\left(n^{2}\right)$ during the training stage. Table 7 reports the training and test running time, as well as the memory consumption, of the various methods.

A.3. Pseudo-code for MuGSI

The pseudo-code of the proposed framework MuGSI is summarized in Algorithm 1.

Algorithm 1 Algorithm for the MuGSI Knowledge Distillation Framework
Input: Graph datasets $\mathcal{D}=\mathcal{D}_{L}\cup\mathcal{D}_{U}$, #epochs $E$, # paths to sample $R$, student model type $\mathcal{M}_{S}$
Output: Predicted labels $\mathcal{Y}_{U}$ and optimized network parameters of student model $\mathbf{\Theta}_{S}^{*}$
Randomly initialize parameters of teacher model $\Theta_{T}$ and student model $\Theta_{S}$.
Train multiple teacher GNN models with different hyper-parameters using $\mathcal{D}$, select the best GNN model $\Theta^{*}_{T}$
for each graph $G=\{\mathcal{V},\mathcal{E},\mathbf{X}\}$ in $\mathcal{D}$ do ▷ Preprocessing stage
     Compute the clustering assignment for each node $v\in G$ using the Louvain method
     Compute the top-$k$ non-trivial Laplacian eigenvectors $\mathbf{X}_{\text{LaPE}}$
     Set $\mathbf{X}\leftarrow\text{CONCAT}(\mathbf{X},\mathbf{X}_{\text{LaPE}})$ ▷ Optional
     if $\mathcal{M}_{S}$ is GA-MLP then
          Compute 1-hop neighborhood aggregation feature $\tilde{\mathbf{X}}=\mathbf{A}\mathbf{D}^{-1}\mathbf{X}$ ▷ For GA-MLP, the student model input is $\mathbf{X}$ and $\tilde{\mathbf{X}}$; for MLP, the model input is $\mathbf{X}$
     end if
end for
for epoch $\in\{1,2,\dots,E\}$ do
     Obtain $h_{G}^{T}$ and $h_{G}^{S}$, and calculate whole-graph distillation loss $\mathcal{L}_{\mathcal{G}}$ using Eq. 6
     Obtain $\mathbf{K}_{S}$ and $\mathbf{K}_{T}$ according to Eq. 7, and calculate inter-cluster distillation loss $\mathcal{L}_{\mathcal{C}}$ using Eq. 8
     Randomly sample $R$ random walk paths, compute $p(u|v)$ and $q(u|v)$ according to Eq. 9, and calculate path consistency distillation loss $\mathcal{L}_{\mathcal{P}}$ using Eq. 10
     Obtain ground-truth label $\mathcal{Y}_{L}$ from $\mathcal{D}_{L}$ and soft logits $\hat{\mathcal{Y}}_{L}$ from teacher model output, calculate final loss $\mathcal{L}$ using Eq. 11
end for
Predict $\mathcal{Y}_{U}$ from $\mathcal{D}_{U}$ using optimized student model $\Theta_{S}^{*}$
Return Predicted labels $\mathcal{Y}_{U}$ and student model $\Theta_{S}^{*}$
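The optional feature augmentation step in the preprocessing loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: it uses the unnormalized Laplacian $\mathbf{L}=\mathbf{D}-\mathbf{A}$, assumes a connected undirected graph, and reads "top-$k$ non-trivial" as the $k$ eigenvectors with the smallest non-zero eigenvalues, which is a common convention for Laplacian positional encodings. The function name is hypothetical.

```python
import numpy as np


def laplacian_positional_encoding(A, k):
    """Return k eigenvectors of L = D - A, skipping the trivial one.

    Assumes a connected, undirected graph, so exactly one eigenvalue is
    zero and its constant eigenvector carries no structural information.
    Eigenvector signs are arbitrary.
    """
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    return eigvecs[:, 1 : k + 1]


# Toy graph: a path on four nodes with 2-dimensional one-hot features.
A = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
X = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

X_lape = laplacian_positional_encoding(A, k=2)
X_aug = np.concatenate([X, X_lape], axis=1)  # X <- CONCAT(X, X_LaPE)
```

Because the sign of each eigenvector is not determined, pipelines that use such encodings often randomize or canonicalize the signs; that detail is left out of this sketch.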
Table 6. Experiment results for different pooling functions.
| Method | DD | PROTEINS | BZR | NCI1 | IMDB-BINARY | REDDIT-BINARY |
| --- | --- | --- | --- | --- | --- | --- |
| MuGSI w/ AttnPooling | 81.57±2.24 | 78.26±4.78 | 93.10±2.81 | 76.86±2.33 | 80.31±3.36 | 90.91±2.05 |
| MuGSI w/ SumPooling | 80.55±2.25 | 78.98±3.27 | 91.83±3.73 | 78.15±2.15 | 80.09±3.66 | 91.20±2.33 |
Table 7. Comparison of time and memory consumption for MuGSI and other baselines.
| Dataset | Metric | GIN | GLNN | NOSMOG | KPGIN | GIN-AK+ | MuGSI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PROTEINS | ACC | 79.25±3.22 | 76.28±2.61 | 78.35±2.74 | 78.56±3.17 | 78.62±3.01 | 78.26±4.78 |
| | Training Runtime (s/epoch) | 0.96 | 2.16 | 2.32 | 1.18 | 1.02 | 5.31 |
| | Testing Runtime (s/epoch) | 0.31 | 0.26 | 0.25 | 0.27 | 0.27 | 0.25 |
| | Max Allocated Memory (MB) | 32.3 | 14.12 | 217.63 | 70.17 | 2722 | 11.41 |
| REDDIT-BINARY | ACC | 91.35±1.58 | 89.55±2.21 | 89.09±1.64 | >24H | 94.8±0.8 | 90.91±2.05 |
| | Training Runtime (s/epoch) | 2.62 | 3.64 | 3.63 | | OOM | 22.15 |
| | Testing Runtime (s/epoch) | 0.26 | 0.25 | 0.26 | | | 0.25 |
| | Max Allocated Memory (MB) | 418.09 | 192.48 | 16538.79 | | | 121.72 |
| DD | ACC | 77.67±2.86 | 79.86±2.31 | 80.41±3.57 | 81.07±2.83 | 80.1±2.04 | 81.57±2.24 |
| | Training Runtime (s/epoch) | 0.76 | 3.12 | 3.37 | 1.76 | 2.92 | 12.53 |
| | Testing Runtime (s/epoch) | 0.19 | 0.21 | 0.22 | 0.23 | 0.31 | 0.2 |
| | Max Allocated Memory (MB) | 244.73 | 124.36 | 7096.61 | 899.66 | 13601.92 | 88.17 |

A.4. Discussion

In this section, we first establish a connection between the subgraph-level distillation loss $\mathcal{L}_{\mathcal{C}}$ and Maximum Mean Discrepancy, and then explain why MuGSI is effective for graph classification.

A.4.1. Relation To Maximum Mean Discrepancy.

Maximum Mean Discrepancy (MMD) is a widely used criterion in Domain Adaptation (Gong et al., 2013; Gretton et al., 2008; Huang et al., 2006), which compares distributions in the Reproducing Kernel Hilbert Space (RKHS) (Gretton et al., 2012). Assume we have two sets of samples, $\mathcal{X}=\left\{\boldsymbol{x}^{i}\right\}_{i=1}^{N}$ and $\mathcal{Y}=\left\{\boldsymbol{y}^{j}\right\}_{j=1}^{M}$, drawn from distributions $p$ and $q$, respectively. The squared MMD distance between $p$ and $q$ can be expressed as follows:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{MMD}^{2}}(\mathcal{X},\mathcal{Y})
&=\left\|\frac{1}{N}\sum_{i=1}^{N}\phi\left(\boldsymbol{x}^{i}\right)-\frac{1}{M}\sum_{j=1}^{M}\phi\left(\boldsymbol{y}^{j}\right)\right\|_{2}^{2}\\
&=\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{i^{\prime}=1}^{N}k\left(\boldsymbol{x}^{i},\boldsymbol{x}^{i^{\prime}}\right)+\frac{1}{M^{2}}\sum_{j=1}^{M}\sum_{j^{\prime}=1}^{M}k\left(\boldsymbol{y}^{j},\boldsymbol{y}^{j^{\prime}}\right)\\
&\quad-\frac{2}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M}k\left(\boldsymbol{x}^{i},\boldsymbol{y}^{j}\right),
\end{aligned}
\tag{12}
$$

where $\phi(\cdot)$ is an explicit mapping function that projects the sample vectors into a higher- or infinite-dimensional feature space, and $k(\cdot,\cdot)$ is the corresponding kernel function, i.e., $k(\boldsymbol{x},\boldsymbol{y})=\langle\phi(\boldsymbol{x}),\phi(\boldsymbol{y})\rangle$. The MMD loss is $0$ if and only if $p=q$ when the feature space corresponds to a universal RKHS. Minimizing the MMD loss is therefore equivalent to minimizing the distance between the distributions $p$ and $q$. Many kernels are valid for MMD. In the specific case of a polynomial kernel $k(\boldsymbol{x},\boldsymbol{y})=\left(\boldsymbol{x}^{\top}\boldsymbol{y}+c\right)^{d}$ with parameters $d=2$ and $c=0$, the resulting MMD is represented as follows:

$$
\mathcal{L}_{\mathrm{MMD}_{P}^{2}}\left(\mathbf{H}^{T}_{\mathcal{C}},\mathbf{H}^{S}_{\mathcal{C}}\right)=\left\|\mathbf{G}_{S}-\mathbf{G}_{T}\right\|_{F}^{2},
\tag{13}
$$

where $\mathbf{H}^{T}_{\mathcal{C}}\in\mathbb{R}^{N_{C}\times F}$ and $\mathbf{H}^{S}_{\mathcal{C}}\in\mathbb{R}^{N_{C}\times F}$ are the cluster-level representations of a given graph $G$ obtained from the teacher and student models, respectively, and $F$ denotes the hidden dimension size. $\mathbf{G}_{S},\mathbf{G}_{T}\in\mathbb{R}^{N_{C}\times N_{C}}$ are the Gram matrices with entries $g_{ij}=\bigl\langle\mathbf{h}_{\mathcal{C}_{i}},\mathbf{h}_{\mathcal{C}_{j}}\bigr\rangle$ (the subscripts $S$ and $T$ are omitted here). As illustrated in Eq. 7 and Eq. 8, the inter-cluster distillation loss $\mathcal{L}_{\mathcal{C}}$ is a slightly modified version of $\mathcal{L}_{\mathrm{MMD}_{P}^{2}}$, which aims to minimize the distance between the distributions over subgraphs (clusters) of the teacher model and the student model.
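The two forms of Eq. 12 can be checked numerically for the polynomial kernel with $d=2$ and $c=0$, whose explicit feature map is $\phi(\boldsymbol{x})=\mathrm{vec}(\boldsymbol{x}\boldsymbol{x}^{\top})$. The sketch below uses random toy data and hypothetical function names; `gram_loss` evaluates only the right-hand side of Eq. 13 and is not the authors' Eq. 8, which is described above as a slightly modified version.

```python
import numpy as np


def poly2_feature_map(Z):
    """phi(z) = vec(z z^T), the explicit map of k(x, y) = (x^T y)^2."""
    return np.einsum("ni,nj->nij", Z, Z).reshape(len(Z), -1)


def mmd2_explicit(X, Y):
    """Squared MMD as the distance between mean embeddings (first line of Eq. 12)."""
    diff = poly2_feature_map(X).mean(axis=0) - poly2_feature_map(Y).mean(axis=0)
    return float(diff @ diff)


def mmd2_kernel(X, Y):
    """Squared MMD via the kernel expansion (second line of Eq. 12)."""
    k_xx = (X @ X.T) ** 2
    k_yy = (Y @ Y.T) ** 2
    k_xy = (X @ Y.T) ** 2
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())


def gram_loss(H_t, H_s):
    """||G_S - G_T||_F^2 with cluster-by-cluster Gram matrices G = H H^T."""
    G_t = H_t @ H_t.T
    G_s = H_s @ H_s.T
    return float(np.sum((G_s - G_t) ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # N = 5 samples
Y = rng.normal(size=(7, 3))  # M = 7 samples
print(np.isclose(mmd2_explicit(X, Y), mmd2_kernel(X, Y)))
```

The kernel form never materializes $\phi$, which is what makes the same recipe usable for kernels whose feature space is infinite-dimensional.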

A.4.2. Why MuGSI is Effective for Graph Classification.

The core of MuGSI is the multi-granularity distillation loss, which is composed of a graph-level distillation loss, a cluster-level distillation loss, and a node-level distillation loss. For the cluster-level distillation loss $\mathcal{L}_{\mathcal{C}}$, we have shown in the previous subsection that it forces the student model to approximate the teacher GNN model in the distributional space over clusters. For the graph-level distillation loss $\mathcal{L}_{\mathcal{G}}$, assume that the graph representations $z_{G}^{T}$ and $z_{G}^{S}$ of a given graph $G$, generated by the teacher and student models respectively, follow Gaussian distributions with means $h_{G}^{T}$ and $h_{G}^{S}$ and the same covariance $\Sigma$, i.e., $z_{G}^{T}\sim\mathcal{N}\left(h_{G}^{T},\Sigma\right)$ and $z_{G}^{S}\sim\mathcal{N}\left(h_{G}^{S},\Sigma\right)$. Then the KL divergence between $z_{G}^{T}$ and $z_{G}^{S}$ is given by:

$$
\mathcal{D}_{KL}\left(z_{G}^{T},z_{G}^{S}\right)=\frac{1}{2}\left(h_{G}^{T}-h_{G}^{S}\right)^{\top}\Sigma^{-1}\left(h_{G}^{T}-h_{G}^{S}\right).
\tag{14}
$$

Consequently, the graph-level distillation loss $\mathcal{L}_{\mathcal{G}}$ serves to minimize this KL divergence, thereby ensuring that the distribution of $z_{G}^{S}$ closely approximates that of $z_{G}^{T}$. Finally, prior research has established that multi-step random walks are capable of extracting local substructures for any node $v\in G$ (Yao et al., 2023). In light of this, we employ random walks to calculate $p(u|v)$ and $q(u|v)$ as a surrogate loss in Eq. 10, thereby aligning the distribution over local substructures between the student and teacher models.
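Eq. 14 follows from the general closed-form KL divergence between two Gaussians: with a shared covariance, the trace and log-determinant terms cancel and only the Mahalanobis term remains. The following sketch verifies this numerically on random toy values; the function names and the test covariance are illustrative assumptions.

```python
import numpy as np


def kl_gaussians(mu0, cov0, mu1, cov1):
    """General closed-form KL( N(mu0, cov0) || N(mu1, cov1) )."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (
        np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - d + logdet1 - logdet0
    )


def kl_shared_cov(h_t, h_s, cov):
    """Eq. 14: with a shared covariance only the Mahalanobis term remains."""
    diff = h_t - h_s
    return 0.5 * diff @ np.linalg.solve(cov, diff)


rng = np.random.default_rng(0)
h_t = rng.normal(size=4)           # stand-in for the teacher mean h_G^T
h_s = rng.normal(size=4)           # stand-in for the student mean h_G^S
B = rng.normal(size=(4, 4))
cov = B @ B.T + np.eye(4)          # an arbitrary positive-definite covariance

print(np.isclose(kl_gaussians(h_t, cov, h_s, cov), kl_shared_cov(h_t, h_s, cov)))
```

In the special case $\Sigma=\mathbf{I}$, the expression reduces to half the squared Euclidean distance between the two mean embeddings.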

The proposed multi-granularity distillation loss addresses the challenges discussed in Section 1 by generating dense learning signals across multiple scales of graph structure. It also ensures a comprehensive transfer of structural knowledge by aligning multiple distributions between the student and teacher models, which our extensive experiments show to be both efficient and effective.

A.5. More Experimental Results

MuGSI with Different Pooling Functions. To ensure a fair comparison, we adopt AttentionalAggregation (Li et al., 2019) as the pooling function for the baseline methods as well as for MuGSI. To study the impact of the pooling function on MuGSI, we report additional results in which the student model is GA-MLP. As shown in Table 6, sum pooling outperforms attention pooling on PROTEINS, NCI1, and REDDIT-BINARY, and the gap between the two is below 1.3 points on every dataset, which suggests that MuGSI remains effective regardless of the pooling function.

Time and Memory Consumption. As illustrated in Table 7, MuGSI is significantly more memory-efficient than the other baseline methods. Notably, it consumes only about 1%-10% of the memory used by NOSMOG and by more powerful GNN models such as KPGIN (Feng et al., 2022a) and GIN-AK+ (Zhao et al., 2021), which aligns with our space complexity analysis.
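The memory ratios can be recomputed directly from the reported numbers. The snippet below copies the "Max Allocated Memory (MB)" values from Table 7; the dictionary layout and function name are illustrative, and the REDDIT-BINARY row assumes that the four reported values belong to GIN, GLNN, NOSMOG, and MuGSI, since KPGIN and GIN-AK+ did not finish on that dataset.

```python
# Max allocated memory (MB) copied from Table 7. KPGIN and GIN-AK+ have no
# reported memory on REDDIT-BINARY, so those entries are omitted.
MEMORY_MB = {
    "PROTEINS": {"NOSMOG": 217.63, "KPGIN": 70.17, "GIN-AK+": 2722.0, "MuGSI": 11.41},
    "REDDIT-BINARY": {"NOSMOG": 16538.79, "MuGSI": 121.72},
    "DD": {"NOSMOG": 7096.61, "KPGIN": 899.66, "GIN-AK+": 13601.92, "MuGSI": 88.17},
}


def memory_ratios(table, ours="MuGSI"):
    """Return {dataset: {baseline: ours_memory / baseline_memory}}."""
    return {
        dataset: {name: row[ours] / mem for name, mem in row.items() if name != ours}
        for dataset, row in table.items()
    }


for dataset, ratios in memory_ratios(MEMORY_MB).items():
    for name, ratio in ratios.items():
        print(f"{dataset:14s} {name:8s} {ratio:.1%}")
```

Printing the ratios per dataset makes it easy to see where the savings are largest, for example relative to NOSMOG on the larger graphs in DD and REDDIT-BINARY.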

A.6. Datasets Statistics

The dataset statistics are summarized in Table 8.

| Dataset | # Tasks | # Graphs | Avg. # Nodes | Avg. # Edges |
| --- | --- | --- | --- | --- |
| PROTEINS | 2 | 1113 | 39.06 | 72.82 |
| NCI1 | 2 | 4110 | 29.87 | 32.3 |
| BZR | 2 | 405 | 35.75 | 38.36 |
| DD | 2 | 1178 | 284.32 | 715.66 |
| REDDIT-BINARY | 2 | 2000 | 429.63 | 497.75 |
| IMDB-BINARY | 2 | 1000 | 19.77 | 96.53 |
| CIFAR10 | 10 | 60000 | 117.63 | 941.07 |
| MolHIV | 2 | 41127 | 25.5 | 27.5 |
Table 8. Dataset statistics