Federated Graph Semantic and Structural Learning

Wenke Huang1†    Guancheng Wan1†    Mang Ye1,2∗&Bo Du1,2 1School of Computer Science, Wuhan University, Wuhan, China
2 Hubei Luojia Laboratory, Wuhan, China
{wenkehuang, guanchengwan, yemang, dubo}@whu.edu.cn
Abstract

Federated graph learning collaboratively learns a global graph neural network with distributed graphs, where the non-independent and identically distributed property is one of the major challenge. Most relative arts focus on traditional distributed tasks like images and voices, incapable of the graph structures. This paper firstly reveals that local client distortion is brought by both node-level semantics and graph-level structure. First, for node-level semantic, we find that contrasting nodes from distinct classes is beneficial to provide a well-performing discrimination. We pull the local node towards the global node of the same class and push it away from the global node of different classes. Second, we postulate that a well-structural graph neural network possesses similarity for neighbors due to the inherent adjacency relationships. However, aligning each node with adjacent nodes hinders discrimination due to the potential class inconsistency. We transform the adjacency relationships into the similarity distribution and leverage the global model to distill the relation knowledge into the local model, which preserves the structural information and discriminability of the local model. Empirical results on three graph datasets manifest the superiority of the proposed method over counterparts.

11footnotetext: Corresponding Author22footnotetext: Equal Contribution

1 Introduction

Federated learning (FL) has shown considerable potential in collaborative machine learning across distributed devices without disclosing privacy Wan et al. (2024a). Although FL has attracted wide research interest and witnessed remarkable progress Huang et al. (2023b), most of them focus on the tasks like images and voices on the basis of CNN and transformer He et al. (2016); Vaswani et al. (2017). However, many real-world applications generate structured graphical data (e.g., knowledge graph, social network Liu et al. (2024)), consisting of vertices and edges Panagopoulos et al. (2021), while CNN and transformer can not deal with them effectively due to the inability to capture the topological structure Kipf and Welling (2017). For these graph applications, graph neural networks (GNN) have won praise for their impressive performance Wan et al. (2024b) because they utilize both the independence of nodes and the unique structure of graphs to mine graph data. Therefore, for the purpose of handling the graph data across multiple participants with growing privacy concerns, Federated Graph Learning (FGL) has become a promising paradigm Fu et al. (2022).

Refer to caption
Figure 1: Problem illustration. We present the structure and semantic level similarities among clients. The deeper color suggests a more similar representation of node and graph across different participants, while the shallower mean dissimilarity. (a) Semantic bias: clients show the inconsistency predicted class of node. (b) Structure bias: clients hold distinct similarities among neighborhood nodes. In this work, we conduct semantics-level and structure-level calibration to achieve better federated graph learning performance.

Notably, data heterogeneity has become an inescapable problem in Federated Learning Kairouz et al. (2019); Huang et al. (2022a). Specifically, the data distribution among different parties presents non-IID (identically and independently distributed) property, which results in the divergence of local direction Huang et al. (2023a, 2022b). Although existing methods have made efforts to restrain the local model with respect to the global model, they mainly design for typical data heterogeneity without special consideration for graph structural bias. However, previous work has demonstrated the importance of exploiting structural knowledge Liu et al. (2023a, 2022b). In this paper, we preliminarily investigate the unique characteristics of graph heterogeneity and find that there exists node semantic and graph structural bias in FGL setting. In detail, we leverage the Centered Kernel Alignment (CKA) Kornblith et al. (2019) to measure the similarity between node representations given by different client model pairs. Especially, for structure heterogeneity, we utilize the Anonymous Walk Embedding (AWE) to generate a representation for the graph and exploit the Jensen-Shannon distance between pair of graphs to measure the discrepancy. We reveal that there exists severe divergence on both node-level semantic and graph-level structure among clients (Fig. 1). Therefore, a crucial problem for FGL is that How to calibrate both node-level semantic and graph-level structure bias in Federated Graph Learning?

For node-level semantic calibration, we argue that a well-discriminative graph neural network contrasts nodes from different classes to provide a clear decisional boundary. Inspired by the success of supervised contrastive learning Khosla et al. (2020); Liu et al. (2023b), we naturally expect to conduct pull-push operations among different classes on local model to acquire a well-performing decisional ability. However, under federated learning, the local GNN model purely optimizes on private data and drifts towards a distinct local minimum, which means solely relying on the guidance signals provided by the local model is confusing and unreliable. Prior studies Hu et al. (2022), attempt to reweight local model during the FL aggregation process, but this sheds little light on identity node semantics bias caused by data heterogeneity. In this work, we investigate the node semantic knowledge and calibrate it during the local training process. We propose Federated Node Semantic Contrast (FNSC), which encourages the query local node representation to be close to global node embeddings within same class and pushes it away from global node embeddings of different classes.

Besides, for graph-level structural calibration, local clients normally possess a graph that is incomplete and is biased to depict the graph structure. Existing works normally focus on reconstructing the graph structure to handle the local structural bias. For example, FedStar Tan et al. (2023) decouples the structure information and encodes it in a personalized way, which brings extra model parameters for local updating. In this work, without more communication cost, we take a free ride to convert stiff graph structure reconstruction into structural relationship maintenance via the given global model during local training. We introduce Federated Graph Structure Distillation (FGSD). In detail, for each node, we leverage the global model to calculate the similarity of each node with its neighborhoods, based on the adjacency matrix. Then, we require the local model to generate the adjacent node similarity and mimic the global one, which leverages the global model to provide beneficial structural knowledge.

In a nutshell, we propose a novel Federated Graph Semantic and Structural Learning method (FGSSL). Our contributions are summarized as follows:

  • We are the first in FGL to decouple the data heterogeneity setting to node semantic level and graph structural level bias respectively. From this perspective, we can ameliorate final degraded performance by calibrating the local training drift, which sheds good light on future research in solving the non-IID problem in FGL scenarios.

  • We introduce a novel federated graph learning (FGSSL) frame for both node and graph-level calibration. The former Federated Node Semantic Contrast calibrates local node semantics with the assistance of the global model without compromising privacy. The latter Federated Graph Structure Distillation transforms the adjacency relationships from the global model to the local model, fully reinforcing the graph representation with aggregated relation.

  • We conduct extensive experiments on benchmark datasets to verify that FGSSL achieves superior performance over related methods. Taking a free ride with the global model, it does not introduce additional communication rounds and shows stronger privacy since it does not require additional shared sensitive prior information.

2 Related Work

2.1 Federated Graph Learning

Federated graph learning (FGL) facilitates the distributed training of graph neural networks (GNN). Previous literature on FGL can be categorized into two types: inter-graph and intra-graph. Inter-graph FGL involves each participant possessing a set of graphs and collectively participating in federated learning (FL) to improve the modeling of local data or generate a generalizable model Xie et al. (2021). In contrast, intra-graph FGL involves each participant owning only a subset of the entire graph and the objective is to address missing links Zhang et al. (2021) or discover communities Baek et al. (2022). However, both of them are confronted with the non-IID issue which degrades the collaboratively learned model performance. Conventional methods solving the non-IID in FL field (e.g., FedProx Li et al. (2020) and MOON Li et al. (2021)) meet the absence of design for FGL scenarios. Some preceding methods are dedicated to handling the non-IID problem for FGL. FedGCN Hu et al. (2022) tries to reweight local model parameters via an attention mechanism. FILT+ Zhu et al. (2021a) pulls the local model closer to the global model by minimizing the loss discrepancy between a local model and the global model. However, they focus on leveraging the issue from model respect and fail to effectively exploit the unique characteristics of the graph data. In this paper, we consider inter-graph FGL and deal with the non-IID via exploiting the graphic characteristics and decoupling into node-level semantic and graph-level structure calibration.

2.2 Contrastive Learning on Graphs

In recent years, contrastive learning has seen a resurgence of interest in the field of visual representation learning He et al. (2020); Chen et al. (2020). This success has spurred a wealth of research exploring the adaptation of contrastive learning to graph-like data for self-supervised methods Zhu et al. (2021b); Liu et al. (2022a). Traditional unsupervised methods on graph representation learning approaches Grover and Leskovec (2016); Perozzi et al. (2014), adhere to a contrastive structure derived from the skip-gram model. The graph autoencoder (GAE) Kipf and Welling (2016) is a self-supervised learning technique that aims to reconstruct the graph structure while The MVGRL Hassani and Khasahmadi (2020) intends to do node diffusion and compare node representation to augmented graph representation in order to learn both node-level and graph-level representation. Similar to SimCLR Chen et al. (2020), GRACE Zhu et al. (2020) constructs two augmented views of a graph by randomly perturbing nodes and edges, and subsequently learns node representations by pushing apart representations of every other node while bringing together representations of the same node in the two different augmented graphs within the same network. Apart from self-supervised tasks, SupCon Khosla et al. (2020) firstly extend the self-supervised batch contrastive approach to the fully-supervised setting. In this work, we examine the contrastive method in distributed systems and conduct a inter-view based contrast between the global and local models respectively. Moreover, we consider the supervised contrast that leveraging the label as a signal to choose positive samples for calibrating the node embedding to be more similar to the global node embedding.

2.3 Knowledge Distillation

Knowledge Distillation (KD) Hinton et al. (2015) is a technique that has been extensively studied and applied in various areas of machine learning, including image classification, natural language processing, and graph representation learning. The key aspect of KD is transferring knowledge from a complex and powerful teacher model to a more limited student model. In many works, knowledge distillation is typically used to train a smaller student network under the guidance of a larger teacher network with minimal to no performance degradation Wang and Yoon (2021). In practice, knowledge distillation forces the feature or logit output of the student network to be similar to that of the teacher network. Researchers have attempted to improve knowledge distillation methods by introducing new techniques such as model distillation Mullapudi et al. (2019), feature distillation Romero et al. (2015), and relation distillation Park et al. (2019). In this work, we focus on ameliorating the heterogeneity of graph structure by adapting relation-based KD techniques for the FGL domain. We first transform the adjacency relationships into similarity distribution from global view, then distill them into the local model. In this way, we leverage aggregated contextual neighborhood information from global view and calibrate the drift caused by graph structure from the locally biased data.

3 Methodology

3.1 Preliminaries

Graph Neural Newrok. Graph neural networks (GNN), e.g., graph convolutional networks (GCN) Kipf and Welling (2017) and Graph Attention Networks (GAT) (Veličković et al. (2017)), improved the state-of-the-art in informative graph data with their elegant yet powerful designs. In general, given the structure and feature information of a graph 𝒢=(V,A,X)𝒢𝑉𝐴𝑋\mathcal{G}=(V,A,X)caligraphic_G = ( italic_V , italic_A , italic_X ), where V𝑉Vitalic_V, A𝐴Aitalic_A, X𝑋Xitalic_X denote nodes, adjacency matrix and node feature respectively, GNN targets to learn the representations of graphs, such as the node embedding hidsubscript𝑖superscript𝑑h_{i}\in\mathbb{R}^{d}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. A GNN typically involves two steps: the processes of message propagation and neighborhood aggregation. In this process, each node in the graph iteratively collects information from its neighbors with its own information in order to update and refine its representation. Generally, an L𝐿Litalic_L-layer GNN can be formulated as

hi(l+1)=σ(hi(l),𝐴𝐺𝐺({hj(l);jAi})),l[L],formulae-sequencesubscriptsuperscript𝑙1𝑖𝜎subscriptsuperscript𝑙𝑖𝐴𝐺𝐺subscriptsuperscript𝑙𝑗𝑗subscript𝐴𝑖for-all𝑙delimited-[]𝐿h^{(l+1)}_{i}=\sigma(h^{(l)}_{i},\mathit{AGG}(\{h^{(l)}_{j};j\in A_{i}\})),% \forall l\in[L],italic_h start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_σ ( italic_h start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_AGG ( { italic_h start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ; italic_j ∈ italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ) ) , ∀ italic_l ∈ [ italic_L ] , (1)

where hi(l)subscriptsuperscript𝑙𝑖h^{(l)}_{i}italic_h start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the representation of node v𝑣vitalic_v at the lthsuperscript𝑙𝑡l^{th}italic_l start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT layer, and hi(0)=visubscriptsuperscript0𝑖subscript𝑣𝑖h^{(0)}_{i}=v_{i}italic_h start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the node feature. Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is defined as the neighbors of node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, 𝐴𝐺𝐺()𝐴𝐺𝐺\mathit{AGG(\cdot)}italic_AGG ( ⋅ ) is a aggregation function that can vary for different GNN variants, and σ𝜎\sigmaitalic_σ means a activation function.

After L message-passing layers, the final node embedding hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is passed to a project head F𝐹Fitalic_F to obtain logits:

zi=F(hi).subscript𝑧𝑖𝐹subscript𝑖z_{i}=F(h_{i}).italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_F ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (2)

In this paper, we examine proposed FGSSL in node-level tasks (e.g., node classification), and F𝐹Fitalic_F is defined as the classifier head. Specially, we utilize L-1 layers as GNN feature extractor and the L layer as F𝐹Fitalic_F.

3.1.1 Centralized Aggregation

In vanilla FL setting there is always a central server with M𝑀Mitalic_M clients, the m-th𝑚-thm\text{-th}italic_m -th client owns a private dataset Dmsuperscript𝐷𝑚D^{m}italic_D start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT and |D|𝐷|D|| italic_D | is the total size of samples over all clients. FedAvg McMahan et al. (2017) is a foundational algorithm in the field of federated learning, which serves as a starting point for the design of more advanced FL frameworks. It operates by aggregating the updated model parameters from individual clients and redistributing average of these parameters back to all clients:

θm=1M|Dm||D|θm.𝜃superscriptsubscript𝑚1𝑀superscript𝐷𝑚𝐷superscript𝜃𝑚{\theta}\leftarrow\sum_{m=1}^{M}\frac{\left|D^{m}\right|}{|D|}\theta^{m}.italic_θ ← ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT divide start_ARG | italic_D start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT | end_ARG start_ARG | italic_D | end_ARG italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT . (3)

In this study, we utilize the Federated Learning (FL) framework to enable collaborative learning on isolated graphs among multiple data owners, without the need to share raw graph data. By doing so, we aim to obtain a global node classifier. Specifically, when model parameters are set to θ𝜃\thetaitalic_θ for the Graph Neural Network (GNN) encoder and classifier F𝐹Fitalic_F, we formalize the global objective:

argmin1MmMm(θm;Dm).1𝑀superscriptsubscript𝑚𝑀superscript𝑚superscript𝜃𝑚superscript𝐷𝑚\arg\min\frac{1}{M}\sum_{m}^{M}\mathcal{L}^{m}(\theta^{m};D^{m}).roman_arg roman_min divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT caligraphic_L start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ; italic_D start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) . (4)

Normally, the loss function msuperscript𝑚\mathcal{L}^{m}caligraphic_L start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT in Eq. 5 is cross-entropy loss as each node which is optimized with softmax operation:

iCE=𝟏cilog(softmax(𝒛𝒊)),superscriptsubscript𝑖CEsubscript1subscript𝑐𝑖softmaxsubscript𝒛𝒊\small\mathcal{L}_{i}^{\textsl{CE}}=-\bm{1}_{c_{i}}\log(\texttt{softmax}({\bm{% z_{i}}})),\vspace{-3pt}caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CE end_POSTSUPERSCRIPT = - bold_1 start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log ( softmax ( bold_italic_z start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT ) ) , (5)

where 𝟏cisubscript1subscript𝑐𝑖\bm{1}_{c_{i}}bold_1 start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes the one-hot encoding of the label cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

3.2 Motivation

Commonly, federated graph learning aims at training a shared global GNN model, where clients have their own graphs and do not expose private data. In real-world applications, heterogeneous data distribution exists among clients. Therefore, clients present divergent optimization directions, which impair the performance of the global GNN model. We also show that this client divergence manifests in node-level semantics and graph-level structure aspects. We leverage the pairwise Centered Kernel Alignment (CKA) Kornblith et al. (2019) and calculate the similarity between arbitrary GNN models on the same input testing samples. CKA generates the similarity score ranging from 0 (not at all similar) to 1 (identical). We select 20 clients and train the local GNN model for 100 epochs, simultaneously taking the node output from different models as node representation. As shown in Fig. 1, considering both node semantics and graph structure calibration into account is beneficial to learning a better shared GNN model.

3.3 Proposed Method

Federated Node Semantic Contrast. Generally, the goal of node classification is to identify all samples. Thus, the GNN module should maintain the discernible patterns. Inspired by the success of supervised contrastive learning, we naturally expect to contrast the node features of different classes. For the local model, we pull the node feature vectors closer to the positive samples from the same semantics and push them far away from negativeness with distinct classes. Specifically, for the node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, its embedding himsubscriptsuperscript𝑚𝑖h^{m}_{i}italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT generated by local GNN encoder Gm()superscript𝐺𝑚G^{m}(\cdot)italic_G start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( ⋅ ) with its ground truth cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the positive samples are other nodes belonging to the same class cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, while the negatives are the nodes from the different classes 𝒞\ci\𝒞subscript𝑐𝑖\mathcal{C}\backslash c_{i}caligraphic_C \ italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Our supervised, local node-wise contrastive loss is defined as:

iCON=1|𝑷i|p𝑷ilogφ(him,hpm,τ)φ(him,hpm,τ)+k𝑲iφ(him,hkm,τ),superscriptsubscript𝑖𝐶𝑂𝑁1subscript𝑷𝑖subscript𝑝subscript𝑷𝑖𝜑subscriptsuperscript𝑚𝑖subscriptsuperscript𝑚𝑝𝜏𝜑subscriptsuperscript𝑚𝑖subscriptsuperscript𝑚𝑝𝜏subscript𝑘subscript𝑲𝑖𝜑subscriptsuperscript𝑚𝑖subscriptsuperscript𝑚𝑘𝜏\small\mathcal{L}_{i}^{CON}=\frac{-1}{\left|\bm{P}_{i}\right|}\sum_{p\in\bm{P}% _{i}}\log\frac{{\varphi({h}^{m}_{i},{h}^{m}_{p},\tau)}}{{\varphi({h}^{m}_{i},{% h}^{m}_{p},\tau)}+\sum_{{k}\in\bm{K}_{i}}{\varphi({h}^{m}_{i},{h}^{m}_{k},\tau% )}},caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C italic_O italic_N end_POSTSUPERSCRIPT = divide start_ARG - 1 end_ARG start_ARG | bold_italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_p ∈ bold_italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log divide start_ARG italic_φ ( italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_τ ) end_ARG start_ARG italic_φ ( italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_τ ) + ∑ start_POSTSUBSCRIPT italic_k ∈ bold_italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_φ ( italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ ) end_ARG , (6)

where 𝑷isubscript𝑷𝑖\bm{P}_{i}bold_italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝑲isubscript𝑲𝑖\bm{K}_{i}bold_italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denote the collections of the positive and negative samples sets for the node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We define the τ𝜏\tauitalic_τ as a contrastive hyper-parameter and φ𝜑\varphiitalic_φ is formulated as:

φ(hi,hj,τ)=exp(hihjhihj/τ).𝜑subscript𝑖subscript𝑗𝜏𝑒𝑥𝑝subscript𝑖subscript𝑗normsubscript𝑖normsubscript𝑗𝜏\begin{split}\varphi(h_{i},h_{j},\tau)=exp(\frac{h_{i}\cdot h_{j}}{||h_{i}||\ % ||h_{j}||}/\tau).\end{split}start_ROW start_CELL italic_φ ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_τ ) = italic_e italic_x italic_p ( divide start_ARG italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG | | italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | | | | italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | | end_ARG / italic_τ ) . end_CELL end_ROW (7)

However, it is widely known that private models present drift from the ideal global optima. Thus, naively leveraging the private model to provide the positive and negative sets would further skew the local optimization direction. In our work, we argue that the shared global model aggregates knowledge from multiple parties and presents less bias than the local model. In this paper, we propose Federated Node Semantic Contrast (FNSC), which leverages the global model to provide positive and negative cluster representations for each local node embedding. We further reformulate the aforementioned supervised node contrastive learning as follows:

iFNSC=1|𝑷i|p𝑷ilogφ(him,hpg,τ)φ(him,hpg,τ)+k𝑲iφ(him,hkg,τ),superscriptsubscript𝑖FNSC1subscript𝑷𝑖subscript𝑝subscript𝑷𝑖𝜑subscriptsuperscript𝑚𝑖subscriptsuperscript𝑔𝑝𝜏𝜑subscriptsuperscript𝑚𝑖subscriptsuperscript𝑔𝑝𝜏subscript𝑘subscript𝑲𝑖𝜑subscriptsuperscript𝑚𝑖subscriptsuperscript𝑔𝑘𝜏\small\mathcal{L}_{i}^{\textsl{FNSC}{}}=\frac{-1}{\left|\bm{P}_{i}\right|}\sum% _{p\in\bm{P}_{i}}\log\frac{{\varphi({h}^{m}_{i},{h}^{g}_{p},\tau)}}{{\varphi({% h}^{m}_{i},{h}^{g}_{p},\tau)}+\sum_{{k}\in\bm{K}_{i}}{\varphi({h}^{m}_{i},{h}^% {g}_{k},\tau)}},caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT FNSC end_POSTSUPERSCRIPT = divide start_ARG - 1 end_ARG start_ARG | bold_italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_p ∈ bold_italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log divide start_ARG italic_φ ( italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_τ ) end_ARG start_ARG italic_φ ( italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_τ ) + ∑ start_POSTSUBSCRIPT italic_k ∈ bold_italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_φ ( italic_h start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ ) end_ARG , (8)

where hgsuperscript𝑔h^{g}italic_h start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT denotes the the node embedding generated by the GNN encoder Gg()superscript𝐺𝑔G^{g}(\cdot)italic_G start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( ⋅ ) Moreover, given the node embedding himsuperscriptsubscript𝑖𝑚h_{i}^{m}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT generated by local GNN encoder Gm()superscript𝐺𝑚G^{m}(\cdot)italic_G start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( ⋅ ), we pull the node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from local view and its pairwise one higsuperscriptsubscript𝑖𝑔h_{i}^{g}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT in global view together, simultaneously pull it and nodes from global view with the same class cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT together.

Notably, the recent success of contrastive learning in image or video processing is largely due to carefully designed image augmentations Ye et al. (2022, 2019). These augmentations allow the model to explore a wider range of underlying semantic information and obtain better performance. In this section, we adopt a similar strategy for graph data by using an augmentation module, denoted by Aug()𝐴𝑢𝑔Aug(\cdot)italic_A italic_u italic_g ( ⋅ ), to generate two different views of the graph. Prior research has produced various methods for graph augmentation, which can be divided into two categories: topology (structure) transformation and feature transformation (e.g., Edge Removing and Feature Masking) Zhu et al. (2021b, 2020). In order to enforce local clients to acquire a well-discriminative ability, we leverage both augmentations in our augmentation modules. Furthermore, we propose an asymmetric design for the contrast process, which utilizes stronger augmentations for the local GNN and weaker augmentations for the global GNN, given by 𝒢~1=Augs(𝒢)subscript~𝒢1𝐴𝑢subscript𝑔𝑠𝒢\widetilde{\mathcal{G}}_{1}=Aug_{s}(\mathcal{G})over~ start_ARG caligraphic_G end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_A italic_u italic_g start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( caligraphic_G ) for strong Aug()𝐴𝑢𝑔Aug(\cdot)italic_A italic_u italic_g ( ⋅ ) and 𝒢~2=Augw(𝒢)subscript~𝒢2𝐴𝑢subscript𝑔𝑤𝒢\widetilde{\mathcal{G}}_{2}=Aug_{w}(\mathcal{G})over~ start_ARG caligraphic_G end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_A italic_u italic_g start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( caligraphic_G ) for weak Aug()𝐴𝑢𝑔Aug(\cdot)italic_A italic_u italic_g ( ⋅ ). This would give local clients great strength to optimize towards the global direction, meanwhile, the global model can provide stable contextual semantic information to local training process. We further demonstrate the effectiveness of this asymmetric augmentation strategy in Tab. 3.

Figure 2: Architecture illustration of Federated Graph Semantic and Structural Learning (FGSSL). The left yellow box corresponds to the federated aggregation scheme (e.g. FedAvg), while the right grey box suggests the local training process. FGSSL includes two components: (a) Federated Node Semantic Contrast and (b) Federated Graph Structure Distillation. Best viewed in color. Zoom in for details.

Federated Graph Structure Distillation. For graph-level calibration, it is normally assumed that adjacent nodes will share similar representations. However, under federated learning, each client fails to effectively depict this relationship because local data is normally incomplete. The straightforward solution is to directly align the query local node feature with the neighborhood nodes from the global model. However, it could potentially disrupt the discriminability because neighboring nodes probably belong to different classes. Motivated by similarity knowledge distillation Fang et al. (2021); Tejankar et al. (2021), we propose Federated Graph Structure Distillation (FGSD) to overcome semantic inconsistency of adjacent nodes, which maintains graphic structure knowledge via the support of the global model. We measure the similarity of the query node with neighboring nodes from the global model output and then optimize the local network to mimic the similarity distribution from the global view. Specifically, for the node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the logit output given by the node classifier F𝐹Fitalic_F, we denote the Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as the neighborhood node set and define the Sg(vi,Ai)superscript𝑆𝑔subscript𝑣𝑖subscript𝐴𝑖S^{g}(v_{i},A_{i})italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) as the similarity of the selected node vector with adjacent nodes computed by the global model:

Sg(vi,Ai)=[S1g,,S|Ai|g],Sjg=exp((zigzjgT)/ω)jAiexp((zigzjgT))/ω),\small\begin{split}S^{g}(v_{i},A_{i})=[S^{g}_{1},\ldots,S^{g}_{|A_{i}|}],\\ S^{g}_{j}=\frac{\exp\big{(}\big{(}z^{g}_{i}\cdot{z^{g}_{j}}^{T}\big{)}/\omega% \big{)}}{\sum_{j\in A_{i}}\text{exp}\big{(}\big{(}z^{g}_{i}\cdot{z^{g}_{j}}^{T% })\big{)}/\omega\big{)}},\end{split}start_ROW start_CELL italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = [ italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT | italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_POSTSUBSCRIPT ] , end_CELL end_ROW start_ROW start_CELL italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = divide start_ARG roman_exp ( ( italic_z start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_z start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) / italic_ω ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j ∈ italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT exp ( ( italic_z start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_z start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ) / italic_ω ) end_ARG , end_CELL end_ROW (9)

where ω𝜔\omegaitalic_ω is the distillation hyper-parameter, ()Tsuperscript𝑇(\ )^{T}( ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT means transpose operation. Then, we measure similarity distribution from the m𝑚mitalic_m local model, Sm(vi,Ai)superscript𝑆𝑚subscript𝑣𝑖subscript𝐴𝑖S^{m}(v_{i},A_{i})italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), which is formed by:

Sm(vi,Ai)=[S1m,,S|Ai|m],Sjm=exp((zimzjmT)/ω)jAiexp((zimzjmT))/ω).\small\begin{split}S^{m}(v_{i},A_{i})=[S^{m}_{1},\ldots,S^{m}_{|A_{i}|}],\\ S^{m}_{j}=\frac{\exp\big{(}\big{(}z^{m}_{i}\cdot{z^{m}_{j}}^{T}\big{)}/\omega% \big{)}}{\sum_{j\in A_{i}}\text{exp}\big{(}\big{(}z^{m}_{i}\cdot{z^{m}_{j}}^{T% })\big{)}/\omega\big{)}}.\end{split}start_ROW start_CELL italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = [ italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT | italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_POSTSUBSCRIPT ] , end_CELL end_ROW start_ROW start_CELL italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = divide start_ARG roman_exp ( ( italic_z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) / italic_ω ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j ∈ italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT exp ( ( italic_z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ) / italic_ω ) end_ARG . end_CELL end_ROW (10)

The FGSD (Federated Graph Structure Distillation) loss is calculated as the following:

iFGSD=Sg(vi,Ai)logSg(vi,Ai)Sm(vi,Ai).superscriptsubscript𝑖FGSDsuperscript𝑆𝑔subscript𝑣𝑖subscript𝐴𝑖superscript𝑆𝑔subscript𝑣𝑖subscript𝐴𝑖superscript𝑆𝑚subscript𝑣𝑖subscript𝐴𝑖\small\mathcal{L}_{i}^{\text{{FGSD}{}}}=S^{g}(v_{i},A_{i})\log\frac{S^{g}(v_{i% },A_{i})}{S^{m}(v_{i},A_{i})}.caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT FGSD end_POSTSUPERSCRIPT = italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log divide start_ARG italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG . (11)

Finally, the overall objective to be maximized is then formalized as the average across all nodes of the accumulation of the losses discussed above and is defined by:

=1Ni=1N(iCE+λCiFNSC+λDiFGSD).1𝑁superscriptsubscript𝑖1𝑁subscriptsuperscriptCE𝑖subscript𝜆𝐶superscriptsubscript𝑖FNSCsubscript𝜆𝐷superscriptsubscript𝑖FGSD\small\mathcal{L}=\frac{1}{N}\sum_{i=1}^{N}\left(\mathcal{L}^{\textsl{CE}}_{i}% +\lambda_{C}\mathcal{L}_{i}^{\text{{FNSC}{}}}+\lambda_{D}\mathcal{L}_{i}^{% \text{{FGSD}{}}}\right).caligraphic_L = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( caligraphic_L start_POSTSUPERSCRIPT CE end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT FNSC end_POSTSUPERSCRIPT + italic_λ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT FGSD end_POSTSUPERSCRIPT ) . (12)

To sum up, Federated Graph Semantic and Structural Learning FGSSL leverage the global model to simultaneously calibrate local model from the node-level semantics and graph-level structure, which effectively handles the heterogeneous graph data and learns a well-performing global GNN model. We further illustrate the FGSSL in the Algorithm 1.

Input: communication rounds T𝑇Titalic_T, local epochs E𝐸Eitalic_E, participant scale M𝑀Mitalic_M, mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT client private graph data 𝒢m(V,A,X;Y)superscript𝒢𝑚𝑉𝐴𝑋𝑌\mathcal{G}^{m}(V,A,X;Y)caligraphic_G start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_V , italic_A , italic_X ; italic_Y ), private model θmsuperscript𝜃𝑚\theta^{m}italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, temperature τ𝜏\tauitalic_τ, distillation parameter ω𝜔\omegaitalic_ω, loss weight λCsubscript𝜆𝐶\lambda_{C}italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT and λDsubscript𝜆𝐷\lambda_{D}italic_λ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT, learning rate η𝜂\etaitalic_η
Output: The final global model θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
for t=1,2,,T𝑡12𝑇t=1,2,...,Titalic_t = 1 , 2 , … , italic_T do
      
      Participant Side;
      
      for m=1,2,,M𝑚12𝑀m=1,2,...,Mitalic_m = 1 , 2 , … , italic_M in parallel do
             send the global model θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to m-th𝑚-thm\text{-th}italic_m -th client
            θtmsubscriptsuperscript𝜃𝑚𝑡absent\theta^{m}_{t}\leftarrowitalic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← LocalUpdating(θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, m𝑚mitalic_m)
       end for
      
      Server Side;
      
      θt+1m=1M|Dm||D|θtmsubscript𝜃𝑡1superscriptsubscript𝑚1𝑀superscript𝐷𝑚𝐷subscriptsuperscript𝜃𝑚𝑡\theta_{t+1}\leftarrow\sum_{m=1}^{M}\frac{\left|D^{m}\right|}{|D|}\theta^{m}_{t}italic_θ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT divide start_ARG | italic_D start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT | end_ARG start_ARG | italic_D | end_ARG italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
end for
return θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
LocalUpdating(θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, m𝑚mitalic_m):
Initialize Gg(),Fg()superscript𝐺𝑔superscript𝐹𝑔absentG^{g}(\cdot),F^{g}(\cdot)\leftarrowitalic_G start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( ⋅ ) , italic_F start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( ⋅ ) ← θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Initialize Gm(),Fm()superscript𝐺𝑚superscript𝐹𝑚absentG^{m}(\cdot),F^{m}(\cdot)\leftarrowitalic_G start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( ⋅ ) , italic_F start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( ⋅ ) ← θtsubscript𝜃𝑡\theta_{t}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT Freeze Gg(),Fgsuperscript𝐺𝑔superscript𝐹𝑔G^{g}(\cdot),F^{g}italic_G start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( ⋅ ) , italic_F start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT for  e=1,2,,E𝑒12𝐸e=1,2,...,Eitalic_e = 1 , 2 , … , italic_E do
       Z=θm(X)𝑍superscript𝜃𝑚𝑋Z=\theta^{m}(X)italic_Z = italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_X )
      LCECE(Z,Y)superscript𝐿𝐶𝐸𝐶𝐸𝑍𝑌L^{CE}\leftarrow CE(Z,Y)italic_L start_POSTSUPERSCRIPT italic_C italic_E end_POSTSUPERSCRIPT ← italic_C italic_E ( italic_Z , italic_Y ) in Eq. 5
      𝒢~1,𝒢~2Augs(𝒢),Augw(𝒢)formulae-sequencesubscript~𝒢1subscript~𝒢2𝐴𝑢subscript𝑔𝑠𝒢𝐴𝑢subscript𝑔𝑤𝒢\widetilde{\mathcal{G}}_{1},\widetilde{\mathcal{G}}_{2}\leftarrow Aug_{s}(% \mathcal{G}),Aug_{w}(\mathcal{G})over~ start_ARG caligraphic_G end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over~ start_ARG caligraphic_G end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ← italic_A italic_u italic_g start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( caligraphic_G ) , italic_A italic_u italic_g start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( caligraphic_G )
      Hm,HgGm(𝒢~1),Gg(𝒢~2)formulae-sequencesuperscript𝐻𝑚superscript𝐻𝑔superscript𝐺𝑚subscript~𝒢1superscript𝐺𝑔subscript~𝒢2H^{m},H^{g}\leftarrow G^{m}(\widetilde{\mathcal{G}}_{1}),G^{g}(\widetilde{% \mathcal{G}}_{2})italic_H start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , italic_H start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ← italic_G start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( over~ start_ARG caligraphic_G end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , italic_G start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( over~ start_ARG caligraphic_G end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )
      LFNSC(Hm,Hg)superscript𝐿𝐹𝑁𝑆𝐶superscript𝐻𝑚superscript𝐻𝑔L^{FNSC}\leftarrow(H^{m},H^{g})italic_L start_POSTSUPERSCRIPT italic_F italic_N italic_S italic_C end_POSTSUPERSCRIPT ← ( italic_H start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , italic_H start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) through Eq. 8
      Zm,ZgFm(Hm),Fg(Hg)formulae-sequencesuperscript𝑍𝑚superscript𝑍𝑔superscript𝐹𝑚superscript𝐻𝑚superscript𝐹𝑔superscript𝐻𝑔Z^{m},Z^{g}\leftarrow F^{m}(H^{m}),F^{g}(H^{g})italic_Z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , italic_Z start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ← italic_F start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_H start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) , italic_F start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_H start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT )
      Sg(V,A)(Zg)superscript𝑆𝑔𝑉𝐴superscript𝑍𝑔S^{g}(V,A)\leftarrow(Z^{g})italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_V , italic_A ) ← ( italic_Z start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) by Eq. 9
      Sm(V,A)(Zm)superscript𝑆𝑚𝑉𝐴superscript𝑍𝑚S^{m}(V,A)\leftarrow(Z^{m})italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_V , italic_A ) ← ( italic_Z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) by Eq. 10
      LFGSD(Sm(V,A),Sg(V,A))superscript𝐿𝐹𝐺𝑆𝐷superscript𝑆𝑚𝑉𝐴superscript𝑆𝑔𝑉𝐴L^{FGSD}\leftarrow(S^{m}(V,A),S^{g}(V,A))italic_L start_POSTSUPERSCRIPT italic_F italic_G italic_S italic_D end_POSTSUPERSCRIPT ← ( italic_S start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( italic_V , italic_A ) , italic_S start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ( italic_V , italic_A ) ) through Eq. 12
      =CE+λCFNSC+λDFGSDsuperscriptCEsubscript𝜆𝐶superscriptFNSCsubscript𝜆𝐷superscriptFGSD\mathcal{L}=\mathcal{L}^{\textsl{CE}}+\lambda_{C}\mathcal{L}^{\text{{FNSC}{}}}% +\lambda_{D}\mathcal{L}^{\text{{FGSD}{}}}caligraphic_L = caligraphic_L start_POSTSUPERSCRIPT CE end_POSTSUPERSCRIPT + italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT FNSC end_POSTSUPERSCRIPT + italic_λ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT FGSD end_POSTSUPERSCRIPT
      θmθmηsuperscript𝜃𝑚superscript𝜃𝑚𝜂\theta^{m}\leftarrow\theta^{m}-\eta\nabla\mathcal{L}italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ← italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT - italic_η ∇ caligraphic_L
end for
return θmsuperscript𝜃𝑚\theta^{m}italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT
Algorithm 1 The FGSSL Framework

4 Experiments

4.1 Experimental Setup

In this paper, we perform experiments on node-level tasks defined on graph data: we choose node classification to confirm the efficacy of FGSSL in various testing environments. The code is available at https://github.com/GuanchengWan/FGSSL.

  Cora    Citeseer    Pubmed
Methods M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10  M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10  M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10
Global 87.78 ±plus-or-minus\!\pm\!± 1.34    76.91 ±plus-or-minus\!\pm\!± 1.02    88.38 ±plus-or-minus\!\pm\!± 0.33
Local 61.54 ±plus-or-minus\!\pm\!± 0.83 45.32 ±plus-or-minus\!\pm\!± 1.52 32.42 ±plus-or-minus\!\pm\!± 2.81    73.85 ±plus-or-minus\!\pm\!± 1.20 62.87 ±plus-or-minus\!\pm\!± 2.45 48.91 ±plus-or-minus\!\pm\!± 2.34    83.81 ±plus-or-minus\!\pm\!± 0.69 72.34 ±plus-or-minus\!\pm\!± 0.79 59.19 ±plus-or-minus\!\pm\!± 1.31
FedAvg 86.63 ±plus-or-minus\!\pm\!± 0.35 86.21 ±plus-or-minus\!\pm\!± 0.21 86.01 ±plus-or-minus\!\pm\!± 0.17    76.37 ±plus-or-minus\!\pm\!± 0.43 76.57 ±plus-or-minus\!\pm\!± 0.46 75.92 ±plus-or-minus\!\pm\!± 0.21    85.29 ±plus-or-minus\!\pm\!± 0.83 84.27 ±plus-or-minus\!\pm\!± 0.29 84.57 ±plus-or-minus\!\pm\!± 0.29
FedProx 86.60 ±plus-or-minus\!\pm\!± 0.59 86.27 ±plus-or-minus\!\pm\!± 0.12 86.22 ±plus-or-minus\!\pm\!± 0.25    77.15 ±plus-or-minus\!\pm\!± 0.45 77.28 ±plus-or-minus\!\pm\!± 0.78 76.87 ±plus-or-minus\!\pm\!± 0.80    85.21 ±plus-or-minus\!\pm\!± 0.24 84.01 ±plus-or-minus\!\pm\!± 0.59 84.98 ±plus-or-minus\!\pm\!± 0.65
FedOpt 86.11 ±plus-or-minus\!\pm\!± 0.24 85.89 ±plus-or-minus\!\pm\!± 0.43 85.20 ±plus-or-minus\!\pm\!± 0.93    76.96 ±plus-or-minus\!\pm\!± 0.34 76.82 ±plus-or-minus\!\pm\!± 0.04 76.71 ±plus-or-minus\!\pm\!± 0.19    84.39 ±plus-or-minus\!\pm\!± 0.42 84.10 ±plus-or-minus\!\pm\!± 0.19 83.91 ±plus-or-minus\!\pm\!± 0.20
FedSage 86.86 ±plus-or-minus\!\pm\!± 0.15 86.59 ±plus-or-minus\!\pm\!± 0.23 86.32 ±plus-or-minus\!\pm\!± 0.37    77.91 ±plus-or-minus\!\pm\!± 0.59 77.82 ±plus-or-minus\!\pm\!± 0.13 77.30 ±plus-or-minus\!\pm\!± 0.71    87.75 ±plus-or-minus\!\pm\!± 0.23 87.51 ±plus-or-minus\!\pm\!± 0.20 87.49 ±plus-or-minus\!\pm\!± 0.09
FGSSL
88.34 ±plus-or-minus\!\pm\!± 0.34
\uparrow 1.71
88.56 ±plus-or-minus\!\pm\!± 0.43
\uparrow 2.35
88.01 ±plus-or-minus\!\pm\!± 0.26
\uparrow 2.00
 
80.43 ±plus-or-minus\!\pm\!± 0.23
\uparrow 4.06
80.21 ±plus-or-minus\!\pm\!± 0.11
\uparrow 3.64
80.01 ±plus-or-minus\!\pm\!± 0.09
\uparrow 4.09
 
88.25 ±plus-or-minus\!\pm\!± 0.60
\uparrow 2.96
87.75 ±plus-or-minus\!\pm\!± 0.41
\uparrow 3.48
87.60 ±plus-or-minus\!\pm\!± 0.53
\uparrow 2.73
Table 1: Comparison with the state-of-the-art methods on Cora, Citeseer and Pubmed datasets. The best result is bolded. \uparrow means improved accuracy compared with FedAvg. ±plus-or-minus\pm± presents the standard deviation. Please see details in Sec. 4.2.

Datasets. For node classification, our experiments are conducted on three benchmark datasets for the citation networks:

  • Cora McCallum et al. (2000) dataset consists of 2708 scientific publications classified into one of seven classes. There are 5429 edges in the network of citations. 1433 distinct words make up the dictionary.

  • Citeseer Giles et al. (1998) dataset consists of 3312 scientific publications classified into one of six classes and 4732 edges. The dictionary contains 3703 unique words.

  • Pubmed Sen et al. (2008) dataset consists of 19717 scientific papers on diabetes that have been categorized into one of three categories in the PubMed database. The citation network has 44338 edges in it. A word vector from a dictionary with 500 unique terms that is TF/IDF weighted is used to describe each publication in the dataset.

Network Structure. Since the GAT Veličković et al. (2017) is a powerful and widely used benchmark network in graph representing learning, we realize two layers GAT with parameter θ𝜃\thetaitalic_θ, decoupling it into feature extractor G()𝐺G(\cdot)italic_G ( ⋅ ) and unified classifier F()𝐹F(\cdot)italic_F ( ⋅ ). The hidden dimensions are 128 for all datasets, and classifier F𝐹Fitalic_F maps the embedding from 128 dimensions to 7,6,3 dimensions, which is the number of classification classes for Cora, Citeseer, and Pubmed respectively.

Graph Augmentation Strategy. Generating views is a key component of contrastive learning methods. In the graph domain, different views of a graph provide different contexts for each node. We follow augmentation mentioned in Zhu et al. (2021b), Zhu et al. (2020) to construct a contrastive learning scheme. In FGSSL, we leverage two methods for new graph view generation, removing edges for topology and masking features for node attributes.

  • Removing edges (RE). It randomly removes a portion of edges in the original graph.

  • Masking node features (MF). It randomly masks a fraction of dimensions with zeros in node features.

Refer to caption
Figure 3: Visualization of training curves of the average test accuracy with Communication Epochs 200 with Citeseer dataset. Please see Sec. 4.2 for details.

Implement Details. We utilize the community detection algorithm: Louvain, to simulate the subgraph systems. To stimulate the non-iid scene, this algorithm partitions the graph into multiple clusters and then assigns them to distributed clients. To conduct the experiments uniformly and fairly, we split the nodes into train/valid/test sets, where the ratio is 60% : 20% : 20% . As for all networks, we use SGD Robbins and Monro (1951) as the selected optimizer with momentum 0.90.90.90.9 and weight decay 5e45𝑒45e-45 italic_e - 4. The communication round is 200200200200 and the local training epoch is 4444 for all datasets. The metric used in our experiments is the node classification accuracy on the testing nodes and we report the averaged accuracy and the standard deviation over several random repetitions.

Counterparts. (1) Local each client train their model locally, (2) Global the server leverage the complete graph for training. For rigorous evaluation, we compare our FGSSL against popular federated strategies in FGL setting. (3) FedAvg (AISTATS’17 McMahan et al. (2017)), (4) FedProx (MLSys’21 Li et al. (2020)), (5) FedOpt (ICLR’21 Reddi et al. (2021)), (6) FedSage (NeurIPS’21 Zhang et al. (2021)).

4.2 Experimental Results

Performance Comparison. The results of federated node classification for various methods under three non-IID settings are presented in Tab. 1. These results indicate that FGSSL outperforms all other baselines and demonstrates a significant and consistent improvement compared to the conventional FedAvg algorithm in the FGL setting. Additionally, personalized FL algorithms such as FedProx and FedOpt demonstrate better performance than vanilla aggregation by utilizing a universal solution to the non-IID problem. Specialized methods in the FGL field such as FedSage also perform better than common baselines, which is achieved through the simultaneous training of generative models for predicting missing links.

Convergence Analysis. Fig. 3 shows curves of the average test accuracy during the training process across five random runs conducted on the Citeseer datasets. It can be observed that FGSSL dominates the other methods in non-IID setting on the average test accuracy and achieves a stable convergence.

4.3 Ablation Study

Refer to caption
(a) Contrastive parameter τ𝜏\tauitalic_τ
Refer to caption
(b) Distillation parameter ω𝜔\omegaitalic_ω
Figure 4: Analysis on hyper-parameter in FGSSL. Node classification results on three datasets under different τ𝜏\tauitalic_τ and ω𝜔\omegaitalic_ω values with M = 5, in which green represents Cora, yellow represents Pubmed, and blue represents Citeseer. Refer to Tab. 2 for details.
  Cora    Citeseer
FNSC FGSD M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10  M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10
86.63 86.21 86.01    76.37 76.57 75.92
86.86 86.32 86.51    77.91 77.53 76.42
88.01 88.23 87.84    79.89 79.43 79.12
88.34 88.56 88.01  80.43 80.21 80.01
Table 2: Ablation study of key components of our method in Cora and Citeseer datasets with clients 5/7/10. See Tab. 2 for details.

Effects of Key Components Mechanism. To better understand the impact of specific design components on the overall performance of FGSSL, we conducted an ablation study in which we varied these components. For the variant without FNSC and FGSD, we utilize the vanilla FGL setting with 2-layer GAT. As shown in Tab. 2, by exploiting both components, the best performance is achieved in all three graph datasets. It also suggests that FNSC plays a more crucial role than FGSD, which means the calibration in node semantics is stronger than the calibration in graph structure, and feature heterogeneity is more serious than graph heterogeneity in non-IID setting. Moreover, the contribution made by FGSD is still not negligible and can benefit the learning process.

Hyper-parameter study. We compare the downstream task performance under different τ𝜏\tauitalic_τ and ω𝜔\omegaitalic_ω values with five clients. Results are shown in Tab. 1, where Fig. 4(a) shows results when ω𝜔\omegaitalic_ω is fixed at 5, and Fig. 4(b) shows results under τ=0.1𝜏0.1\tau=0.1italic_τ = 0.1. It indicates that choosing τ𝜏\tauitalic_τ can affect the strength of the contrastive method, where a smaller temperature benefits training more than higher ones, but extremely low temperatures (0.01) are harder to train due to numerical instability. Across different datasets, the optimal τ𝜏\tauitalic_τ is constantly around 0.1. For choosing an appropriate τ𝜏\tauitalic_τ in (Eq. 9 and Eq. 10.), we find that the performance is not influenced much unless ω𝜔\omegaitalic_ω is set to extreme values like 0.1.

Discussion on Augmentation Strategies. As demonstrated in Tab. 3, different augmentation strategies were implemented within the augmentation module of proposed method. The experimental results indicate that utilizing two levels of augmentation improves performance. Specifically, on the one hand, using double-weak augmentation strategies did not result in a significant improvement when compared to baseline methods. On the other hand, double-strong augmentation strategies led to improved results as they allowed for exploration of rich semantic information through the supervised contrastive method. Additionally, the combination of strong and weak augmentation strategies at local and global levels, respectively, resulted in the highest overall performance, in accordance with our descriptions of them in Sec. 3.3.

Refer to caption
(a) FedAvg
Refer to caption
(b) FedSage
Refer to caption
(c) FGSSL
Figure 5: Visualization of classification result. The figure number corresponds to the method on the Citeseer dataset with m=5𝑚5m=5italic_m = 5. Logits are colored based on class labels.
  Cora    Citeseer
Local Global M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10  M=5𝑀5M\!=\!5italic_M = 5 M=7𝑀7M\!=\!7italic_M = 7 M=10𝑀10M\!=\!10italic_M = 10
weak weak 87.24 87.10 86.99    77.72 77.45 77.30
weak strong 86.86 86.68 86.48    77.22 77.09 76.33
strong strong 87.91 87.93 87.52    79.59 79.12 78.81
strong weak 88.01 88.23 87.84  79.89 79.43 79.12
Table 3: Analysis on augmentation strategies : Effect of using weak or strong augmentations for two datasets trained on the sole FNSC component with 200 epochs. See Sec. 4.3 for details.

5 Conclusion

In this paper, we propose a novel federated graph learning framework, namely FGSSL, that mitigates the non-IID issues via appropriately calibrating the heterogeneity both on the node-level semantic and graph-level structure. We develop two key components to solve the problems respectively. On the one hand, we leverage the contrastive-based method to correct the drift node semantics from the global ones that have Identical semantic information and achieve a high level of semantic discrimination at node level. On the other hand, we consider transforming adjacency relationships into a similarity distribution and utilizing a global model to distill this information into the local model, which maintains the structural information and corrects the structure heterogeneity. Experimental results illustrate that FGSSL consistently outperforms the state-of-the-art methods in federated graph scenarios.

Acknowledgments

This work is partially supported by National Natural Science Foundation of China under Grant (62176188, 62225113), the Key Research and Development Program of Hubei Province (2021BAA187), Zhejiang lab (NO.2022NF0AB01), CCF-Huawei Populus Grove Fund (CCF-HuaweiTC2022003), the Special Fund of Hubei Luojia Laboratory (220100015) and the Science and Technology Major Project of Hubei Province (Next-Generation AI Technologies) under Grant (2019AEA170).

Contribution Statement

Wenke Huang and Guancheng Wan contributed equally to this work.

References

  • Baek et al. [2022] **heon Baek, Wonyong Jeong, Jiongdao **, Jaehong Yoon, and Sung Ju Hwang. Personalized subgraph federated learning. arXiv preprint arXiv:2206.10206, 2022.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
  • Fang et al. [2021] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. Seed: Self-supervised distillation for visual representation. In ICLR, 2021.
  • Fu et al. [2022] Xingbo Fu, Binchi Zhang, Yushun Dong, Chen Chen, and Jundong Li. Federated graph machine learning: A survey of concepts, techniques, and applications. arXiv preprint arXiv:2207.11812, 2022.
  • Giles et al. [1998] C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, pages 89–98, 1998.
  • Grover and Leskovec [2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In ACM SIGKDD, pages 855–864, 2016.
  • Hassani and Khasahmadi [2020] Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In ICML, pages 4116–4126, 2020.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Hu et al. [2022] Kai Hu, Jiasheng Wu, Yaogen Li, Meixia Lu, Liguo Weng, and Min Xia. Fedgcn: Federated learning-based graph convolutional networks for non-euclidean spatial data. Mathematics, 10(6):1000, 2022.
  • Huang et al. [2022a] Wenke Huang, Mang Ye, and Bo Du. Learn from others and be yourself in heterogeneous federated learning. In CVPR, 2022.
  • Huang et al. [2022b] Wenke Huang, Mang Ye, Bo Du, and Xiang Gao. Few-shot model agnostic federated learning. In ACM MM, pages 7309–7316, 2022.
  • Huang et al. [2023a] Wenke Huang, Mang Ye, Zekun Shi, He Li, and Bo Du. Rethinking federated learning with domain shift: A prototype view. In CVPR, 2023.
  • Huang et al. [2023b] Wenke Huang, Mang Ye, Zekun Shi, Guancheng Wan, He Li, Bo Du, and Qiang Yang. A federated learning for generalization, robustness, fairness: A survey and benchmark. arXiv, 2023.
  • Kairouz et al. [2019] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • Khosla et al. [2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. volume 33, pages 18661–18673, 2020.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • Kipf and Welling [2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Kornblith et al. [2019] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In ICML, pages 3519–3529, 2019.
  • Li et al. [2020] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.
  • Li et al. [2021] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In CVPR, pages 10713–10722, 2021.
  • Liu et al. [2022a] Yixin Liu, Ming **, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and Philip Yu. Graph self-supervised learning: A survey. IEEE TKDE, 2022.
  • Liu et al. [2022b] Yixin Liu, Yu Zheng, Daokun Zhang, Hongxu Chen, Hao Peng, and Shirui Pan. Towards unsupervised deep graph structure learning. In Proceedings of the ACM Web Conference 2022, pages 1392–1403, 2022.
  • Liu et al. [2023a] Yixin Liu, Kaize Ding, Huan Liu, and Shirui Pan. Good-d: On unsupervised graph out-of-distribution detection. In WSDM, 2023.
  • Liu et al. [2023b] Yixin Liu, Yizhen Zheng, Daokun Zhang, Vincent Lee, and Shirui Pan. Beyond smoothing: Unsupervised graph representation learning with edge heterophily discriminating. In AAAI, 2023.
  • Liu et al. [2024] Zewen Liu, Guancheng Wan, B Aditya Prakash, Max SY Lau, and Wei **. A review of graph neural networks in epidemic modeling. arXiv preprint arXiv:2403.19852, 2024.
  • McCallum et al. [2000] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, pages 1273–1282, 2017.
  • Mullapudi et al. [2019] Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference. In ECCV, pages 3573–3582, 2019.
  • Panagopoulos et al. [2021] George Panagopoulos, Giannis Nikolentzos, and Michalis Vazirgiannis. Transfer graph neural networks for pandemic forecasting. In AAAI, pages 4838–4845, 2021.
  • Park et al. [2019] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.
  • Perozzi et al. [2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In ACM SIGKDD, pages 701–710, 2014.
  • Reddi et al. [2021] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. In ICLR, 2021.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. AoMS, pages 400–407, 1951.
  • Romero et al. [2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • Tan et al. [2023] Yue Tan, Yixin Liu, Guodong Long, **g Jiang, Qinghua Lu, and Chengqi Zhang. Federated learning on non-iid graphs via structural knowledge sharing. In AAAI, 2023.
  • Tejankar et al. [2021] A**kya Tejankar, Soroush Abbasi Koohpayegani, Vipin Pillai, Paolo Favaro, and Hamed Pirsiavash. Isd: Self-supervised learning by iterative similarity distillation. In ICCV, 2021.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
  • Veličković et al. [2017] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • Wan et al. [2024a] Guancheng Wan, Wenke Huang, and Mang Ye. Federated graph learning under domain shift with generalizable prototypes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 15429–15437, 2024.
  • Wan et al. [2024b] Guancheng Wan, Yijun Tian, Wenke Huang, Nitesh V Chawla, and Mang Ye. S3gcl: Spectral, swift, spatial graph contrastive learning. In ICML, 2024.
  • Wang and Yoon [2021] Lin Wang and Kuk-** Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE TPAMI, 2021.
  • Xie et al. [2021] Han Xie, **g Ma, Li Xiong, and Carl Yang. Federated graph classification over non-iid graphs. NeurIPS, 34:18839–18852, 2021.
  • Ye et al. [2019] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, pages 6210–6219, 2019.
  • Ye et al. [2022] Mang Ye, Jianbing Shen, Xu Zhang, Pong C. Yuen, and Shih-Fu Chang. Augmentation invariant and instance spreading feature for softmax embedding. IEEE TPAMI, 44(2):924–939, 2022.
  • Zhang et al. [2021] Ke Zhang, Carl Yang, Xiaoxiao Li, Lichao Sun, and Siu Ming Yiu. Subgraph federated learning with missing neighbor generation. NeurIPS, 34:6671–6682, 2021.
  • Zhu et al. [2020] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. In ICML, 2020.
  • Zhu et al. [2021a] Wei Zhu, Andrew White, and Jiebo Luo. Federated learning of molecular properties in a heterogeneous setting. arXiv preprint arXiv:2109.07258, 2021.
  • Zhu et al. [2021b] Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An empirical study of graph contrastive learning. arXiv preprint arXiv:2109.01116, 2021.

APPENDIX

Appendix A Notation Table

Notations Description
hhitalic_h node-level feature, xd𝑥superscript𝑑x\in\mathbb{R}^{d}italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT
z𝑧zitalic_z output logit
d𝑑ditalic_d dimension of hhitalic_h
cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT label
𝒞𝒞\mathcal{C}caligraphic_C classes collection
visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT the node with indicator i𝑖iitalic_i in graph
Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT the neighborhoods of the node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
j𝑗jitalic_j the neighbors indicator in 𝒜isubscript𝒜𝑖\mathcal{A}_{i}caligraphic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
A𝐴Aitalic_A adjacency matrix
l𝑙litalic_l the layer of the model
G()𝐺G(\cdot)italic_G ( ⋅ ) the GNN feature extractor
F()𝐹F(\cdot)italic_F ( ⋅ ) the GNN classifier head
M𝑀Mitalic_M total number of clients
m𝑚mitalic_m the index of client
Dmsuperscript𝐷𝑚D^{m}italic_D start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT private data of the m𝑚mitalic_m client
R𝑅Ritalic_R communication round
θmsuperscript𝜃𝑚\theta^{m}italic_θ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT model parameter of client m𝑚mitalic_m
E𝐸Eitalic_E local training epochs
σ()𝜎\sigma(\cdot)italic_σ ( ⋅ ) ReLU activation function, σ()=max{,0}𝜎0\sigma(\cdot)=\max\{\cdot,0\}italic_σ ( ⋅ ) = roman_max { ⋅ , 0 }
τ𝜏\tauitalic_τ the parameter in contrastive method
ω𝜔\omegaitalic_ω the temperature in knowledge distillation
Aug()𝐴𝑢𝑔Aug(\cdot)italic_A italic_u italic_g ( ⋅ ) the augmentation module for graph
Augs()𝐴𝑢subscript𝑔𝑠Aug_{s}(\cdot)italic_A italic_u italic_g start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( ⋅ ) the stronger augmentation module for graph
Augw()𝐴𝑢subscript𝑔𝑤Aug_{w}(\cdot)italic_A italic_u italic_g start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( ⋅ ) the weaker augmentation module for graph
𝒢~~𝒢\widetilde{\mathcal{G}}over~ start_ARG caligraphic_G end_ARG augmented graph
g𝑔gitalic_g the global signal
λCsubscript𝜆𝐶\lambda_{C}italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT the LiFNSCsubscriptsuperscript𝐿𝐹𝑁𝑆𝐶𝑖L^{FNSC}_{i}italic_L start_POSTSUPERSCRIPT italic_F italic_N italic_S italic_C end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT loss weight
λDsubscript𝜆𝐷\lambda_{D}italic_λ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT the LiFGSDsubscriptsuperscript𝐿𝐹𝐺𝑆𝐷𝑖L^{FGSD}_{i}italic_L start_POSTSUPERSCRIPT italic_F italic_G italic_S italic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT loss weight
𝑷𝑷\bm{P}bold_italic_P the collection of the positive indicators
𝑲𝑲\bm{K}bold_italic_K the collection of the negative indicators
p𝑝pitalic_p the indicator of positive sample
k𝑘kitalic_k the indicator of negative sample
S𝑆Sitalic_S the similarity of the selected node vector
η𝜂\etaitalic_η learning rate
Table 4: Notations occurred in the paper.