\volumeheader

360

Meta-GCN: A Dynamically Weighted Loss Minimization Method for Dealing with the Data Imbalance in Graph Neural Networks

Abstract.

Although many real-world applications, such as disease prediction and fault detection, suffer from class imbalance, most existing graph-based classification methods ignore the skewness of the class distribution and therefore tend to be biased towards the majority class(es). Conventional methods typically tackle this problem by assigning a weight to each sample based on a function of its loss, which can lead to over-fitting on outliers. In this paper, we propose a meta-learning algorithm, named Meta-GCN, that adaptively learns the example weights by minimizing the loss on a small unbiased meta-data set while simultaneously optimizing the model weights. Our experiments show that Meta-GCN outperforms state-of-the-art frameworks and other baselines in terms of accuracy, the area under the receiver operating characteristic (AUC-ROC) curve, and macro F1-score for classification tasks on two different datasets.

Keywords: Graph neural network, Graph convolutional networks, Node classification, Imbalanced class label, Meta-learning
Mahdi Mohammadizadeh\upstairs\affilone, Arash Mozhdehi\upstairs\affilone, Yani Ioannou\upstairs\affilone, Xin Wang\upstairs\affilone,*
\upstairs\affilone University of Calgary
\copyrightnotice

1. Introduction

Graph-based data structures are ubiquitously used for modeling the pair-wise relations between entities in a variety of real-world applications, including social networks [tong2020stratlearner], citation networks [khan2019multi], and protein-protein interactions [zaki2021identifying]. Graph structures can effectively describe the complex relationships between objects, i.e., nodes, through edges. In addition, graph-based representation is an effective method for feature dimensionality reduction [zhang2018feature, 4016549]. Graph neural networks (GNNs), as powerful tools for representation learning on graph-structured data, have attracted increasing attention in recent years. GNNs are used for deep representation learning in graph analysis tasks such as node classification, link prediction, and clustering in Euclidean and non-Euclidean domains [zhou2020graph]. Among the proposed methods for learning representations on graphs, graph convolutional networks (GCNs), proposed by Kipf and Welling [welling2016semi], proved to be a simple and effective GNN model. This model is able to learn hidden representations comprising both node features and local graph structure while scaling linearly with the number of edges in the given graph. Most GNN classification algorithms minimize the average loss over all training examples, which produces reasonable outcomes for class-balanced datasets. However, various real-world classification tasks, such as disease prediction [kazi2019inceptiongcn], fraud detection [jiang2019anomaly], and fault detection [chen2019fault], manifest highly skewed class distributions. In settings that exhibit such class imbalance, these methods favor the majority classes while neglecting the minority ones. In addition, over-smoothing, which is a general issue in GNNs, can be aggravated by data imbalance, as the representations of minority class nodes become similar to those of majority ones [10.1145/1122445.1122456].
Hence, these classifiers cannot clearly differentiate the boundary between the minority and majority classes. However, in applications like disease prediction, fault detection, and fraud detection, the minority classes are important, and classifying them correctly is crucial. As an alternative to the existing methods that aim to alleviate class imbalance for node classification on graph-structured data, we propose an algorithm-level method, called Meta-GCN, that uses a meta-learning algorithm to adaptively assign weights to the training examples in a way that minimizes the aggregated loss of an unbiased example set sampled from a meta-data set.

Our contributions are threefold:

  1. We propose a general-purpose online re-weighting algorithm for semi-supervised node classification in graphs. It learns a weighted loss function whose weights are learned in a meta-learning manner through the use of a small unbiased meta-data set.

  2. We propose a novel graph-based sampling method that constructs the meta-data set from a designated portion of the dataset.

  3. Through our experiments, we demonstrate that Meta-GCN outperforms state-of-the-art frameworks and other baselines in terms of accuracy, the area under the receiver operating characteristic (AUC-ROC) curve, and macro F1-score for classification tasks on two different datasets.

2. Related Work

Methods proposed for addressing class imbalance on both graph and non-graph data fall into data-level, algorithm-level, or hybrid methods. Over-sampling and under-sampling are among the data-level solutions. While over-sampling methods aim to balance the ratio of classes by including more examples from minority classes in training, under-sampling solutions remedy the disproportion by removing instances of the majority class(es). Over-sampling by replicating examples is known to overfit; hence, the synthetic minority over-sampling technique (SMOTE) [chawla2002smote] was proposed to overcome this issue by generating synthetic minority examples through interpolation between neighboring minority examples. Several variants have been proposed to improve SMOTE. However, over-sampling methods are error-prone because they synthesize examples close to the class boundary. Under-sampling methods are therefore often preferred [huang2019deep]; however, removing examples can discard valuable instances required for discrimination and, as a result, lead to poor generalization. On the other hand, re-weighting methods, as algorithm-level solutions, aim to minimize a weighted loss on the training samples by assigning weights in a manner that pays more attention to minority examples. AdaBoost [freund1996experiments] is a re-weighting-based approach that creates an ensemble of classifiers and assigns higher weights to misclassified instances. In [6252738], the authors improved AdaBoost with a hybrid method that first combines it with over-sampling and then uses an optimization algorithm to further tune the class-specific weights. [ren2018learning] proposed an algorithm that re-weights the training examples online based on gradient direction using a small clean validation set. Alternatively, in [9920039], the authors proposed a generic method for evaluating the semantic completeness of datasets.
Since these methods assume that the data is i.i.d., they are not applicable to graph-based representations. As GNN-based classifiers have emerged only recently, few works have investigated class imbalance in this area. GraphSMOTE [zhao2021graphsmote], a data-level approach for graph-structured data, generates synthetic data in the embedding space. Such synthetic-data methods have proved to be error-prone. DR-GCN [9766044] is an algorithm-level approach that uses class-conditioned adversarial regularizers to overcome the imbalance in graphs. Because it relies on adversarial training, this approach is shown to suffer from instability and is unable to scale to large graphs. In contrast to most of the approaches proposed for data imbalance, Meta-GCN targets semi-supervised node classification in graphs and does not assume that the data is i.i.d. Unlike GraphSMOTE, our proposed method is an algorithm-level approach and does not generate synthetic data.

3. Method

In this section, we introduce the definitions, state the problem, and present our solution. Assume a feature matrix $\mathcal{X}\in\mathbb{R}^{N\times F}$, with an imbalanced distribution of labels, where $N$ and $F$ are respectively the number of examples in the dataset and the number of features assigned to each of them. Let $A\in\{0,1\}^{N\times N}$ be a binary adjacency matrix, showing the connectivity between the examples in the dataset. For given examples $i$ and $j$, the entry $A_{i,j}$ is 1 iff the examples are connected and 0 otherwise. Given the feature matrix $\mathcal{X}$ and the adjacency matrix $A$, we construct the undirected and unweighted graph $\mathcal{G}=(V,E,\mathcal{X})$, where $V=\{1,\dots,N\}$ stands for the set of vertices while $E$ is the edge set of the graph. Let the vector $\mathcal{Y}\in\mathbb{R}^{N}$ be the ground-truth class labels corresponding to each vertex of the graph $\mathcal{G}$.

Let $\mathcal{G}^{meta}=(V^{meta},E^{meta},\mathcal{X}^{meta})$ be the meta-graph constructed from an unbiased meta-data set $\mathcal{X}^{meta}$ of size $M$ with the adjacency matrix $A^{meta}$. The feature set $\mathcal{X}^{meta}$ is constructed through unbiased sampling from a separate portion of the whole dataset, apart from the training and validation sets. $V^{meta}$ is the set of vertices, each of which corresponds to one entry of $\mathcal{X}^{meta}$.
There is an edge $e_{i,j}^{meta}$ in the edge set $E^{meta}$ if and only if $i,j\in V^{meta}$. We represent the ground-truth label set for the meta-data set by $\mathcal{Y}^{meta}\in\mathbb{R}^{M}$. Given the adjacency matrix $A$ and the feature matrix $\mathcal{X}$, we define a GCN model $f_{\theta}(\mathcal{X},A)$, where $\theta$ is the set of learnable model parameters.
Let $\hat{y}=f_{\theta}(\mathcal{X},A)$ and $\hat{y}^{meta}=f_{\theta}(\mathcal{X}^{meta},A^{meta})$ be the predicted labels for the training set and meta-data set, respectively. Assume that $\mathcal{L}(\hat{\mathcal{Y}},\mathcal{Y})$ is the loss function for the training data, and $\mathcal{L}^{meta}(\hat{\mathcal{Y}}^{meta},\mathcal{Y}^{meta})$ is that for the meta-data set. For this semi-supervised node classification problem, we use the cross-entropy loss function.
We define an individual loss $l_{i}(\hat{y}_{i},y_{i})$ for each training example $i$ and $l^{meta}_{j}(\hat{y}^{meta}_{j},y^{meta}_{j})$ for each meta-data example $j$, where $y_{i}\in\mathcal{Y}$, $y_{j}^{meta}\in\mathcal{Y}^{meta}$, $\hat{y}_{i}\in\hat{\mathcal{Y}}$, and $\hat{y}_{j}^{meta}\in\hat{\mathcal{Y}}^{meta}$. We define our problem as minimizing a weighted loss parameterized by $w=\{w_{i}\mid 1\leq i\leq N\}$, as

\begin{equation}
\theta^{*}(w)=\arg\min_{\theta}\sum_{i=1}^{N}w_{i}\,l_{i}(\theta). \tag{1}
\end{equation}
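As a small illustration of the weighted objective in Eq. (1), the following sketch evaluates $\sum_i w_i l_i$ for a handful of nodes under two weightings. The function names `cross_entropy` and `weighted_loss`, the predicted probabilities, and the weight vectors are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
import math

def cross_entropy(prob_pos, label):
    """Binary cross-entropy l_i for one node, given P(y=1) and a 0/1 label."""
    p = prob_pos if label == 1 else 1.0 - prob_pos
    return -math.log(p)

def weighted_loss(probs, labels, weights):
    """Objective inside Eq. (1): sum_i w_i * l_i."""
    return sum(w * cross_entropy(p, y) for p, y, w in zip(probs, labels, weights))

# Four training nodes: three majority (label 0) and one minority (label 1).
probs = [0.1, 0.2, 0.1, 0.3]            # made-up predicted P(y=1) per node
labels = [0, 0, 0, 1]
uniform = [0.25, 0.25, 0.25, 0.25]      # plain average loss
minority_heavy = [0.1, 0.1, 0.1, 0.7]   # more attention on the minority node

loss_uniform = weighted_loss(probs, labels, uniform)
loss_minority = weighted_loss(probs, labels, minority_heavy)
print(loss_uniform, loss_minority)
```

Because the minority node is the one the model predicts poorly, shifting weight onto it increases its share of the objective, which is the effect a re-weighting method relies on.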

At each step $t$, $\theta$ is updated as the estimated weights are updated through:

\begin{align}
\theta_{t+1}(w) &= \theta_{t}(w)-\alpha\nabla\sum_{i=1}^{N}w_{i,t}\,l_{i}(f_{\theta}(\mathcal{X},A),y_{i}), \tag{2}\\
\hat{\theta}_{t+1}(\gamma) &= \theta_{t}(\gamma)-\alpha\nabla\sum_{i=1}^{N}\gamma_{i,t}\,l_{i}(f_{\theta}(\mathcal{X},A),y_{i}), \tag{3}\\
\tilde{w}_{i,t} &= \max\Big(0,\,-\eta\,\frac{\partial}{\partial\gamma_{i,t}}\frac{1}{M}\sum_{j=1}^{M}l^{meta}_{j}(f^{meta}_{\hat{\theta}}(\mathcal{X}^{meta},A^{meta}),y^{meta}_{j})\Big), \tag{4}\\
w_{i,t} &= \frac{\tilde{w}_{i,t}}{(\sum_{j=1}^{M}\tilde{w}_{j,t})+\delta(\sum_{j=1}^{M}\tilde{w}_{j,t})}, \tag{5}
\end{align}

where $\delta(z)$ is 1 if $z=0$ and 0 otherwise, $\alpha$ is the learning rate of the model parameters, $\eta$ is the meta-learning rate, and $\gamma$ is the weight-perturbing parameter. For the GCN model, we define the adjacency matrix with added self-loops as $\tilde{A}=A+I_{N}$. A degree matrix $\tilde{D}_{i,i}=\sum_{j=1}^{N}\tilde{A}_{i,j}$ can be defined for each node $i$. We also calculate the renormalized adjacency matrix $\hat{A}=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$. Considering an $n$-layer GCN model for node classification, we compute the output for the layer $l$ as

\begin{equation}
Z^{l}=\sigma(\hat{A}Z^{l-1}\theta^{l}), \tag{6}
\end{equation}

where $\sigma(\cdot)$ and $\theta^{l}$ are the layer's activation function and weights, respectively, and $Z^{0}=\mathcal{X}$ is the input feature matrix. Finally, we compute the classifier's output through $f_{\theta}(\mathcal{X},A)=\mathrm{Sigmoid}(Z^{n})$. Figure 1 depicts the overall training process of Meta-GCN.
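The propagation rule in Eq. (6) can be sketched in a few lines of plain Python. This is a minimal illustration on a three-node path graph, not the implementation used in the experiments: the helper names (`matmul`, `normalize_adjacency`, `gcn_layer`), the toy graph, the features, and the layer weights are all assumptions made for the example.

```python
import math

def matmul(a, b):
    """Dense matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Return A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # diagonal of D~
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def relu(z):
    return [[max(0.0, v) for v in row] for row in z]

def gcn_layer(a_hat, z_prev, theta, activation):
    """One step of Eq. (6): Z^l = sigma(A_hat Z^{l-1} theta^l)."""
    return activation(matmul(matmul(a_hat, z_prev), theta))

# Toy path graph 0 - 1 - 2 with two features per node.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]
theta = [[1.0], [-0.5]]  # made-up weights: 2 features -> 1 hidden unit

a_hat = normalize_adjacency(adj)
hidden = gcn_layer(a_hat, features, theta, relu)
print(hidden)
```

Each output row mixes a node's own features with those of its neighbors, scaled by the inverse square roots of the two degrees, before the layer weights and the activation are applied.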

Figure 1. The overall procedure of Meta-GCN’s training in a meta-learning manner using a small unbiased meta-data set.
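One update of Eqs. (2)-(5) can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: a linear logistic model stands in for the GCN so that gradients can be written by hand; the perturbation $\gamma$ is evaluated at zero, as in the re-weighting scheme of [ren2018learning], so the look-ahead parameters coincide with $\theta_t$ and the meta-gradient reduces to an inner product of gradients; and the normalization runs over the training-example weights just computed. All names (`reweight_step`, `example_grad`) and data values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def example_grad(theta, x, y):
    """Gradient of the cross-entropy loss of a logistic model w.r.t. theta."""
    err = sigmoid(dot(theta, x)) - y
    return [err * xi for xi in x]

def reweight_step(theta, train, meta, alpha=0.5, eta=1.0):
    """One update following Eqs. (2)-(5); returns (new_theta, weights)."""
    train_grads = [example_grad(theta, x, y) for x, y in train]
    # Eq. (3) with gamma = 0 (assumption): look-ahead parameters equal theta.
    theta_hat = list(theta)
    meta_grads = [example_grad(theta_hat, x, y) for x, y in meta]
    meta_grad = [sum(g[k] for g in meta_grads) / len(meta)
                 for k in range(len(theta))]
    # Chain rule: dL_meta/dgamma_i = -alpha * <meta_grad, train_grad_i>,
    # so Eq. (4) becomes max(0, eta * alpha * <meta_grad, train_grad_i>).
    raw = [max(0.0, eta * alpha * dot(meta_grad, g)) for g in train_grads]
    total = sum(raw)
    norm = total + (1.0 if total == 0.0 else 0.0)  # Eq. (5), delta term
    weights = [r / norm for r in raw]
    # Eq. (2): gradient step on the weighted training loss.
    new_theta = [t - alpha * sum(w * g[k] for w, g in zip(weights, train_grads))
                 for k, t in enumerate(theta)]
    return new_theta, weights

# Features are [bias, x]; three majority nodes (label 0), one minority (label 1).
train = [([1.0, -1.0], 0), ([1.0, -1.2], 0), ([1.0, -0.8], 0), ([1.0, 1.0], 1)]
meta = [([1.0, -1.0], 0), ([1.0, 1.0], 1)]  # class-balanced meta-data set
theta = [-1.0, 0.0]                          # starts biased towards class 0

new_theta, weights = reweight_step(theta, train, meta)
print(weights)
```

In this toy setting the model initially predicts the majority class everywhere, so the gradient of the minority node aligns most strongly with the meta-gradient and that node receives the largest weight.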

4. Experiments

To compare the node classification performance of our method with that of the baselines, we ran experiments on two medical datasets, namely Haberman and Diabetes, using Meta-GCN and five baselines. For the SMOTE method, we used the scikit-learn library in Python [pedregosa2011scikit]. For training, validation, testing, and meta-set generation, we respectively used 60%, 10%, 20%, and 10% of the data. For GraphSMOTE and GCN, we used the hyperparameters reported in the original publications, and the evaluations were done with the source code shared by the authors. For both over-sampling methods, SMOTE and GraphSMOTE, we used an over-sampling scale of 0.8, as this was the best-performing value. We adopted a two-layer GCN model with 32 hidden units. For the activation function of the layers, we used the Rectified Linear Unit (ReLU). We summarize the results for the three metrics, accuracy, macro F1, and AUC-ROC, in Table 1. Our method, Meta-GCN, outperformed all five baselines on all three metrics. Meta-GCN's higher macro F1-score on both datasets suggests that our approach performs better for the minority classes as well as the majority classes. MLP performed worse than the graph-based baselines on both datasets for all three metrics. Since the SMOTE over-sampling method simply interpolates between neighboring nodes without considering the graph structure, its performance is worse than that of the GCN without over-sampling.
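The 60/10/20/10 partition and the construction of a class-balanced meta-data set can be sketched as below. The details of the sampling procedure are not given in this section, so the sketch only assumes what is stated elsewhere in the text: the meta-data set is drawn from its own portion of the data, and every class is sampled with equal probability. The function names, the seed, and the synthetic labels are hypothetical.

```python
import random

def split_indices(n, fractions=(0.6, 0.1, 0.2, 0.1), seed=0):
    """Shuffle node indices and cut them into train/val/test/meta-pool parts."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    parts, start = [], 0
    for frac in fractions[:-1]:
        end = start + int(round(frac * n))
        parts.append(idx[start:end])
        start = end
    parts.append(idx[start:])  # the last part takes the remainder
    return parts

def balanced_meta_set(pool, labels, per_class, seed=0):
    """Draw the same number of nodes from every class present in the pool."""
    rng = random.Random(seed)
    by_class = {}
    for i in pool:
        by_class.setdefault(labels[i], []).append(i)
    # Cap the draw at the size of the rarest class so all classes match.
    k = min(per_class, min(len(nodes) for nodes in by_class.values()))
    chosen = []
    for cls in sorted(by_class):
        chosen.extend(rng.sample(by_class[cls], k))
    return chosen

# Synthetic imbalanced labels: 225 majority nodes and 75 minority nodes.
labels = [0] * 225 + [1] * 75
train_idx, val_idx, test_idx, meta_pool = split_indices(len(labels))
meta_set = balanced_meta_set(meta_pool, labels, per_class=5)
print(len(train_idx), len(val_idx), len(test_idx), len(meta_pool), len(meta_set))
```

Capping the per-class draw at the size of the rarest class keeps the meta-data set balanced even when the pool itself is skewed.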

For the Diabetes dataset, it is evident that utilizing the graph structure led to substantial improvement. Since GraphSMOTE constructs an embedding space for interpolation and considers the graph structure, its performance is significantly better than that of SMOTE on this dataset, where much information can be gained from the graph. However, in terms of accuracy, GraphSMOTE does not perform significantly better than GCN-Weighted. It performs no better than the standard GCN in macro F1 (0.65 for both) and worse in AUC-ROC (0.70 versus 0.72). For Haberman, nodes have fewer features than for Diabetes. Comparing the macro F1 results of GCN and GCN-Weighted, it can be concluded that the effect of weighting is more considerable in this dataset. Judging by the insignificant improvement of the graph-based baselines on the three metrics, it can be inferred that the gain from the graph information in this dataset is limited. Limited information gain from the graph structure can also explain why the improvement of GraphSMOTE over SMOTE is insignificant. However, in this case, Meta-GCN, through the use of the meta-set, was able to significantly improve the performance on all three metrics. The considerable improvement in macro F1 by Meta-GCN demonstrates its superior performance in classifying both minority and majority classes. Comparing the macro F1 improvement on Haberman, which has a higher imbalance ratio, with that on Diabetes, the gain from weighting is more significant on the more imbalanced dataset.

Based on the results of this experiment, it can be concluded that: 1) No single metric is sufficient by itself for comparing and understanding the models. 2) Depending on the dataset, performance can be improved by gaining information from the graph structure. 3) Weighting methods yield larger gains on datasets with a higher imbalance ratio. 4) Meta-GCN significantly improved the classification performance in all three metrics, suggesting that it improves not only the overall accuracy of the model but also its power to discriminate both majority and minority classes.
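The first conclusion, that no single metric suffices, can be illustrated with a toy example comparing accuracy and macro F1 on an imbalanced label set. The predictions below are invented for illustration and are unrelated to the results in Table 1; `accuracy` and `macro_f1` are small hand-written helpers, not library calls.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Nine majority examples (label 0) and one minority example (label 1).
y_true = [0] * 9 + [1]
always_majority = [0] * 10               # ignores the minority class entirely
minority_aware = [0] * 7 + [1, 1] + [1]  # finds the minority, errs on two majority

for name, pred in [("always_majority", always_majority),
                   ("minority_aware", minority_aware)]:
    print(name, accuracy(y_true, pred), macro_f1(y_true, pred))
```

The classifier that always predicts the majority class has the higher accuracy but the lower macro F1, because its F1 on the minority class is zero.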

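\begin{tabular}{lcccccc}
\hline
 & \multicolumn{3}{c}{Diabetes} & \multicolumn{3}{c}{Haberman} \\
Methods & Accuracy & Macro F1 & AUC-ROC & Accuracy & Macro F1 & AUC-ROC \\
\hline
MLP & $0.58 \pm 0.07$ & $0.43 \pm 0.07$ & $0.55 \pm 0.07$ & $0.71 \pm 0.03$ & $0.46 \pm 0.04$ & $0.45 \pm 0.17$ \\
GCN & $0.70 \pm 0.10$ & $0.65 \pm 0.09$ & $0.72 \pm 0.07$ & $0.74 \pm 0.04$ & $0.47 \pm 0.05$ & $0.55 \pm 0.06$ \\
GCN-Weighted & $0.71 \pm 0.05$ & $0.64 \pm 0.24$ & $0.66 \pm 0.19$ & $0.71 \pm 0.05$ & $0.57 \pm 0.05$ & $0.46 \pm 0.09$ \\
SMOTE & $0.65$ & $0.57$ & $0.58$ & $0.71$ & $0.50$ & $0.52$ \\
GraphSMOTE & $0.72 \pm 0.14$ & $0.65 \pm 0.09$ & $0.70 \pm 0.27$ & $0.74 \pm 0.24$ & $0.47 \pm 0.05$ & $0.54 \pm 0.12$ \\
Meta-GCN (ours) & $0.74 \pm 0.17$ & $0.70 \pm 0.12$ & $0.75 \pm 0.15$ & $0.76 \pm 0.17$ & $0.65 \pm 0.07$ & $0.62 \pm 0.09$ \\
\hline
\end{tabular}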

Table 1. Comparison of different methods for imbalanced node classification.

5. Conclusion and Future Works

In this paper, we proposed a meta-learning-based method, named Meta-GCN, for dealing with class imbalance in semi-supervised node classification. The proposed method leverages a small unbiased meta-data set to adaptively learn the example weights by minimizing the meta-data set loss simultaneously with optimizing the model weights. This approach is general-purpose and applicable to any graph-structured dataset that suffers from class imbalance. Compared to traditional re-weighting methods, Meta-GCN is an end-to-end method and does not require any manual weight setting or extra hyperparameter search. Our empirical results demonstrate the superiority of Meta-GCN over representative and state-of-the-art approaches in terms of accuracy, macro F1, and AUC-ROC. The comparison of experimental results shows that our model is effective in semi-supervised node classification for graph-structured datasets with an imbalanced class distribution. There are several avenues for further investigation. First, the proposed sampling method chooses examples for each class with equal probability without considering the graph structure. We therefore intend to improve the sampling method to gain better performance. Second, the proposed method is designed for node classification, and we intend to extend our approach to other applications such as edge prediction or regression tasks.

\printbibliography[heading=subbibintoc]