CPT: Competence-progressive Training Strategy for Few-shot Node Classification

Qilong Yan$^{1}$, Yufeng Zhang$^{3}$, **ghao Zhang$^{3}$, **gpu Duan$^{2}$, Jian Yin$^{1}$

Abstract

Graph Neural Networks (GNNs) have made significant advancements in node classification, but their success relies on sufficient labeled nodes per class in the training data. Real-world graph data often exhibits a long-tail distribution with sparse labels, which makes few-shot node classification, i.e., categorizing nodes with limited labeled data, an important capability for GNNs. Traditional episodic meta-learning approaches have shown promise in this domain, but they face an inherent limitation: because tasks are assigned randomly and uniformly, without regard to their difficulty, the model may converge to suboptimal solutions. The meta-learner may face complex tasks too soon, which hinders proper learning. Ideally, the meta-learner should start with simple concepts and advance to more complex ones, as humans do. We therefore introduce CPT, a novel two-stage curriculum learning method that aligns task difficulty with the meta-learner’s progressive competence, enhancing overall performance. In CPT’s initial stage, the focus is on simpler tasks, fostering foundational skills for engaging with complex tasks later. The second stage then dynamically adjusts task difficulty based on the meta-learner’s growing competence, aiming for optimal knowledge acquisition. Extensive experiments on popular node classification datasets demonstrate significant improvements of our strategy over existing methods.

1 Introduction

In recent years, much research has focused on node classification, the task of predicting labels for unlabeled nodes in a graph. This task has many practical applications Yan et al. (2021); Zhang et al. (2020). For example, in bioinformatics Szklarczyk et al. (2019), predicting chemical properties for proteins in a protein network is an important problem that can be tackled using node classification. State-of-the-art methods for node classification often use Graph Neural Networks (GNNs) Wu et al. (2021); Xu et al. (2018) in a semi-supervised manner Kipf and Welling (2016). GNNs learn a representation for each node by aggregating information from its neighbors, and these representations are then used for classification. However, these methods usually require a sufficient number of labeled nodes for all classes to achieve good results Fan et al. (2019). In practice, we often have many labeled nodes for some classes but only a few for others. We refer to classes with many labels as base classes and classes with few labels as novel classes Wang et al. (2022). Since novel classes are common in real-world graphs, it is important for GNNs to be capable of classifying nodes even with limited labeled nodes, a setting known as the few-shot node classification problem.

Figure 1: Tasks influence a model’s intelligence differently depending on its growth stage, mirroring human learning patterns.

Most existing studies on few-shot node classification use the episodic meta-learning strategy Wang et al. (2022, 2020); Liu et al. (2021a). In this paradigm, the meta-learner learns from base classes through a series of meta-training tasks and is evaluated on meta-test tasks derived from novel classes. The aim of this scheme is to allow the meta-learner to extract meta-knowledge from base classes, enabling it to classify using just a few labeled nodes from the novel classes. Despite the success of existing methods, the episodic training strategy introduces some issues. Most fundamentally, episodic training selects tasks from base classes at random, neglecting the varied difficulty levels of these tasks. This differs significantly from the natural human learning process, where we typically begin with basic concepts and progressively advance to more complex ones. As a result, there is a risk that the meta-learner is exposed to challenging tasks too early and simpler tasks later, potentially leading to less effective knowledge acquisition. As shown in Figure 1, when the model is first initialized (akin to a toddler phase), it is not yet capable of handling difficult tasks, but it can acquire knowledge from easy tasks. At this stage, easy tasks are informative and beneficial for the model. After sufficient training, the model develops the intelligence to tackle more complex tasks, resembling a teenager phase. By then, easy tasks have become uninformative, and it is the hard tasks that truly contribute to enhancing its intelligence. Inspired by the human learning process, the idea of learning from easy to hard has been effectively implemented in curriculum learning Hacohen and Weinshall (2019) and self-paced learning Jiang et al. (2014). In these frameworks, individual samples are given different levels of importance (i.e., weights) to enhance generalization. Although curriculum learning has been studied for few-shot tasks Sun et al. (2019); Liu et al. (2020), these methods are not directly applicable to graph data, which has its own complex relational dependency patterns.

In this paper, we propose a Competence-Progressive Training strategy for few-shot node classification, named CPT. CPT aims to provide tasks of different difficulty levels to the meta-learner at different stages, matching its current abilities and alleviating the issue of suboptimal points. Specifically, CPT uses a two-stage strategy Bengio et al. (2009); Hu et al. (2022) to train a meta-learner. In the first stage, CPT follows previous research by randomly sampling nodes to construct meta-tasks and then employs the default training strategy (episodic learning), i.e., simply minimizing the given supervised loss, to produce a strong base meta-learner to start with. Although the base meta-learner may perform well on easy tasks and has gained some knowledge, it no longer benefits from these tasks, and it might converge to suboptimal points. The second stage therefore focuses on generating more difficult tasks and fine-tuning the meta-learner on them. Specifically, CPT increases the difficulty of each task by dropping edges from its graph. The underlying motivation for this approach is to create a larger number of "tail nodes" within the meta-training tasks. This alteration is intended to enrich the meta-learning model with additional tail-node meta-knowledge, as these nodes, with their fewer connections, typically present unique challenges in learning processes. Finally, CPT fine-tunes the base meta-learner by minimizing the loss over the increasingly hard task data. CPT can be easily implemented in the existing meta-learner training pipeline on graphs, as demonstrated in Algorithm 1. Additionally, CPT is compatible with any GNN-based meta-learner and can be employed with any desired supervised loss function.

Our main contributions can be summarized as follows:

• Problem. We explore the limitations of the episodic meta-learning scheme and discuss its associated issues in few-shot node classification. Specifically, the limitation of the episodic meta-learning scheme is that it samples tasks randomly and uniformly, which can lead the meta-learner to converge to a suboptimal point and thus make further optimization more challenging.
• Method. We present CPT, a novel two-stage curriculum learning strategy for few-shot node classification. CPT is the first method to solve such problems in few-shot tasks on graphs. Notably, this is a general strategy that can be applied to any GNN architecture and loss function.
• Experiments. We conduct experiments on four benchmark node classification datasets under the few-shot setting and demonstrate significant improvements of our strategy over existing methods.

2 Related Work

In this section, we briefly review some existing few-shot learning methods and curriculum learning methods.

2.1 Few-shot Learning on Graphs

Few-shot Learning (FSL) aims to apply knowledge from tasks with abundant supervision to new tasks with limited labeled data. FSL methods generally fall into two categories: metric-based approaches Sung et al. (2018); Liu et al. (2019), like Prototypical Networks Snell et al. (2017), which measure similarity between new instances and support examples, and meta-optimizer-based approaches Ravi and Larochelle (2017); Mishra et al. (2017), like MAML Finn et al. (2017), focusing on optimizing model parameters with limited examples. In the field of graphs, there have been several recent works proposing the application of few-shot learning to graph-based tasks Ma et al. (2020); Chauhan et al. (2020), and they have achieved notable success, particularly in the context of attributed networks. Among these works, GPN Ding et al. (2020) introduces the utilization of node importance derived from Prototypical Networks Snell et al. (2017), leading to improved performance. On the other hand, G-Meta Huang and Zitnik (2020) employs local subgraphs to learn node representations and enhances model generalization through meta-learning. TENT Wang et al. (2022) leverages both local subgraphs and prototypes to adapt to tasks, effectively addressing the task-variance issue. It considers three perspectives: node-level, class-level, and task-level.

Although there has been some progress in few-shot learning on graphs, existing methods all share a common issue: they sample tasks randomly without considering the distribution of task difficulty, which can lead to inefficient knowledge acquisition by the meta-learner. To address this, we propose a task scheduler based on curriculum learning that progresses from easy to difficult tasks, further enhancing the model’s knowledge acquisition efficiency.

2.2 Curriculum Learning

Inspired by the human learning process, curriculum learning (CL) aims to adopt a meaningful learning sequence, such as from easy to hard patterns Bengio et al. (2009). As a general and flexible plug-in, the CL strategy has proven its ability to improve model performance, generalization, robustness, and convergence in various scenarios, including image classification Cascante-Bonilla et al. (2022); Zhou et al. (2020), semantic segmentation Dai et al. (2019); Feng et al. (2020), and neural machine translation Guo et al. (2020); Platanios et al. (2019). Existing CL methods fall into two groups Li et al. (2023): (1) predefined CL Wei et al. (2023); Hu et al. (2022); Wang et al. (2023), which involves manually designing heuristic-based policies to determine the training order of data samples, and (2) automatic CL Qu et al. (2018); Vakil and Amiri (2022); Gu et al. (2021), which relies on computable metrics (e.g., the training loss) to dynamically design the curriculum for model training. CL for graph data is an emerging area of research. Recent studies have demonstrated its effectiveness in various graph-related tasks such as node classification Wei et al. (2023); Qu et al. (2018), link prediction Hu et al. (2022); Vakil and Amiri (2022), and graph classification Wang et al. (2023); Gu et al. (2021). However, applying CL methods to few-shot node classification still faces significant hurdles.

To overcome this, our approach introduces a pioneering training strategy that merges curriculum learning with few-shot learning on graphs, significantly enhancing the model’s performance in few-shot node classification scenarios.

3 Preliminaries

3.1 Problem Statement

Formally, let $G=(\mathcal{V},\mathcal{E},\mathcal{X})$ denote an attributed graph, where $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ is the set of edges, and $\mathcal{X}\in\mathbb{R}^{|\mathcal{V}|\times d}$ is the feature matrix of nodes with $d$ denoting the feature dimension. Moreover, we denote the entire set of node classes as $C$ , which can be further divided into two categories: $C_{b}$ and $C_{n}$ , where $C=C_{b}\cup C_{n}$ and $C_{b}\cap C_{n}=\emptyset$ . Here $C_{b}$ and $C_{n}$ denote the sets of base and novel classes, respectively. It is worth mentioning that the number of labeled nodes in $C_{b}$ is sufficient, while it is typically small in $C_{n}$ . Then we can formulate the studied problem of few-shot node classification as follows:

Few-shot Node Classification: Given an attributed graph $G=(\mathcal{V},\mathcal{E},\mathcal{X})$, our goal is to develop a machine learning model such that, after training on labeled nodes in $C_{b}$, the model can accurately predict labels for the nodes (i.e., query set $Q$) in $C_{n}$ with only a limited number of labeled nodes (i.e., support set $S$).

More specifically, if the support set $S$ contains exactly $K$ nodes for each of $N$ classes from $C_{n}$, and the query set $Q$ is sampled from these $N$ classes, the problem is called $N$-way $K$-shot node classification. Essentially, the objective of few-shot node classification is to learn a classifier that can be quickly adapted to $C_{n}$ with only limited labeled nodes. Thus, the crucial part is to learn transferable knowledge from $C_{b}$ and generalize it to $C_{n}$.

3.2 Episodic Learning

The meta-training and meta-test processes are conducted on a certain number of meta-training tasks and meta-test tasks, respectively. These meta-tasks share a similar structure, except that meta-training tasks are sampled from $C_{b}$ , while meta-test tasks are sampled from $C_{n}$ . The main idea of few-shot node classification is to keep the consistency between meta-training and meta-test to improve the generalization performance.

To construct a meta-training (or meta-test) task $\mathcal{T}_{t}$ , we first randomly sample $N$ classes from $C_{b}$ (or $C_{n}$ ). Then we randomly sample $K$ nodes from each of the $N$ classes (i.e., $N$ -way $K$ -shot) to establish the support set $\mathcal{S}_{t}$ . Similarly, the query set $Q_{t}$ consists of $Q$ different nodes (distinct from $\mathcal{S}_{t}$ ) from the same $N$ classes. The components of the sampled meta-task $\mathcal{T}_{t}$ can be denoted as follows:

$\mathcal{S}_{t}=\{(v_{1},y_{1}),(v_{2},y_{2}),\ldots,(v_{N\times K},y_{N\times K})\},$
$Q_{t}=\{(q_{1},y^{\prime}_{1}),(q_{2},y^{\prime}_{2}),\ldots,(q_{Q},y^{\prime}_{Q})\},$
$\mathcal{T}_{t}=\{\mathcal{S}_{t},Q_{t}\},$

where $v_{i}$ (or $q_{i}$ ) is a node in $\mathcal{V}$ , and $y_{i}$ (or $y_{i}^{{}^{\prime}}$ ) is the corresponding label. In this way, the whole training process is conducted on a set of $T$ meta-training tasks $\mathcal{T}_{train}=\{\mathcal{T}_{t}\}^{T}_{t=1}$ . After training, the model has learned the transferable knowledge from $\mathcal{T}_{train}$ and will generalize it to meta-test tasks $\mathcal{T}_{test}=\{\mathcal{T}_{t}^{{}^{\prime}}\}^{T_{test}}_{t=1}$ sampled from $C_{n}$ .
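
The task construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `sample_task`, the `labels` dictionary, and the argument names are hypothetical, and the sketch assumes every sampled class has at least $K$ labeled nodes and that enough nodes remain to fill the query set.

```python
import random
from collections import defaultdict


def sample_task(labels, classes, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot meta-task (support set, query set).

    labels:  dict mapping node id -> class label (hypothetical input format)
    classes: class pool to draw from (C_b for meta-training, C_n for meta-test)
    n_query: total number of query nodes, i.e. Q in the text
    """
    nodes_by_class = defaultdict(list)
    for node, label in labels.items():
        nodes_by_class[label].append(node)

    # Randomly pick the N classes of this task.
    task_classes = rng.sample(sorted(classes), n_way)

    support, remaining = [], []
    for c in task_classes:
        nodes = nodes_by_class[c][:]
        rng.shuffle(nodes)
        support += [(v, c) for v in nodes[:k_shot]]    # K support nodes per class
        remaining += [(v, c) for v in nodes[k_shot:]]  # candidates for the query set

    # Q query nodes from the same N classes, disjoint from the support set.
    query = rng.sample(remaining, n_query)
    return support, query


# Toy usage: 60 nodes, 6 classes, a 2-way 3-shot task with 5 query nodes.
labels = {v: v % 6 for v in range(60)}
support, query = sample_task(labels, classes=range(4), n_way=2, k_shot=3,
                             n_query=5, rng=random.Random(0))
```

Passing the random generator explicitly keeps the sampled tasks reproducible, which matters when the same meta-test tasks must be reused across methods.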

4 Methodology

In this section, we introduce CPT which uses a curriculum learning strategy to better train a Graph Neural Network meta-learner (GNN meta-learner) in a few-shot node classification problem. In general, CPT applies the two-stage strategy to train a GNN meta-learner. The first stage aims to produce a strong base GNN meta-learner that performs well on easy tasks. In the first stage, the meta-learner has acquired knowledge of easy tasks and gained some preliminary abilities, enabling it to tackle more challenging tasks. Then, in the second stage, the meta-learner focuses on increasing the difficulty of tasks, progressively learning more complex tasks based on its growing competencies. This strategy allows the meta-learner to gradually adapt and master increasingly challenging tasks.

4.1 First stage: Default meta-training.

In the first stage, we aim to obtain a base meta-learner $f_{\theta}$ that has some preliminary abilities, enabling it to tackle more challenging tasks in the second stage. Specifically, we use the episodic meta-learning scheme to converge the GNN meta-learner on foundational tasks. These foundational tasks are considered easy because the tasks in the second stage are designed to increase in difficulty based on them. Next, we calculate the loss based on the model’s loss function. For the sake of brevity, we employ the cross-entropy loss as the loss function for the model when considering the base classes $C_{b}$ . Formally, this can be expressed as follows:

$$Z_{i}=\mathrm{softmax}\left(f_{\theta}(h_{i})\right) \tag{1}$$

$$\mathcal{L}=-\sum_{i=1}^{Q}\sum_{j=1}^{|C_{b}|}Y_{i,j}\log(Z_{i,j}) \tag{2}$$

where $Z_{i}$ is the probability that the $i$ -th query node in $Q$ belongs to each class in $C_{b}$ . $Y_{i,j}=1$ if the $i$ -th node belongs to the $j$ -th class, and $Y_{i,j}=0$ , otherwise. $Z_{i,j}$ is the $j$ -th element in $Z_{i}$ . Then we update meta-learner parameters $\theta$ via one gradient descent step in task $\mathcal{T}_{i}$ :

$$\theta^{\prime}=\theta-\alpha_{1}\frac{\partial\mathcal{L}_{\mathcal{T}_{i}}(f_{\theta})}{\partial\theta} \tag{3}$$

where $\alpha_{1}$ is the task-learning rate and the model parameters are trained to optimize the performance of $f_{\theta^{{}^{\prime}}}$ across meta-training tasks. The model parameters $\theta$ are updated as follows:

$$\theta=\theta-\alpha_{2}\frac{\partial\sum_{\mathcal{T}_{i}\sim p(\mathcal{T}_{train})}\mathcal{L}_{\mathcal{T}_{i}}(f_{\theta^{\prime}_{i}})}{\partial\theta} \tag{4}$$

where $\alpha_{2}$ is the meta-learning rate and $p(\mathcal{T}_{train})$ is the distribution of meta-training tasks. The detailed training procedure is described in lines 2-8 of Algorithm 1.
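
To make the two updates in Eq. (3) and Eq. (4) concrete, the following toy sketch replaces the GNN meta-learner with a single scalar parameter and the cross-entropy loss with a quadratic loss, so that the gradients can be written by hand. Everything here (`task_grad`, `meta_step`, the quadratic loss, the numeric values) is an illustrative assumption and not the paper's implementation.

```python
def task_grad(theta, target):
    """Gradient of the toy task loss L_T(theta) = (theta - target)^2."""
    return 2.0 * (theta - target)


def meta_step(theta, targets, alpha1, alpha2):
    """One meta-update over a batch of toy tasks, mirroring Eq. (3) and Eq. (4)."""
    meta_grad = 0.0
    for target in targets:
        # Eq. (3): task-level adaptation with the task-learning rate alpha1.
        theta_prime = theta - alpha1 * task_grad(theta, target)
        # Chain rule for Eq. (4): dL(theta')/dtheta = L'(theta') * dtheta'/dtheta,
        # where dtheta'/dtheta = 1 - 2 * alpha1 for the quadratic toy loss.
        meta_grad += task_grad(theta_prime, target) * (1.0 - 2.0 * alpha1)
    # Eq. (4): meta-level update with the meta-learning rate alpha2.
    return theta - alpha2 * meta_grad


# One step from theta = 0 on two toy tasks with optima at 1.0 and 3.0.
new_theta = meta_step(0.0, targets=[1.0, 3.0], alpha1=0.1, alpha2=0.05)

# Repeating the step moves theta toward the point that is best after adaptation.
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, targets=[1.0, 3.0], alpha1=0.1, alpha2=0.05)
```

In a real pipeline the hand-written derivative would be replaced by automatic differentiation through the adapted parameters; the structure of the two nested updates stays the same.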

4.2 Second stage: Competence-progressive training.

After the first stage, although the meta-learner performs well on foundational tasks, it is prone to converging to suboptimal points. To address this issue, we generate increasingly hard tasks to help the meta-learner break through suboptimal points. There are many possible ways of controlling the difficulty of meta-tasks in graph data. In our approach, we use node degree, i.e., the number of immediate neighbors of a node in the graph, as the key factor for modulating task complexity. Our investigations reveal an implicit correlation between the degree of nodes and their performance. Specifically, nodes with fewer connections, termed "tail nodes", tend to present greater challenges in learning processes Liu et al. (2021b). We observe a similar phenomenon in few-shot node classification, as depicted in Figure 2: performance is also worse on tail nodes. We therefore use the DropEdge method Rong et al. (2019) to randomly delete edges from the entire graph, which creates a larger number of tail nodes within the meta-training tasks and thereby controls task difficulty. The DropEdge ratio $\beta$ is determined by the model’s competence $c$, with $\beta=c$. We employ the following function Platanios et al. (2019); Vakil and Amiri (2023) to quantify the model’s competence:

$$c(t)=\min\left(1,\sqrt[p]{t\left(\frac{1-c_{0}^{p}}{T}\right)+c_{0}^{p}}\right) \tag{5}$$

where $t$ is the current training iteration, $p$ controls the sharpness of the curriculum so that more time is spent on the examples added later in the training, $T$ is the maximum number of iterations, and $c_{0}$ is the initial value of the competence. $p$ and $c_{0}$ are manually set hyper-parameters. The model’s competence $c$ increases over time, and the DropEdge ratio $\beta$ rises correspondingly, so the model faces progressively harder tasks. The detailed training procedure is described in lines 9-17 of Algorithm 1.
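
As a rough illustration of Eq. (5) and of how the competence value can drive edge dropping, consider the sketch below. The function names, the edge-list representation, and the hyper-parameter values ($c_{0}=0.1$, $p=2$, $T=100$) are assumptions chosen for the example, and `drop_edges` is a simplified stand-in for DropEdge rather than the original implementation.

```python
import random


def competence(t, total_steps, c0, p):
    """Eq. (5): c(t) = min(1, (t * (1 - c0^p) / T + c0^p)^(1/p))."""
    return min(1.0, (t * (1.0 - c0 ** p) / total_steps + c0 ** p) ** (1.0 / p))


def drop_edges(edges, ratio, rng):
    """Return a copy of `edges` with a fraction `ratio` removed uniformly at random."""
    n_drop = int(round(ratio * len(edges)))
    dropped = set(rng.sample(range(len(edges)), n_drop))
    return [e for i, e in enumerate(edges) if i not in dropped]


rng = random.Random(0)
edges = [(i, (i + 1) % 8) for i in range(8)]            # an 8-node ring with 8 edges
beta = competence(t=50, total_steps=100, c0=0.1, p=2)   # halfway through the stage
kept = drop_edges(edges, beta, rng)                     # the harder graph for this step
```

With these example values the competence starts at $c_{0}$ for $t=0$, grows along a square-root curve, and is clipped at 1 from $t=T$ onward.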

Algorithm 1 Detailed learning process of CPT.

Input: graph $G=(\mathcal{V},\mathcal{E},\mathcal{X})$, a meta-test task $\mathcal{T}_{test}=\{\mathcal{S},Q\}$, meta-learner $f_{\theta}$, base classes $C_{b}$, meta-training epochs $T$, the number of classes $N$, the number of labeled nodes for each class $K$, DropEdge ratio $\beta$, task-learning rate $\alpha_{1}$, meta-learning rate $\alpha_{2}$.
Output: Predicted labels of the query nodes $Q$.

1: // Meta-training process

2: # First stage: Default meta-training to obtain a base meta-learner.

3: for $t=1,2,\ldots,T$ do

4: Sample a meta-training task $\mathcal{T}_{i}=\{\mathcal{S}_{i},Q_{i}\}$ from $C_{b}$;

5: Use $f_{\theta}$ to compute node representations for the nodes in $\mathcal{S}_{i}$ and $Q_{i}$;

6: Compute the meta-training loss according to the loss function defined by the model;

7: Update the model parameters using the meta-training loss via one gradient descent step, according to Eq. (3) and (4);

8: end for

9: # Second stage: Competence-progressive training for the base meta-learner using increasingly hard tasks.

10: for $t=1,2,\ldots,T$ do

11: $c(t)\leftarrow$ competence from Eq. (5);

12: DropEdge ratio $\beta=c(t)$;

13: Generate a hard task, i.e., randomly drop a fraction $\beta$ of the edges: $G\xrightarrow{\text{DropEdge}}\widetilde{G}$;

14: Sample a meta-training task $\mathcal{T}_{i}=\{\mathcal{S}_{i},Q_{i}\}$ from $C_{b}$;

15: Use $f_{\theta}$ to compute node representations for the nodes in $\mathcal{S}_{i}$ and $Q_{i}$;

16: Compute the meta-training loss according to the loss function defined by the model;

17: Update the model parameters using the meta-training loss via one gradient descent step, according to Eq. (3) and (4);

18: end for

19: // Meta-test process

20: Use $f_{\theta}$ to compute node representations for the nodes in $\mathcal{S}$ and $Q$;

21: Predict labels for the nodes in the query set $Q$;

Notably, CPT is a curriculum-based learning strategy designed to train any model architecture using any supervised loss to acquire a better meta-learner over graphs, which only adds a few simple components.
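
The control flow of Algorithm 1 can be outlined as a thin wrapper around an existing training step. The sketch below is a skeleton under stated assumptions: `cpt_schedule` and all of its callable arguments are hypothetical placeholders, `train_step` stands for whatever a given meta-learner does in lines 4-7, and the linear competence and prefix-truncating drop function in the usage are deterministic stand-ins for Eq. (5) and DropEdge so that the example is reproducible.

```python
def cpt_schedule(edges, total_steps, train_step, competence_fn, drop_fn):
    """Run the two CPT stages; every callable here is a hypothetical placeholder."""
    # First stage (Algorithm 1, lines 2-8): default episodic training on the full graph.
    for t in range(1, total_steps + 1):
        train_step(edges)
    # Second stage (Algorithm 1, lines 9-18): train on progressively sparser graphs.
    for t in range(1, total_steps + 1):
        beta = competence_fn(t)            # competence sets the DropEdge ratio
        hard_edges = drop_fn(edges, beta)  # harder task: fewer edges, more tail nodes
        train_step(hard_edges)             # sample a task, compute loss, update


# Stand-ins so the sketch runs: the "training step" only records the edge count.
edges = [(i, i + 1) for i in range(10)]
seen = []
cpt_schedule(
    edges,
    total_steps=5,
    train_step=lambda e: seen.append(len(e)),
    competence_fn=lambda t: t / 5,  # linear stand-in, not Eq. (5)
    drop_fn=lambda e, beta: e[: len(e) - round(beta * len(e))],  # not random
)
```

Because the schedule only changes which graph is handed to the training step, the underlying model architecture and loss function do not need to be modified.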

5 Experiments

In this section, we conduct a comprehensive performance evaluation of CPT. Additionally, through experimental observations, we find that CPT can effectively mitigate the suboptimal solution problem. Through ablation experiments, we verify the effectiveness of both the motivation behind CPT and its various components.

5.1 Datasets

To evaluate our framework on few-shot node classification tasks, we conduct experiments on four prevalent real-world graph datasets: Amazon-E McAuley et al. (2015), DBLP Tang et al. (2008), Cora-full Bojchevski and Günnemann (2017), and OGBN-arxiv Hu et al. (2020). We summarize the detailed statistics of these datasets in Table 1. Specifically, Nodes and Edges denote the number of nodes and edges in the graph, respectively. Features denotes the dimension of node features. Class Split denotes the number of classes used for meta-training/validation/meta-test. These splits follow the default settings established for the respective datasets.

Table 1: Statistics of four node classification datasets.

Dataset	Nodes	Edges	Features	Class Split
Amazon-E	42,318	43,556	8,669	90/37/40
DBLP	40,672	288,270	7,202	80/27/30
Cora-full	19,793	65,311	8,710	25/20/25
OGBN-arxiv	169,343	1,166,243	128	15/5/20

5.2 Baselines

To validate the effectiveness of our proposed strategy CPT, we conduct experiments with the following baseline methods to examine performance:

• Meta-GNN Fan et al. (2019): integrates attributed networks with MAML Finn et al. (2017) using GNNs.
• AMM-GNN Wang et al. (2020): enhances MAML through an attribute matching mechanism. In particular, node embeddings are adaptively modified based on the embeddings of nodes across the entire meta-task.
• G-Meta Huang and Zitnik (2020): employs subgraphs to produce node representations, establishing it as a scalable and inductive meta-learning technique for graphs.
• TENT Wang et al. (2022): investigates the limitations of existing few-shot node classification methods through the lens of task variance and develops a novel task-adaptive framework that includes node-level, class-level, and task-level components.

5.3 Implementation Details

During training, we sample a certain number of meta-training tasks from training classes (i.e., base classes) and train the model with these meta-tasks. Then we evaluate the model based on a series of randomly sampled meta-test tasks from test classes (i.e., novel classes). For consistency, the class splitting is identical for all baseline methods. Then the final result of the average classification accuracy is obtained based on these meta-test tasks.

Our method is implemented in PyTorch (https://pytorch.org/), and all evaluations are run on a Linux server equipped with 5 NVIDIA Tesla V100 GPUs. For the specific implementation setting, we set the number of training epochs $T$ to 2000, the weight decay rate to 0.0005, the loss weight $\gamma$ to 1, and the number of graph layers to 2; the hidden size is chosen from $\left\{16,32\right\}$. The READOUT function is implemented as mean-pooling. The other hyper-parameters are determined via grid search on the validation set: the learning rate is searched in $\left\{0.03,0.01,0.005\right\}$ and the dropout in $\left\{0.2,0.5,0.8\right\}$. Furthermore, for consistency, the test tasks are identical for all baselines.

5.4 Evaluation Methodology

Following previous few-shot node classification works, we use the average classification accuracy as the evaluation metric and adopt these four few-shot node classification tasks to evaluate the performance of all algorithms: 5-way 3-shot, 5-way 5-shot, 10-way 3-shot, and 10-way 5-shot. The average classification accuracy is the mean result of five complete runs of the model. To further ensure the reliability of the experiment, the model’s test performance is repeated ten times, and the average value is taken.

5.5 Overall Evaluation Results

Table 2: Few-shot node classification performance on the four datasets.

Model	Cora-full				DBLP
Model	5-way 3-shot	5-way 5-shot	10-way 3-shot	10-way 5-shot	5-way 3-shot	5-way 5-shot	10-way 3-shot	10-way 5-shot
Meta-GNN	57.60	60.92	38.48	40.68	78.38	79.34	67.02	68.54
Meta-GNN+CPT	59.40	62.36	42.45	43.81	79.68	81.60	68.84	70.24
$\Delta$ Gain	+3.1%	+2.4%	+10.3%	+7.7%	+1.7%	+2.8%	+2.7%	+2.5%
G-META	57.93	60.30	45.67	47.76	75.49	76.38	57.96	61.18
G-META+CPT	59.13	62.16	46.45	49.02	76.37	76.75	60.06	63.64
$\Delta$ Gain	+2.1%	+3.0%	+1.7%	+2.6%	+1.2%	+0.5%	+3.6%	+4.0%
AMM-GNN	59.06	61.66	20.82	27.98	79.12	82.34	65.93	68.36
AMM-GNN+CPT	63.58	64.04	30.25	40.31	81.54	84.52	68.26	68.83
$\Delta$ Gain	+7.7%	+3.9%	+45.3%	+44.1%	+3.1%	+2.6%	+3.5%	+0.7%
TENT	64.80	69.24	51.73	56.00	75.76	79.38	67.59	69.77
TENT+CPT	67.48	70.50	52.99	58.82	81.40	82.98	70.51	70.89
$\Delta$ Gain	+4.1%	+1.8%	+2.4%	+5.1%	+7.4%	+4.5%	+4.3%	+1.6%
Averaged $\Delta$ Gain	+4.3%	+2.8%	+14.8%	+14.9%	+3.4%	+2.6%	+3.5%	+2.2%

Model	Amazon-E				OGBN-arxiv
Model	5-way 3-shot	5-way 5-shot	10-way 3-shot	10-way 5-shot	5-way 3-shot	5-way 5-shot	10-way 3-shot	10-way 5-shot
Meta-GNN	70.26	75.10	61.14	66.01	48.22	50.24	31.30	31.98
Meta-GNN+CPT	72.32	78.32	63.95	67.58	49.18	51.92	33.87	35.11
$\Delta$ Gain	+2.9%	+4.3%	+4.6%	+2.4%	+2.0%	+3.3%	+8.2%	+9.8%
G-META	73.50	77.02	61.85	62.93	44.29	48.72	41.26	46.64
G-META+CPT	74.64	78.26	62.72	67.07	47.28	51.08	42.24	48.68
$\Delta$ Gain	+1.5%	+1.6%	+1.4%	+6.6%	+6.8%	+4.8%	+2.4%	+4.3%
AMM-GNN	73.95	76.10	62.91	68.34	50.54	52.44	30.96	32.66
AMM-GNN+CPT	76.54	78.64	67.57	69.74	52.58	56.44	32.87	35.76
$\Delta$ Gain	+3.5%	+3.3%	+7.4%	+2.0%	+4.0%	+7.6%	+6.2%	+9.5%
TENT	74.26	78.12	65.66	70.49	55.62	62.96	41.13	46.02
TENT+CPT	79.32	83.98	70.25	74.35	59.24	66.56	57.64	64.50
$\Delta$ Gain	+6.8%	+7.5%	+7.0%	+5.4%	+6.5%	+4.1%	+41.8%	+40.2%
Averaged $\Delta$ Gain	+3.7%	+4.2%	+5.1%	+4.1%	+4.8%	+4.9%	+14.6%	+15.9%

We compare the few-shot node classification performance of four baseline methods and their CPT-enhanced versions in Table 2. This comprehensive comparison allows us to make the following observations:

• CPT improves the performance of all baseline methods, demonstrating that it can effectively enhance few-shot node classification models. Moreover, the average gain exceeds 3% in most cases, which is a significant improvement. The gain brought by CPT varies across models, which may be related to the structure and characteristics of each model. Among them, TENT+CPT achieves the highest performance in most cases.
• The 10-way task setting is more challenging than the 5-way setting because it involves more categories. We find that the cases with very high gains all come from 10-way tasks, such as AMM-GNN on Cora-full and TENT on OGBN-arxiv. This indicates that CPT may be more effective for more complex classification tasks.
• The 3-shot task setting is more challenging than the 5-shot setting because it has fewer samples to learn from. However, based on the experimental results, the difference in gains between 3-shot and 5-shot tasks is not significant, which suggests that the effect of CPT may not be strongly related to the number of samples.
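
The $\Delta$ Gain rows in Table 2 appear to report the relative improvement of the CPT-enhanced model over its baseline, in percent. The helper below is a small sketch of that arithmetic; the function name `relative_gain` is an assumption, and the two accuracies are taken from the Meta-GNN rows for Cora-full, 5-way 3-shot.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Meta-GNN (57.60) vs. Meta-GNN+CPT (59.40) on Cora-full, 5-way 3-shot.
gain = relative_gain(57.60, 59.40)
print(f"{gain:.1f}%")
```

Reading the gains as relative rather than absolute differences explains why low-accuracy baselines, such as AMM-GNN on the 10-way Cora-full tasks, show the largest percentages.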

5.6 Effectiveness in Mitigating Suboptimal Solutions.

To further verify whether our method can help the model break through suboptimal points, we observe the loss changes of TENT and TENT+CPT, as depicted in Figure 3. In the figure, the loss curve of TENT+CPT is represented in blue, the loss curve of TENT is represented in red, and the black vertical line represents the boundary between the first and second stages. From this observation, we can infer the following:

In the first stage, the training loss and validation loss of the model are essentially consistent, as they follow the same training setup. However, in the second stage, TENT+CPT faces more challenging tasks, causing the training loss to remain relatively high. Despite this, its validation loss gradually decreases and becomes lower than TENT’s validation loss. In contrast, TENT exhibits low training loss but high validation loss. This suggests that by applying the CPT technology, the model’s generalization ability is improved, further mitigating the suboptimal solution problem.

CPT introduces a certain level of difficulty to the training tasks, pushing the model to learn more general and robust features, which eventually leads to better performance in the validation phase. This outcome supports the idea that the CPT technology can effectively address the suboptimal solution problem and improve the model’s performance in few-shot node classification tasks.

5.7 Ablation Study

In this section, we conduct an ablation study to verify the motivation behind CPT. First, we remove the second stage from CPT and retain only the first stage, using the default meta-learning strategy to train until convergence. We refer to this variant as CPT w/o -SS. Second, we remove the first stage and keep the second stage, directly training the initial model on gradually harder tasks. This variant is referred to as CPT w/o -FS. Finally, we reverse the order of the first and second stages in CPT. In this variant, the model first learns hard tasks and then learns easy tasks, and we refer to it as CPT-reverse. The experimental results are shown in Table 3.

Table 3: Ablation study with TENT on Amazon-E.

Variant	5-way 3-shot	5-way 5-shot	10-way 3-shot	10-way 5-shot
CPT w/o -SS	74.26	78.12	65.66	70.49
CPT w/o -FS	74.78	80.61	67.23	71.64
CPT-reverse	76.09	81.14	67.86	72.26
CPT	79.32	83.98	70.25	74.35

Through the results of the ablation experiments, we have made the following findings:

• The variant CPT w/o -FS performs somewhat better than the variant CPT w/o -SS. This indicates that the competence-progressive training strategy can help the model converge more effectively than the default episodic meta-learning.
• The performance of both of these variants is significantly lower than that of CPT. To understand this, we observe their training processes and find that both variants might fall into suboptimal solutions, as depicted in Figure 4. Although the validation loss curve of the CPT w/o -FS variant is slightly better than that of the CPT w/o -SS variant, overall their curve trajectories and loss values are very similar. A possible reason for this similarity is that both variants involve random task sampling, which might lead to encountering very challenging tasks early in training. This difficulty in knowledge acquisition could consequently result in convergence towards suboptimal points. In contrast, CPT effectively tackles this problem through its two-stage training approach.
• If the model first learns difficult tasks and then learns easy tasks, i.e., CPT-reverse, the result is also worse than CPT. The experimental results therefore suggest that the order of learning is important, and that learning from easy to hard helps the model learn better.

These results indicate that the model has different levels of competence at different stages, and that learning from easy to hard can enhance its generalization ability. Furthermore, the two-stage learning strategy can effectively alleviate the suboptimal solution problem.

6 Conclusion

In this paper, we discuss a limitation of the episodic meta-learning approach: random task sampling might lead the model to converge to suboptimal solutions. To tackle this issue, we propose CPT, a novel curriculum-based training strategy for better training the GNN meta-learner. The strategy is divided into two stages: the first stage trains on simple tasks to acquire preliminary capabilities, and the second stage progressively increases task difficulty as the model’s capabilities improve. We perform thorough experiments on widely recognized datasets and methods. Our results show that CPT improves all the baseline methods, substantially increasing performance on few-shot node classification; in most cases, the average improvement is more than 3%. Notably, CPT is a general strategy that can be applied to any GNN architecture and loss function. Future work can therefore extend it to other tasks, such as few-shot link prediction and graph classification.

References

Bengio et al. [2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Jun 2009.
Bojchevski and Günnemann [2017] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. Cornell University - arXiv, Jul 2017.
Cascante-Bonilla et al. [2022] Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. Proceedings of the AAAI Conference on Artificial Intelligence, page 6912–6920, Sep 2022.
Chauhan et al. [2020] Jatin Chauhan, Deepak Nathani, and Manohar Kaul. Few-shot learning on graphs via super-classes based on graph spectral measures. arXiv: Learning, Feb 2020.
Dai et al. [2019] Dengxin Dai, Christos Sakaridis, Simon Hecker, and Luc Van Gool. Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. Cornell University - arXiv, Jan 2019.
Ding et al. [2020] Kaize Ding, Jianling Wang, Jundong Li, Kai Shu, Chenghao Liu, and Huan Liu. Graph prototypical networks for few-shot learning on attributed networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Oct 2020.
Fan et al. [2019] Zhou Fan, Cao Chengtai, Zhang Kunpeng, Trajcevski Goce, Zhong Ting, and Geng Ji. Meta-gnn on few-shot node classification in graph meta-learning. ACM Proceedings, Jan 2019.
Feng et al. [2020] Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, ** Shi, and Lizhuang Ma. Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum. ArXiv, abs/2004.08514, 2020.
Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv: Learning, Mar 2017.
Gu et al. [2021] Yaowen Gu, Si Zheng, and Jiao Li. Currmg: A curriculum learning approach for graph based molecular property prediction. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2021.
Guo et al. [2020] Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. Proceedings of the AAAI Conference on Artificial Intelligence, page 7839–7846, Jun 2020.
Hacohen and Weinshall [2019] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. Cornell University - arXiv, Apr 2019.
Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Cornell University - arXiv, May 2020.
Hu et al. [2022] Weihua Hu, Kaidi Cao, Kexin Huang, Edward W Huang, Karthik Subbian, and Jure Leskovec. Tuneup: A training strategy for improving generalization of graph neural networks. arXiv preprint arXiv:2210.14843, 2022.
Huang and Zitnik [2020] Kexin Huang and Marinka Zitnik. Graph meta learning via local subgraphs. Cornell University - arXiv, Jan 2020.
Jiang et al. [2014] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander G. Hauptmann. Self-paced learning with diversity. Neural Information Processing Systems, Dec 2014.
Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Li et al. [2023] Haoyang Li, Xin Wang, and Wenwu Zhu. Curriculum graph machine learning: A survey. arXiv preprint arXiv:2302.02926, Feb 2023.
Liu et al. [2019] Lu Liu, Tianyi Zhou, Guodong Long, **g Jiang, and Chengqi Zhang. Learning to propagate for graph meta-learning. Cornell University - arXiv, Sep 2019.
Liu et al. [2020] Chenghao Liu, Zhihao Wang, Doyen Sahoo, Yuan Fang, Kun Zhang, and Steven CH Hoi. Adaptive task sampling for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 752–769. Springer, 2020.
Liu et al. [2021a] Zemin Liu, Yuan Fang, Chenghao Liu, and Steven C.H. Hoi. Relative and absolute location embedding for few-shot node classification on graph. Proceedings of the … AAAI Conference on Artificial Intelligence, May 2021.
Liu et al. [2021b] Zemin Liu, Trung-Kien Nguyen, and Yuan Fang. Tail-gnn: Tail-node graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1109–1119, 2021.
Ma et al. [2020] Ning Ma, Jiajun Bu, Jieyu Yang, Zhen Zhang, Chengwei Yao, Zhi Yu, Sheng Zhou, and Xifeng Yan. Adaptive-step graph meta-learner for few-shot graph classification. Cornell University - arXiv, Mar 2020.
McAuley et al. [2015] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. arXiv: Social and Information Networks, Jun 2015.
Mishra et al. [2017] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. Learning, Jul 2017.
Platanios et al. [2019] Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabás Póczos, and Tom M. Mitchell. Competence-based curriculum learning for neural machine translation. Cornell University - arXiv, Mar 2019.
Qu et al. [2018] Meng Qu, Jian Tang, and Jiawei Han. Curriculum learning for heterogeneous star network embedding via deep reinforcement learning. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Feb 2018.
Ravi and Larochelle [2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. International Conference on Learning Representations, Apr 2017.
Rong et al. [2019] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. Learning, Jul 2019.
Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. Neural Information Processing Systems, Mar 2017.
Sun et al. [2019] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
Sung et al. [2018] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
Szklarczyk et al. [2019] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research, 47(D1):D607–D613, 2019.
Tang et al. [2008] Jie Tang, **g Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2008.
Vakil and Amiri [2022] Nidhi Vakil and Hadi Amiri. Generic and trend-aware curriculum learning for relation extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2202–2213, 2022.
Vakil and Amiri [2023] Nidhi Vakil and Hadi Amiri. Curriculum learning for graph neural networks: A multiview competence-based approach. arXiv preprint arXiv:2307.08859, 2023.
Wang et al. [2020] Ning Wang, Minnan Luo, Kaize Ding, Lingling Zhang, Jundong Li, and Qinghua Zheng. Graph few-shot learning with attribute matching. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Oct 2020.
Wang et al. [2022] Song Wang, Kaize Ding, Chuxu Zhang, Chen Chen, and Jundong Li. Task-adaptive few-shot node classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1910–1919, 2022.
Wang et al. [2023] Hui Wang, Kun Zhou, Xin Zhao, **gyuan Wang, and Ji-Rong Wen. Curriculum pre-training heterogeneous subgraph transformer for top-n recommendation. ACM Transactions on Information Systems, page 1–28, Jan 2023.
Wei et al. [2023] Xiaowen Wei, Xiuwen Gong, Yibing Zhan, Bo Du, Yong Luo, and Wenbin Hu. Clnode: Curriculum learning for node classification. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 670–678, 2023.
Wu et al. [2021] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, page 4–24, Jan 2021.
Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
Yan et al. [2021] Qilong Yan, Yufeng Zhang, Qiang Liu, Shu Wu, and Liang Wang. Relation-aware heterogeneous graph for user profiling. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3573–3577, 2021.
Zhang et al. [2020] Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. Every document owns its structure: Inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826, 2020.
Zhou et al. [2020] Tianyi Zhou, Shengjie Wang, and Jeff A. Bilmes. Curriculum learning by dynamic instance hardness. Neural Information Processing Systems, Jan 2020.