License: CC BY 4.0
arXiv:2402.00450v2 [cs.LG] 23 Feb 2024

CPT: Competence-progressive Training Strategy for Few-shot Node Classification

Qilong Yan$^{1}$, Yufeng Zhang$^{3}$, **ghao Zhang$^{3}$, **gpu Duan$^{2}$, Jian Yin$^{1}$
Abstract

Graph Neural Networks (GNNs) have made significant advancements in node classification, but their success relies on sufficient labeled nodes per class in the training data. Real-world graph data often exhibits a long-tail distribution with sparse labels, emphasizing the importance of GNNs’ ability in few-shot node classification, which entails categorizing nodes with limited data. Traditional episodic meta-learning approaches have shown promise in this domain, but they face an inherent limitation: it might lead the model to converge to suboptimal solutions because of random and uniform task assignment, ignoring task difficulty levels. This could lead the meta-learner to face complex tasks too soon, hindering proper learning. Ideally, the meta-learner should start with simple concepts and advance to more complex ones, like human learning. So, we introduce CPT, a novel two-stage curriculum learning method that aligns task difficulty with the meta-learner’s progressive competence, enhancing overall performance. Specifically, in CPT’s initial stage, the focus is on simpler tasks, fostering foundational skills for engaging with complex tasks later. Importantly, the second stage dynamically adjusts task difficulty based on the meta-learner’s growing competence, aiming for optimal knowledge acquisition. Extensive experiments on popular node classification datasets demonstrate significant improvements of our strategy over existing methods.

1 Introduction

Node classification, the task of predicting labels for unlabeled nodes in a graph, has attracted considerable research in recent years. It has many practical applications in real-world situations Yan et al. (2021); Zhang et al. (2020). For example, in bioinformatics Szklarczyk et al. (2019), predicting chemical properties for proteins in a protein network is an important problem that can be tackled using node classification. State-of-the-art methods for node classification often use Graph Neural Networks (GNNs) Wu et al. (2021); Xu et al. (2018) in a semi-supervised manner Kipf and Welling (2016). GNNs learn a representation for each node by considering information from its neighbors, and these representations are then used for classification. However, these methods usually require a sufficient number of labeled nodes for all classes to achieve good results Fan et al. (2019). In practice, we often have many labeled nodes for some classes but only a few for others. We refer to classes with many labels as base classes and classes with few labels as novel classes Wang et al. (2022). Since novel classes are common in real-world graphs, GNNs must be able to classify nodes from only a few labeled examples, which is known as the few-shot node classification problem.

Figure 1: Tasks influence a model’s intelligence differently depending on its growth stage, mirroring human learning patterns.

Most existing studies on few-shot node classification use the episodic meta-learning strategy Wang et al. (2022, 2020); Liu et al. (2021a). In this paradigm, the meta-learner learns from base classes through a series of meta-training tasks and is evaluated on meta-test tasks derived from novel classes. The aim is to let the meta-learner extract meta-knowledge from base classes, enabling it to classify using just a few labeled nodes from the novel classes. Despite the success of existing methods, the episodic training strategy introduces some issues. First and foremost, episodic training selects tasks from base classes at random, neglecting the varied difficulty levels of these tasks. This differs significantly from the natural human learning process, where we typically begin with basic concepts and progressively advance to more complex ones. As a result, the meta-learner risks being exposed to challenging tasks too early and simpler tasks later, potentially leading to less effective knowledge acquisition. As shown in Figure 1, when the model is first initialized (akin to a toddler phase), it is not yet capable of handling difficult tasks, but it can acquire knowledge from easy tasks. At this stage, easy tasks are informative and beneficial for the model. After sufficient training, the model develops the ability to tackle more complex tasks, resembling a teenager phase. By then, easy tasks become uninformative for it, and it is the hard tasks that truly improve it. Inspired by the human learning process, the idea of learning from easy to hard has been effectively implemented in curriculum learning Hacohen and Weinshall (2019) and self-paced learning Jiang et al. (2014). In these frameworks, individual samples are given different levels of importance (i.e., weights) to enhance generalization. Although curriculum learning has been studied for few-shot tasks Sun et al. (2019); Liu et al. (2020), these methods are not directly applicable to graph data, which has its own complex relational dependency patterns.

In this paper, we propose a Competence-Progressive Training strategy for few-shot node classification, named CPT. CPT provides tasks of different difficulty levels to the meta-learner at different stages, matching its current abilities and alleviating convergence to suboptimal points. Specifically, CPT uses a two-stage strategy Bengio et al. (2009); Hu et al. (2022) to train a meta-learner. In the first stage, CPT follows previous research by randomly sampling nodes to construct meta-tasks and then employs the default training strategy (episodic learning), i.e., simply minimizing the given supervised loss, to produce a strong base meta-learner. Although the base meta-learner may perform well on easy tasks and has gained some knowledge, it no longer improves from these tasks and might converge to suboptimal points. The second stage therefore generates more difficult tasks and fine-tunes the meta-learner on them. Specifically, CPT increases the difficulty of each task by dropping edges of its graph. The motivation is to create a larger number of 'tail nodes' within the meta-training tasks. This enriches the meta-learner with additional 'tail node' meta-knowledge, since these nodes, with their fewer connections, are typically harder to learn. Finally, CPT fine-tunes the base meta-learner by minimizing the loss over the harder tasks. CPT can be easily implemented in the existing meta-learner training pipeline on graphs, as demonstrated in Algorithm 1. Additionally, CPT is compatible with any GNN-based meta-learner and can be employed with any desired supervised loss function.

Our main contributions can be summarized as follows:

  • Problem. We explore the limitations of the episodic meta-learning scheme and discuss its associated issues in few-shot node classification. Specifically, the limitation of the episodic meta-learning scheme is that it samples tasks randomly and uniformly, which can lead the meta-learner to converge to a suboptimal point and thus make further optimization more challenging.

  • Method. We present CPT, a novel two-stage curriculum learning strategy for few-shot node classification. CPT is the first method to solve such problems in few-shot tasks on graphs. Notably, this is a general strategy that can be applied to any GNN architecture and loss function.

  • Experiments. We conduct experiments on four benchmark node classification datasets under the few-shot setting and demonstrate significant improvements of our strategy over existing methods.

2 Related Work

In this section, we briefly review some existing few-shot learning methods and curriculum learning methods.

2.1 Few-shot Learning on Graphs

Few-shot Learning (FSL) aims to apply knowledge from tasks with abundant supervision to new tasks with limited labeled data. FSL methods generally fall into two categories: metric-based approaches Sung et al. (2018); Liu et al. (2019), like Prototypical Networks Snell et al. (2017), which measure similarity between new instances and support examples, and meta-optimizer-based approaches Ravi and Larochelle (2017); Mishra et al. (2017), like MAML Finn et al. (2017), focusing on optimizing model parameters with limited examples. In the field of graphs, there have been several recent works proposing the application of few-shot learning to graph-based tasks Ma et al. (2020); Chauhan et al. (2020), and they have achieved notable success, particularly in the context of attributed networks. Among these works, GPN Ding et al. (2020) introduces the utilization of node importance derived from Prototypical Networks Snell et al. (2017), leading to improved performance. On the other hand, G-Meta Huang and Zitnik (2020) employs local subgraphs to learn node representations and enhances model generalization through meta-learning. TENT Wang et al. (2022) leverages both local subgraphs and prototypes to adapt to tasks, effectively addressing the task-variance issue. It considers three perspectives: node-level, class-level, and task-level.

Although there has been some progress in few-shot learning on graphs, existing methods all share a common issue: they sample tasks at random without considering the distribution of task difficulty, which can lead to inefficient knowledge acquisition by the meta-learner. To address this, we propose a task scheduler based on curriculum learning that progresses from easy to difficult tasks, further enhancing the model's knowledge acquisition efficiency.

2.2 Curriculum Learning

Inspired by the human learning process, curriculum learning (CL) aims to adopt a meaningful learning sequence, such as from easy to hard patterns Bengio et al. (2009). As a general and flexible plug-in, the CL strategy has proven its ability to improve model performance, generalization, robustness, and convergence in various scenarios, including image classification Cascante-Bonilla et al. (2022); Zhou et al. (2020), semantic segmentation Dai et al. (2019); Feng et al. (2020), and neural machine translation Guo et al. (2020); Platanios et al. (2019). Existing CL methods fall into two groups Li et al. (2023): (1) predefined CL Wei et al. (2023); Hu et al. (2022); Wang et al. (2023), which manually designs heuristic-based policies to determine the training order of data samples, and (2) automatic CL Qu et al. (2018); Vakil and Amiri (2022); Gu et al. (2021), which relies on computable metrics (e.g., the training loss) to dynamically design the curriculum for model training. CL for graph data is an emerging area of research. Recent studies have demonstrated its effectiveness in various graph-related tasks such as node classification Wei et al. (2023); Qu et al. (2018), link prediction Hu et al. (2022); Vakil and Amiri (2022), and graph classification Wang et al. (2023); Gu et al. (2021). However, applying CL methods to few-shot node classification still faces significant hurdles.

To overcome this, our approach introduces a pioneering training strategy that merges curriculum learning with few-shot learning on graphs, significantly enhancing the model’s performance in few-shot node classification scenarios.

3 Preliminaries

3.1 Problem Statement

Formally, let $G=(\mathcal{V},\mathcal{E},\mathcal{X})$ denote an attributed graph, where $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ is the set of edges, and $\mathcal{X}\in\mathbb{R}^{|\mathcal{V}|\times d}$ is the feature matrix of nodes with $d$ denoting the feature dimension. Moreover, we denote the entire set of node classes as $C$, which can be further divided into two categories: $C_{b}$ and $C_{n}$, where $C=C_{b}\cup C_{n}$ and $C_{b}\cap C_{n}=\emptyset$. Here $C_{b}$ and $C_{n}$ denote the sets of base and novel classes, respectively. It is worth mentioning that the number of labeled nodes in $C_{b}$ is sufficient, while it is typically small in $C_{n}$. Then we can formulate the studied problem of few-shot node classification as follows:

Few-shot Node Classification: Given an attributed graph $G=(\mathcal{V},\mathcal{E},\mathcal{X})$, our goal is to develop a machine learning model such that after training on labeled nodes in $C_{b}$, the model can accurately predict labels for the nodes (i.e., query set $Q$) in $C_{n}$ with only a limited number of labeled nodes (i.e., support set $S$).

More specifically, if the support set $S$ contains exactly $K$ nodes for each of $N$ classes from $C_{n}$, and the query set $Q$ is sampled from these $N$ classes, the problem is called $N$-way $K$-shot node classification. Essentially, the objective of few-shot node classification is to learn a classifier that can be quickly adapted to $C_{n}$ with only limited labeled nodes. Thus, the crucial part is to learn transferable knowledge from $C_{b}$ and generalize it to $C_{n}$.

3.2 Episodic Learning

The meta-training and meta-test processes are conducted on a certain number of meta-training tasks and meta-test tasks, respectively. These meta-tasks share a similar structure, except that meta-training tasks are sampled from $C_{b}$, while meta-test tasks are sampled from $C_{n}$. The main idea of few-shot node classification is to keep the consistency between meta-training and meta-test to improve the generalization performance.

To construct a meta-training (or meta-test) task $\mathcal{T}_{t}$, we first randomly sample $N$ classes from $C_{b}$ (or $C_{n}$). Then we randomly sample $K$ nodes from each of the $N$ classes (i.e., $N$-way $K$-shot) to establish the support set $\mathcal{S}_{t}$. Similarly, the query set $Q_{t}$ consists of $Q$ different nodes (distinct from $\mathcal{S}_{t}$) from the same $N$ classes. The components of the sampled meta-task $\mathcal{T}_{t}$ can be denoted as follows:

$$
\begin{aligned}
\mathcal{S}_{t} &= \{(v_{1},y_{1}),(v_{2},y_{2}),\ldots,(v_{N\times K},y_{N\times K})\},\\
Q_{t} &= \{(q_{1},y^{\prime}_{1}),(q_{2},y^{\prime}_{2}),\ldots,(q_{Q},y^{\prime}_{Q})\},\\
\mathcal{T}_{t} &= \{\mathcal{S}_{t},Q_{t}\},
\end{aligned}
$$

where $v_{i}$ (or $q_{i}$) is a node in $\mathcal{V}$, and $y_{i}$ (or $y^{\prime}_{i}$) is the corresponding label. In this way, the whole training process is conducted on a set of $T$ meta-training tasks $\mathcal{T}_{train}=\{\mathcal{T}_{t}\}_{t=1}^{T}$. After training, the model has learned the transferable knowledge from $\mathcal{T}_{train}$ and will generalize it to meta-test tasks $\mathcal{T}_{test}=\{\mathcal{T}^{\prime}_{t}\}_{t=1}^{T_{test}}$ sampled from $C_{n}$.
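The task construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function `sample_episode`, its parameter names, the per-class query count `q_query`, and the toy label dictionary are all assumptions introduced here.

```python
import random
from collections import defaultdict


def sample_episode(labels, classes, n_way, k_shot, q_query, rng):
    """Sample one N-way K-shot task from the given class pool.

    labels  : dict mapping node id -> class label
    classes : class pool to draw from (C_b for meta-training, C_n for meta-test)
    q_query : query nodes drawn per class (an assumption; the text only
              requires query nodes to be distinct from the support set)
    """
    nodes_by_class = defaultdict(list)
    for node, label in labels.items():
        if label in classes:
            nodes_by_class[label].append(node)

    support, query = [], []
    for cls in rng.sample(sorted(classes), n_way):
        # Draw support and query nodes together so the two sets cannot overlap.
        picked = rng.sample(nodes_by_class[cls], k_shot + q_query)
        support += [(v, cls) for v in picked[:k_shot]]
        query += [(v, cls) for v in picked[k_shot:]]
    return support, query


# Toy labels: 6 classes with 10 nodes each; classes 0-3 act as base classes.
labels = {node: node % 6 for node in range(60)}
support, query = sample_episode(labels, {0, 1, 2, 3}, n_way=2, k_shot=3,
                                q_query=4, rng=random.Random(0))
```

Passing the novel classes instead of the base classes as `classes` would yield a meta-test task with the same structure, which is the consistency between meta-training and meta-test that the episodic scheme relies on.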

Figure 2: Degree-specific generalization performance of the few-shot node classification task. The x-axis represents the node degrees in the training graph and the y-axis is the generalization performance averaged over nodes with the specific degrees.

4 Methodology

In this section, we introduce CPT which uses a curriculum learning strategy to better train a Graph Neural Network meta-learner (GNN meta-learner) in a few-shot node classification problem. In general, CPT applies the two-stage strategy to train a GNN meta-learner. The first stage aims to produce a strong base GNN meta-learner that performs well on easy tasks. In the first stage, the meta-learner has acquired knowledge of easy tasks and gained some preliminary abilities, enabling it to tackle more challenging tasks. Then, in the second stage, the meta-learner focuses on increasing the difficulty of tasks, progressively learning more complex tasks based on its growing competencies. This strategy allows the meta-learner to gradually adapt and master increasingly challenging tasks.

4.1 First stage: Default meta-training.

In the first stage, we aim to obtain a base meta-learner $f_{\theta}$ that has some preliminary abilities, enabling it to tackle more challenging tasks in the second stage. Specifically, we use the episodic meta-learning scheme to converge the GNN meta-learner on foundational tasks. These foundational tasks are considered easy because the tasks in the second stage are designed to increase in difficulty based on them. Next, we calculate the loss based on the model's loss function. For the sake of brevity, we employ the cross-entropy loss as the loss function for the model when considering the base classes $C_{b}$. Formally, this can be expressed as follows:

$$Z_{i}=\mathrm{softmax}\left(f_{\theta}(h_{i})\right) \tag{1}$$
$$\mathcal{L}=-\sum_{i=1}^{Q}\sum_{j=1}^{|C_{b}|}Y_{i,j}\log(Z_{i,j}) \tag{2}$$

where $Z_{i}$ is the probability that the $i$-th query node in $Q$ belongs to each class in $C_{b}$. $Y_{i,j}=1$ if the $i$-th node belongs to the $j$-th class, and $Y_{i,j}=0$ otherwise. $Z_{i,j}$ is the $j$-th element in $Z_{i}$. Then we update meta-learner parameters $\theta$ via one gradient descent step in task $\mathcal{T}_{i}$:

$$\theta^{\prime}=\theta-\alpha_{1}\frac{\partial\mathcal{L}_{\mathcal{T}_{i}}(f_{\theta})}{\partial\theta} \tag{3}$$

where $\alpha_{1}$ is the task-learning rate and the model parameters are trained to optimize the performance of $f_{\theta^{\prime}}$ across meta-training tasks. The model parameters $\theta$ are updated as follows:

$$\theta=\theta-\alpha_{2}\frac{\partial\sum_{\mathcal{T}_{i}\sim p(\mathcal{T}_{train})}\mathcal{L}_{\mathcal{T}_{i}}(f_{\theta^{\prime}_{i}})}{\partial\theta} \tag{4}$$

where $\alpha_{2}$ is the meta-learning rate and $p(\mathcal{T}_{train})$ is the distribution of meta-training tasks. The detailed training procedure is described in lines 2-8 of Algorithm 1.
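To make the two-level update of Eq. (3) and Eq. (4) concrete, the sketch below applies it to a deliberately simple scalar problem. Everything here is an assumption chosen for illustration: each toy task has the quadratic loss $(\theta-a)^{2}$ with a task-specific target $a$ in place of the cross-entropy loss of a GNN, and the names `task_loss_grad` and `meta_step` are hypothetical. The gradients are written out analytically so that the chain rule through the inner step is visible.

```python
def task_loss_grad(theta, target):
    """Gradient of the toy task loss L(theta) = (theta - target)^2."""
    return 2.0 * (theta - target)


def meta_step(theta, task_targets, alpha1, alpha2):
    """One meta-update over a batch of toy tasks, following Eq. (3) and Eq. (4)."""
    meta_grad = 0.0
    for target in task_targets:
        # Eq. (3): task-specific adaptation with one gradient descent step.
        theta_prime = theta - alpha1 * task_loss_grad(theta, target)
        # Eq. (4): differentiate L(theta') with respect to theta via the chain
        # rule; for this quadratic loss, d(theta')/d(theta) = 1 - 2 * alpha1.
        meta_grad += task_loss_grad(theta_prime, target) * (1.0 - 2.0 * alpha1)
    return theta - alpha2 * meta_grad


theta = 0.0
for _ in range(200):
    theta = meta_step(theta, task_targets=[1.0, 3.0], alpha1=0.1, alpha2=0.05)
```

With two tasks whose optima are 1 and 3, the meta-parameter settles midway between them, the initialization from which one inner step helps both tasks equally. In a real GNN meta-learner the analytic derivative would be replaced by automatic differentiation.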

4.2 Second stage: Competence-progressive training.

After the first stage, although the meta-learner performs well on foundational tasks, it is prone to converging to suboptimal points. To address this, we generate increasingly hard tasks to help the meta-learner escape them. There are many possible ways of controlling the difficulty of meta-tasks in graph data. In our approach, we use node degree, the number of immediate neighbors of a node in a graph, as the key factor to modulate task difficulty. There is an implicit correlation between the degree of nodes and their performance. Specifically, nodes with fewer connections, termed 'tail nodes', tend to present greater challenges in learning Liu et al. (2021b). We observe a similar phenomenon in few-shot node classification, as depicted in Figure 2, where performance is also lower on tail nodes. We therefore use DropEdge Rong et al. (2019) to randomly delete edges from the entire graph, creating a larger number of 'tail nodes' within the meta-training tasks and thereby controlling task difficulty. The DropEdge ratio $\beta$ is determined by the model's competence $c$, with $\beta=c$. We employ the following function Platanios et al. (2019); Vakil and Amiri (2023) to quantify the model's competence:

$$c(t)=\min\left(1,\sqrt[p]{t\left(\frac{1-c_{0}^{p}}{T}\right)+c_{0}^{p}}\right) \tag{5}$$

where $t$ is the current training iteration, $p$ controls the sharpness of the curriculum so that more time is spent on the examples added later in the training, $T$ is the maximum number of iterations, and $c_{0}$ is the initial value of the competence. $p$ and $c_{0}$ are manually controlled hyper-parameters. The model's competence $c$ increases over time, and the DropEdge ratio $\beta$ rises correspondingly, leading the model to face harder tasks. The detailed training procedure is described in lines 9-17 of Algorithm 1.
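The competence schedule of Eq. (5) is a one-line function. The sketch below evaluates it at a few iterations; the default values `c0=0.1` and `p=2.0` and the 100-iteration horizon are assumptions picked for illustration, not settings reported in the paper.

```python
def competence(t, total_steps, c0=0.1, p=2.0):
    """Eq. (5): competence at iteration t, growing from c0 to 1 over total_steps."""
    return min(1.0, (t * (1.0 - c0 ** p) / total_steps + c0 ** p) ** (1.0 / p))


# DropEdge ratio schedule (beta = c) at a few points of a 100-iteration run.
schedule = [competence(t, total_steps=100) for t in (0, 25, 50, 100)]
```

At $t=0$ the expression reduces to $c_{0}$, and at $t=T$ it reaches 1, after which the `min` clamps it. With $p>1$ the curve rises quickly at first and flattens later, so a larger share of the iterations is spent at high drop ratios.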

Algorithm 1 Detailed learning process of CPT.

Input: graph $G=(\mathcal{V},\mathcal{E},\mathcal{X})$, a meta-test task $\mathcal{T}_{test}=\{\mathcal{S},Q\}$, meta-learner $f_{\theta}$, base classes $C_{b}$, meta-training epochs $T$, the number of classes $N$, the number of labeled nodes for each class $K$, DropEdge ratio $\beta$, task-learning rate $\alpha_{1}$, meta-learning rate $\alpha_{2}$.
Output: Predicted labels of the query nodes $Q$

1:  // Meta-training process
2:  # First stage: Default meta-training to obtain a base meta-learner.
3:  for $t=1,2,\ldots,T$ do
4:     Sample a meta-training task $\mathcal{T}_{i}=\{\mathcal{S}_{i},Q_{i}\}$ from $C_{b}$;
5:     Use $f_{\theta}$ to compute node representations for the nodes in $\mathcal{S}_{i}$ and $Q_{i}$;
6:     Compute the meta-training loss according to the loss function defined by the model;
7:     Update the model parameters using the meta-training loss via one gradient descent step, according to Eq. (3) and (4);
8:  end for
9:  # Second stage: Competence-progressive training for the base meta-learner using increasingly hard tasks.
10:  for $t=1,2,\ldots,T$ do
11:     $c(t)\leftarrow$ competence from Eq. (5);
12:     DropEdge ratio $\beta=c(t)$;
13:     Generate a hard task, i.e., randomly drop a fraction $\beta$ of the edges: $G\stackrel{DropEdge}{\longrightarrow}\widetilde{G}$;
14:     Sample a meta-training task $\mathcal{T}_{i}=\{\mathcal{S}_{i},Q_{i}\}$ from $C_{b}$;
15:     Use $f_{\theta}$ to compute node representations for the nodes in $\mathcal{S}_{i}$ and $Q_{i}$;
16:     Compute the meta-training loss according to the loss function defined by the model;
17:     Update the model parameters using the meta-training loss via one gradient descent step, according to Eq. (3) and (4);
18:  end for
19:  // Meta-test process
20:  Use $f_{\theta}$ to compute node representations for the nodes in $\mathcal{S}$ and $Q$;
21:  Predict labels for the nodes in the query set $Q$;

Notably, CPT is a curriculum-based training strategy that can be applied to any model architecture and any supervised loss to obtain a better meta-learner over graphs, and it adds only a few simple components.
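The two-stage schedule in the algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear `competence` function and its parameters `c0` and `c_max` are hypothetical stand-ins for Eq. (6), `drop_edge` is a plain-Python substitute for DropEdge, and the model update of each step is left out.

```python
import random


def competence(t, total_steps, c0=0.0, c_max=0.5):
    """Hypothetical linear competence schedule rising from c0 to c_max.

    Stand-in for Eq. (6); the exact form used by CPT may differ.
    """
    return min(c_max, c0 + (c_max - c0) * t / total_steps)


def drop_edge(edges, ratio, rng):
    """Return a random subset of `edges` with a `ratio` fraction removed."""
    keep = len(edges) - int(round(ratio * len(edges)))
    return rng.sample(edges, keep)


def two_stage_plan(edges, total_steps, seed=0):
    """Yield (stage, step, graph_edges) in the order the algorithm visits them.

    Stage 1 trains on the original graph; stage 2 trains on a graph whose
    DropEdge ratio follows the competence schedule. Task sampling and the
    gradient update would happen where each tuple is consumed.
    """
    rng = random.Random(seed)
    for t in range(1, total_steps + 1):
        yield 1, t, edges
    for t in range(1, total_steps + 1):
        beta = competence(t, total_steps)
        yield 2, t, drop_edge(edges, beta, rng)


if __name__ == "__main__":
    toy_edges = [(i, i + 1) for i in range(10)]
    for stage, step, graph in two_stage_plan(toy_edges, total_steps=4):
        print(stage, step, len(graph))
```

In this sketch the second stage sees progressively sparser graphs, which is one way to realise "increasingly hard tasks"; the cap `c_max` is an illustrative choice so that the toy graph never loses all of its edges.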

5 EXPERIMENTS

In this section, we conduct a comprehensive performance evaluation of CPT. Additionally, through experimental observations, we find that CPT can effectively mitigate the suboptimal solution problem. Through ablation experiments, we verify the effectiveness of both the motivation behind CPT and its various components.

5.1 Datasets

To evaluate our framework on few-shot node classification tasks, we conduct experiments on four prevalent real-world graph datasets: Amazon-E McAuley et al. (2015), DBLP Tang et al. (2008), Cora-full Bojchevski and Günnemann (2017), and OGBN-arxiv Hu et al. (2020). We summarize the detailed statistics of these datasets in Table 1. Specifically, Nodes and Edges denote the number of nodes and edges in the graph, respectively. Features denotes the dimension of node features. Class Split denotes the number of classes used for meta-training/validation/meta-test. This split follows the default setting established for each dataset.

Table 1: Statistics of four node classification datasets.
| Dataset | Nodes | Edges | Features | Class Split (meta-training/validation/meta-test) |
| Amazon-E | 42,318 | 43,556 | 8,669 | 90/37/40 |
| DBLP | 40,672 | 288,270 | 7,202 | 80/27/30 |
| Cora-full | 19,793 | 65,311 | 8,710 | 25/20/25 |
| OGBN-arxiv | 169,343 | 1,166,243 | 128 | 15/5/20 |

5.2 Baselines

To validate the effectiveness of our proposed strategy CPT, we conduct experiments with the following baseline methods to examine performance:

  • Meta-GNN Fan et al. (2019): integrates attributed networks with MAML Finn et al. (2017) using GNNs.

  • AMM-GNN Wang et al. (2020): AMM-GNN suggests enhancing MAML through an attribute matching mechanism. In particular, node embeddings are adaptively modified based on the embeddings of nodes across the entire meta-task.

  • G-Meta Huang and Zitnik (2020): G-Meta employs subgraphs to produce node representations, establishing it as a scalable and inductive meta-learning technique for graphs.

  • TENT Wang et al. (2022): TENT investigates the limitations of existing few-shot node classification methods through the lens of task variance and develops a novel task-adaptive framework that includes node-level, class-level, and task-level adaptations.

5.3 Implementation Details

During training, we sample a certain number of meta-training tasks from the training classes (i.e., base classes) and train the model with these meta-tasks. We then evaluate the model on a series of randomly sampled meta-test tasks from the test classes (i.e., novel classes). For consistency, the class split is identical for all baseline methods. The final result is the average classification accuracy over these meta-test tasks.
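As a rough illustration of how such meta-tasks can be drawn, the sketch below samples one N-way K-shot task from a set of allowed classes. It is an assumption-laden example rather than the sampling code used in the experiments: the function name `sample_task`, the label dictionary layout, and the number of query nodes per class (`q_query`) are all hypothetical.

```python
import random
from collections import defaultdict


def sample_task(labels, classes, n_way, k_shot, q_query, rng):
    """Sample one N-way K-shot task as (support, query) lists of node ids.

    labels  : dict mapping node id -> class id
    classes : set of class ids allowed for this split (base or novel)
    q_query : query nodes per class (hypothetical; not specified in the text)
    """
    # Group the labeled nodes of the allowed classes by class.
    by_class = defaultdict(list)
    for node, cls in labels.items():
        if cls in classes:
            by_class[cls].append(node)

    # Only classes with enough nodes for support plus query are usable.
    eligible = [c for c in by_class if len(by_class[c]) >= k_shot + q_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + q_query nodes")

    support, query = [], []
    for cls in rng.sample(eligible, n_way):
        nodes = rng.sample(by_class[cls], k_shot + q_query)
        support.extend(nodes[:k_shot])   # K labeled nodes per class
        query.extend(nodes[k_shot:])     # disjoint nodes to be classified
    return support, query


if __name__ == "__main__":
    toy_labels = {i: i % 6 for i in range(60)}
    s, q = sample_task(toy_labels, {0, 1, 2, 3, 4}, 5, 3, 2, random.Random(0))
    print(len(s), len(q))
```

Calling the same function with the novel classes instead of the base classes would give meta-test tasks, which keeps the training and test class sets disjoint as described above.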

Our method is implemented in PyTorch (https://pytorch.org/), and all evaluations are run on a Linux server equipped with 5 NVIDIA Tesla V100 GPUs. We set the number of training epochs $T$ to 2000, the weight decay to 0.0005, the loss weight $\gamma$ to 1, and the number of graph layers to 2; the hidden size is chosen from $\{16, 32\}$. The READOUT function is implemented as mean-pooling. The remaining hyper-parameters are determined via grid search on the validation set: the learning rate is searched in $\{0.03, 0.01, 0.005\}$ and the dropout rate in $\{0.2, 0.5, 0.8\}$. Furthermore, for consistency, the test tasks are identical for all baselines.
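The search space described above is small enough to enumerate directly. The following sketch lists the nine learning-rate/dropout combinations and picks the one with the best validation score; the dictionary keys and the `validate` callback are hypothetical names, and treating the hidden size as a fixed per-run choice rather than a searched dimension is an assumption.

```python
from itertools import product

# Settings stated as fixed in the text (hidden size is 16 or 32 per run).
FIXED = {"epochs": 2000, "weight_decay": 0.0005, "loss_weight": 1,
         "num_layers": 2, "hidden_size": 16}

# Settings searched on the validation set.
GRID = {"lr": [0.03, 0.01, 0.005], "dropout": [0.2, 0.5, 0.8]}


def configurations(fixed, grid):
    """Yield one full configuration per point of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}


def grid_search(fixed, grid, validate):
    """Return the configuration with the highest validation score.

    `validate` is a user-supplied function mapping a configuration to a
    validation accuracy (hypothetical; training code is not shown here).
    """
    return max(configurations(fixed, grid), key=validate)


if __name__ == "__main__":
    print(sum(1 for _ in configurations(FIXED, GRID)))
```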

5.4 Evaluation Methodology

Following previous few-shot node classification work, we use the average classification accuracy as the evaluation metric and evaluate all algorithms on four few-shot node classification settings: 5-way 3-shot, 5-way 5-shot, 10-way 3-shot, and 10-way 5-shot. The reported accuracy is the mean over five complete runs of the model. To further ensure reliability, the test evaluation of the model is repeated ten times and the results are averaged.
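One plausible reading of this protocol is a nested average: each of the five runs is tested ten times, and the reported number is the mean of the per-run means. The sketch below encodes that reading; the nesting itself is an assumption, and `reported_accuracy` is a hypothetical helper name.

```python
from statistics import mean


def reported_accuracy(run_accuracies):
    """Average test accuracy over runs and repeated test evaluations.

    run_accuracies: one list per training run, each holding the accuracies
    of that run's repeated test evaluations (assumed nesting).
    """
    return mean(mean(repetitions) for repetitions in run_accuracies)


if __name__ == "__main__":
    # Two toy runs with two test repetitions each.
    print(reported_accuracy([[0.5, 0.7], [0.6, 0.8]]))
```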

5.5 Overall Evaluation Results

Table 2: Few-shot node classification performance on the four datasets.
| Model | Cora-full 5-way 3-shot | Cora-full 5-way 5-shot | Cora-full 10-way 3-shot | Cora-full 10-way 5-shot | DBLP 5-way 3-shot | DBLP 5-way 5-shot | DBLP 10-way 3-shot | DBLP 10-way 5-shot |
| Meta-GNN | 57.60 | 60.92 | 38.48 | 40.68 | 78.38 | 79.34 | 67.02 | 68.54 |
| Meta-GNN+CPT | 59.40 | 62.36 | 42.45 | 43.81 | 79.68 | 81.60 | 68.84 | 70.24 |
| Δ Gain | +3.1% | +2.4% | +10.3% | +7.7% | +1.7% | +2.8% | +2.7% | +2.5% |
| G-Meta | 57.93 | 60.30 | 45.67 | 47.76 | 75.49 | 76.38 | 57.96 | 61.18 |
| G-Meta+CPT | 59.13 | 62.16 | 46.45 | 49.02 | 76.37 | 76.75 | 60.06 | 63.64 |
| Δ Gain | +2.1% | +3.0% | +1.7% | +2.6% | +1.2% | +0.5% | +3.6% | +4.0% |
| AMM-GNN | 59.06 | 61.66 | 20.82 | 27.98 | 79.12 | 82.34 | 65.93 | 68.36 |
| AMM-GNN+CPT | 63.58 | 64.04 | 30.25 | 40.31 | 81.54 | 84.52 | 68.26 | 68.83 |
| Δ Gain | +7.7% | +3.9% | +45.3% | +44.1% | +3.1% | +2.6% | +3.5% | +0.7% |
| TENT | 64.80 | 69.24 | 51.73 | 56.00 | 75.76 | 79.38 | 67.59 | 69.77 |
| TENT+CPT | 67.48 | 70.50 | 52.99 | 58.82 | 81.40 | 82.98 | 70.51 | 70.89 |
| Δ Gain | +4.1% | +1.8% | +2.4% | +5.1% | +7.4% | +4.5% | +4.3% | +1.6% |
| Averaged Δ Gain | +4.3% | +2.8% | +14.8% | +14.9% | +3.4% | +2.6% | +3.5% | +2.2% |

| Model | Amazon-E 5-way 3-shot | Amazon-E 5-way 5-shot | Amazon-E 10-way 3-shot | Amazon-E 10-way 5-shot | OGBN-arxiv 5-way 3-shot | OGBN-arxiv 5-way 5-shot | OGBN-arxiv 10-way 3-shot | OGBN-arxiv 10-way 5-shot |
| Meta-GNN | 70.26 | 75.10 | 61.14 | 66.01 | 48.22 | 50.24 | 31.30 | 31.98 |
| Meta-GNN+CPT | 72.32 | 78.32 | 63.95 | 67.58 | 49.18 | 51.92 | 33.87 | 35.11 |
| Δ Gain | +2.9% | +4.3% | +4.6% | +2.4% | +2.0% | +3.3% | +8.2% | +9.8% |
| G-Meta | 73.50 | 77.02 | 61.85 | 62.93 | 44.29 | 48.72 | 41.26 | 46.64 |
| G-Meta+CPT | 74.64 | 78.26 | 62.72 | 67.07 | 47.28 | 51.08 | 42.24 | 48.68 |
| Δ Gain | +1.5% | +1.6% | +1.4% | +6.6% | +6.8% | +4.8% | +2.4% | +4.3% |
| AMM-GNN | 73.95 | 76.10 | 62.91 | 68.34 | 50.54 | 52.44 | 30.96 | 32.66 |
| AMM-GNN+CPT | 76.54 | 78.64 | 67.57 | 69.74 | 52.58 | 56.44 | 32.87 | 35.76 |
| Δ Gain | +3.5% | +3.3% | +7.4% | +2.0% | +4.0% | +7.6% | +6.2% | +9.5% |
| TENT | 74.26 | 78.12 | 65.66 | 70.49 | 55.62 | 62.96 | 41.13 | 46.02 |
| TENT+CPT | 79.32 | 83.98 | 70.25 | 74.35 | 59.24 | 66.56 | 57.64 | 64.50 |
| Δ Gain | +6.8% | +7.5% | +7.0% | +5.4% | +6.5% | +4.1% | +41.8% | +40.2% |
| Averaged Δ Gain | +3.7% | +4.2% | +5.1% | +4.1% | +4.8% | +4.9% | +14.6% | +15.9% |

We compare the few-shot node classification performance of four baseline methods and their CPT-enhanced versions in Table 2. This comprehensive comparison allows us to make the following observations:

  • CPT brings performance improvement to all baseline methods, demonstrating that CPT technology can effectively enhance the performance of few-shot node classification models. Moreover, the average gain exceeds 3% in most cases, which is a significant improvement. The gain effect brought by CPT varies for different models, which may be related to the structure and characteristics of the model. Among them, TENT+CPT achieves the highest performance in most cases.

  • The 10-way task setting is more challenging than the 5-way setting because it involves more categories. We find that the cases with very high gains all come from 10-way tasks, such as AMM-GNN on Cora-full and TENT on OGBN-arxiv. This indicates that CPT may be more effective for more complex classification tasks.

  • The 3-shot task setting is more challenging than the 5-shot setting because it has fewer samples to learn from. However, based on the experimental results, the difference in gains between 3-shot and 5-shot tasks is not significant, which suggests that the effect of CPT may not be strongly related to the number of samples.
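The Δ Gain entries in Table 2 appear to be relative improvements over the corresponding baseline rather than absolute differences in accuracy. The short check below, which assumes that reading, reproduces the +3.1% entry for Meta-GNN on Cora-full (5-way 3-shot) from the two accuracies in the table; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(baseline, enhanced):
    """Relative improvement of `enhanced` over `baseline`, in percent."""
    return 100.0 * (enhanced - baseline) / baseline


if __name__ == "__main__":
    # Meta-GNN vs. Meta-GNN+CPT on Cora-full, 5-way 3-shot (Table 2).
    print(f"{relative_gain(57.60, 59.40):+.1f}%")
    # AMM-GNN vs. AMM-GNN+CPT on Cora-full, 10-way 3-shot (Table 2).
    print(f"{relative_gain(20.82, 30.25):+.1f}%")
```

Under this reading, a low baseline accuracy inflates the gain: the 9.43-point improvement for AMM-GNN on Cora-full 10-way 3-shot corresponds to a relative gain of about 45%, which is worth keeping in mind when comparing gains across settings.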

5.6 Effectiveness in Mitigating Suboptimal Solutions

To further verify whether our method can help the model break through suboptimal points, we observe the loss changes of TENT and TENT+CPT, as depicted in Figure 3. In the figure, the loss curve of TENT+CPT is represented in blue, the loss curve of TENT is represented in red, and the black vertical line represents the boundary between the first and second stages. From this observation, we can infer the following:

In the first stage, the training loss and validation loss of the model are essentially consistent, as they follow the same training setup. However, in the second stage, TENT+CPT faces more challenging tasks, causing the training loss to remain relatively high. Despite this, its validation loss gradually decreases and becomes lower than TENT’s validation loss. In contrast, TENT exhibits low training loss but high validation loss. This suggests that by applying the CPT technology, the model’s generalization ability is improved, further mitigating the suboptimal solution problem.

CPT introduces a certain level of difficulty to the training tasks, pushing the model to learn more general and robust features, which eventually leads to better performance in the validation phase. This outcome supports the idea that the CPT technology can effectively address the suboptimal solution problem and improve the model’s performance in few-shot node classification tasks.

Figure 3: Comparison of the loss trends between TENT and TENT+CPT on Amazon-E.

5.7 Ablation Study

In this section, we conduct an ablation study to verify the effectiveness of the motivation behind CPT. First, we remove the second stage from CPT and retain only the first stage, using the default meta-learning strategy to train until convergence. We refer to this variant as CPT w/o -SS. Second, we remove the first stage and keep the second stage, directly training the initial model on progressively harder tasks. This variant is referred to as CPT w/o -FS. Finally, we reverse the order of the first and second stages in CPT. In this variant, the model first learns hard tasks and then learns easy tasks, and we refer to it as CPT-reverse. The experimental results are shown in Table 3.

Table 3: Ablation study with TENT on Amazon-E.
| Variant | 5-way 3-shot | 5-way 5-shot | 10-way 3-shot | 10-way 5-shot |
| CPT w/o -SS | 74.26 | 78.12 | 65.66 | 70.49 |
| CPT w/o -FS | 74.78 | 80.61 | 67.23 | 71.64 |
| CPT-reverse | 76.09 | 81.14 | 67.86 | 72.26 |
| CPT | 79.32 | 83.98 | 70.25 | 74.35 |
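The four variants in Table 3 differ only in which training stages are run and in what order. The mapping below is an illustrative summary of that description, not code from the paper; the stage labels "default" and "progressive" are hypothetical names for the first and second stages of CPT.

```python
# Stage order per ablation variant, as described in the text.
# "default"     = first stage (standard episodic meta-training)
# "progressive" = second stage (competence-progressive, harder tasks)
STAGE_ORDER = {
    "CPT": ("default", "progressive"),
    "CPT w/o -SS": ("default",),
    "CPT w/o -FS": ("progressive",),
    "CPT-reverse": ("progressive", "default"),
}


def stage_order(variant):
    """Return the tuple of stages run by an ablation variant, in order."""
    return STAGE_ORDER[variant]


if __name__ == "__main__":
    for name in STAGE_ORDER:
        print(name, "->", " then ".join(stage_order(name)))
```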

Through the results of the ablation experiments, we have made the following findings:

  • We find that the performance of the variant CPT w/o -FS is somewhat better than that of the variant CPT w/o -SS. This indicates that the competence-progressive training strategy can help the model converge more effectively than the default episodic meta-learning.

  • We also find that the performance of the aforementioned two variants is significantly lower than that of CPT. To understand this, we observe their training process and find that both variants might fall into suboptimal solutions, as depicted in Figure 4. Although the validation loss curve of the CPT w/o -FS variant is slightly better than that of the CPT w/o -SS variant, their curve trajectories and loss values are very similar overall. A possible reason for this similarity is that both variants involve random task sampling, which might lead to encountering very challenging tasks early in training. This difficulty in knowledge acquisition could consequently result in convergence towards suboptimal points. In contrast, CPT effectively tackles this problem through its two-stage training approach.

  • Moreover, we find that if the model first learns difficult tasks and then learns easy tasks, i.e., CPT-reverse, the performance is also worse than that of CPT. The experimental results suggest that the order of learning is important, and that learning from easy to hard helps the model learn better.

Figure 4: Loss trend comparison of CPT, CPT w/o -SS, and CPT w/o -FS.

These phenomena indicate that models have different levels of competence at different stages, and that learning from easy to hard can enhance the model's generalization ability. Furthermore, the two-stage learning strategy can effectively alleviate the suboptimal solution problem.

6 CONCLUSION

In this paper, we discuss the limitations of the episodic meta-learning approach, where random task sampling might lead the model to converge to suboptimal solutions. To tackle this issue, we propose a novel curriculum-based training strategy called CPT to better train the GNN meta-learner. This strategy is divided into two stages: the first stage trains on simple tasks to acquire preliminary capabilities, and the second stage progressively increases task difficulty as the model’s capabilities improve. We perform thorough experiments using widely-recognized datasets and methods. Our results show that CPT improves all the baseline methods, resulting in a substantial increase in the performance of models for few-shot node classification and offering important insights. In most cases, the average improvement is more than 3%. Notably, this is a general strategy that can be applied to any GNN architecture and loss function. Therefore, future work can be further extended to other tasks, such as few-shot link prediction and graph classification.

References

  • Bengio et al. [2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Jun 2009.
  • Bojchevski and Günnemann [2017] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. Cornell University - arXiv, Jul 2017.
  • Cascante-Bonilla et al. [2022] Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. Proceedings of the AAAI Conference on Artificial Intelligence, page 6912–6920, Sep 2022.
  • Chauhan et al. [2020] Jatin Chauhan, Deepak Nathani, and Manohar Kaul. Few-shot learning on graphs via super-classes based on graph spectral measures. arXiv: Learning, Feb 2020.
  • Dai et al. [2019] Dengxin Dai, Christos Sakaridis, Simon Hecker, and Luc Van Gool. Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. Cornell University - arXiv, Jan 2019.
  • Ding et al. [2020] Kaize Ding, Jianling Wang, Jundong Li, Kai Shu, Chenghao Liu, and Huan Liu. Graph prototypical networks for few-shot learning on attributed networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Oct 2020.
  • Fan et al. [2019] Zhou Fan, Cao Chengtai, Zhang Kunpeng, Trajcevski Goce, Zhong Ting, and Geng Ji. Meta-gnn on few-shot node classification in graph meta-learning. ACM Proceedings, Jan 2019.
  • Feng et al. [2020] Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, Jianping Shi, and Lizhuang Ma. Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum. ArXiv, abs/2004.08514, 2020.
  • Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv: Learning, Mar 2017.
  • Gu et al. [2021] Yaowen Gu, Si Zheng, and Jiao Li. Currmg: A curriculum learning approach for graph based molecular property prediction. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2021.
  • Guo et al. [2020] Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. Proceedings of the AAAI Conference on Artificial Intelligence, page 7839–7846, Jun 2020.
  • Hacohen and Weinshall [2019] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. Cornell University - arXiv, Apr 2019.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Cornell University - arXiv, May 2020.
  • Hu et al. [2022] Weihua Hu, Kaidi Cao, Kexin Huang, Edward W Huang, Karthik Subbian, and Jure Leskovec. Tuneup: A training strategy for improving generalization of graph neural networks. arXiv preprint arXiv:2210.14843, 2022.
  • Huang and Zitnik [2020] Kexin Huang and Marinka Zitnik. Graph meta learning via local subgraphs. Cornell University - arXiv, Jan 2020.
  • Jiang et al. [2014] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander G. Hauptmann. Self-paced learning with diversity. Neural Information Processing Systems, Dec 2014.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Li et al. [2023] Haoyang Li, Xin Wang, and Wenwu Zhu. Curriculum graph machine learning: A survey. arXiv preprint arXiv:2302.02926, Feb 2023.
  • Liu et al. [2019] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Learning to propagate for graph meta-learning. Cornell University - arXiv, Sep 2019.
  • Liu et al. [2020] Chenghao Liu, Zhihao Wang, Doyen Sahoo, Yuan Fang, Kun Zhang, and Steven CH Hoi. Adaptive task sampling for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 752–769. Springer, 2020.
  • Liu et al. [2021a] Zemin Liu, Yuan Fang, Chenghao Liu, and Steven C.H. Hoi. Relative and absolute location embedding for few-shot node classification on graph. Proceedings of the … AAAI Conference on Artificial Intelligence, May 2021.
  • Liu et al. [2021b] Zemin Liu, Trung-Kien Nguyen, and Yuan Fang. Tail-gnn: Tail-node graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1109–1119, 2021.
  • Ma et al. [2020] Ning Ma, Jiajun Bu, Jieyu Yang, Zhen Zhang, Chengwei Yao, Zhi Yu, Sheng Zhou, and Xifeng Yan. Adaptive-step graph meta-learner for few-shot graph classification. Cornell University - arXiv, Mar 2020.
  • McAuley et al. [2015] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. arXiv: Social and Information Networks, Jun 2015.
  • Mishra et al. [2017] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. Learning, Jul 2017.
  • Platanios et al. [2019] Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabás Póczos, and Tom M. Mitchell. Competence-based curriculum learning for neural machine translation. Cornell University - arXiv, Mar 2019.
  • Qu et al. [2018] Meng Qu, Jian Tang, and Jiawei Han. Curriculum learning for heterogeneous star network embedding via deep reinforcement learning. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Feb 2018.
  • Ravi and Larochelle [2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. International Conference on Learning Representations, Apr 2017.
  • Rong et al. [2019] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. Learning, Jul 2019.
  • Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. Neural Information Processing Systems, Mar 2017.
  • Sun et al. [2019] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
  • Sung et al. [2018] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
  • Szklarczyk et al. [2019] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research, 47(D1):D607–D613, 2019.
  • Tang et al. [2008] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2008.
  • Vakil and Amiri [2022] Nidhi Vakil and Hadi Amiri. Generic and trend-aware curriculum learning for relation extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2202–2213, 2022.
  • Vakil and Amiri [2023] Nidhi Vakil and Hadi Amiri. Curriculum learning for graph neural networks: A multiview competence-based approach. arXiv preprint arXiv:2307.08859, 2023.
  • Wang et al. [2020] Ning Wang, Minnan Luo, Kaize Ding, Lingling Zhang, Jundong Li, and Qinghua Zheng. Graph few-shot learning with attribute matching. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Oct 2020.
  • Wang et al. [2022] Song Wang, Kaize Ding, Chuxu Zhang, Chen Chen, and Jundong Li. Task-adaptive few-shot node classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1910–1919, 2022.
  • Wang et al. [2023] Hui Wang, Kun Zhou, Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. Curriculum pre-training heterogeneous subgraph transformer for top-n recommendation. ACM Transactions on Information Systems, page 1–28, Jan 2023.
  • Wei et al. [2023] Xiaowen Wei, Xiuwen Gong, Yibing Zhan, Bo Du, Yong Luo, and Wenbin Hu. Clnode: Curriculum learning for node classification. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 670–678, 2023.
  • Wu et al. [2021] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, page 4–24, Jan 2021.
  • Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
  • Yan et al. [2021] Qilong Yan, Yufeng Zhang, Qiang Liu, Shu Wu, and Liang Wang. Relation-aware heterogeneous graph for user profiling. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3573–3577, 2021.
  • Zhang et al. [2020] Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. Every document owns its structure: Inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826, 2020.
  • Zhou et al. [2020] Tianyi Zhou, Shengjie Wang, and Jeff A. Bilmes. Curriculum learning by dynamic instance hardness. Neural Information Processing Systems, Jan 2020.