UniGLM: Training One Unified Language Model for
Text-Attributed Graphs
Abstract
Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored to an individual TAG and cannot generalize across various graph scenarios. Given the shared textual space, leveraging multiple TAGs for joint fine-tuning, aligning text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that accelerates training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM’s efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in-domain and out-of-domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.
Yi Fang1, Dongzhe Fan1, Sirui Ding2, Ninghao Liu3, Qiaoyu Tan1 1 New York University Shanghai 2 University of California, San Francisco; 3 University of Georgia {yf2722, qiaoyu.tan}@nyu.edu
1 Introduction
Text-attributed graphs (TAGs) have been widely adopted to represent complex relationships between textual entities in real-world textual and relational knowledge systems, including social media, recommendation systems, and knowledge bases. Unlike standard graphs, nodes in TAGs are represented by text attributes. A typical example is an academic citation network, where nodes represent scientific papers and edges indicate citations. To learn from TAGs, graph embedding (GE) Perozzi et al. (2014); Grover and Leskovec (2016); Zhang et al. (2018); Wu et al. (2020), which maps nodes into embedding vectors that preserve both textual and structural information, has recently garnered significant attention.
Prior GE studies He et al. (2024); Chen et al. (2024) on TAGs primarily focus on two stages: text transformation and graph structure modeling. In the first stage, text attributes are transformed into numerical feature vectors via shallow embedding models such as Word2vec Mikolov et al. (2013) and Bag-of-Words (BoW) Harris (1954). Subsequently, the transformed node features, along with graph structure, are often fed into graph neural networks (GNNs) Kipf and Welling (2016a); Zhou et al. (2020) for structural analysis. Although these methods are straightforward, they may be suboptimal in effectively integrating text semantics and structure knowledge.
In recent years, there has been a notable shift in interest from shallow models to pretrained language models (PLMs) such as BERT Devlin et al. (2019). The high-level idea is to jointly learn text knowledge and graph structure within a single encoder, either by developing nested graph-BERT architectures Yang et al. (2021) or by designing structure-aware training signals Chien et al. (2022); Jin et al. (2023). Despite their popularity, these methods face limitations in generalization capability because they fine-tune the BERT model for a single particular TAG, making it ineffective for transferring to other TAGs for representation learning. Given that text attributes provide a unified semantic space across different TAGs, leveraging multiple TAGs for joint fine-tuning is a promising yet under-explored research direction, supported by the scaling law Kaplan et al. (2020).
However, training a unified BERT model for multiple TAGs presents several challenges. Firstly, extracting effective structural information across various graph scenarios while maintaining their unique statistics for LM fine-tuning is difficult. Given the diversity and variability of TAGs, local structures such as node degrees and global structures within the graph vary from node to node and from graph to graph. Secondly, directly combining multiple TAGs for joint BERT training may suffer from memory and training efficiency issues due to the non-i.i.d. nature of graphs. Unlike pure text-based LM training, textual nodes in TAGs are strongly correlated with one another. Consequently, anchor nodes and their structurally similar neighbors need to be processed by BERT simultaneously, leading to significant trade-offs between computational cost and memory consumption.
To address the aforementioned challenges, we propose a novel unified graph language model (UniGLM) framework, the first self-supervised language model pre-training method tailored for multiple TAGs. The key idea is to enhance a language model’s (e.g., BERT) graph embedding capability by fine-tuning it using large-scale, diverse, and cross-domain text-to-structure knowledge based on contrastive learning. Specifically, to tackle the first challenge, we introduce an adaptive positive sample selection technique that identifies positive samples by considering each node’s local, global, and graph-specific contexts. This sampling strategy is personalized and can effectively align textual nodes and their important neighbors across different TAGs. To address the second challenge, we devise a dynamic memory bank to encode positive samples on-the-fly, thereby accelerating training by avoiding repetitive encoding of positive samples’ text attributes via BERT. Our major contributions are summarized below.
- We explore the development of a generalist embedding model for TAGs and introduce UniGLM, a novel language model pre-training framework tailored for a set of TAGs. To the best of our knowledge, UniGLM is the first graph embedding foundation model for TAGs.
- We propose an adaptive positive sample selection method for sampling positive samples of each node for contrastive learning. Unlike standard sampling strategies, our personalized scheme identifies positive samples based on nodes’ local, global, and graph-related contexts, thereby unifying graph structures across various TAGs.
- We devise a simple yet effective dynamic embedding table scheme to encode sampled positive samples on-the-fly during mini-batch training. By maintaining an external memory bank to update and retrieve embeddings of positive examples, we accelerate the training process using historical embeddings as supervision.
- We conduct extensive experiments on 9 benchmark TAGs of varying sizes and domains. Empirical results show that UniGLM not only outperforms state-of-the-art graph embedding models across various downstream tasks (node classification and link prediction) and backbones (GNNs and MLPs), but can also generate informative embeddings for unseen TAGs.
2 Preliminaries
In this section, we introduce notations, formulate the research problem, and illustrate the motivation behind learning from multiple TAGs.
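Notations and Problem Formulation. We are given $M$ TAGs, denoted as $\{\mathcal{G}_i\}_{i=1}^{M}$, where $\mathcal{G}_i = (\mathcal{V}_i, \mathbf{A}_i, \mathcal{T}_i)$ represents the $i$-th TAG with $N_i$ nodes. $\mathcal{V}_i$ is the set of nodes and $\mathbf{A}_i$ is the adjacency matrix. Each node $v \in \mathcal{V}_i$ is associated with a textual attribute $t_v$, and $\mathcal{T}_i$ represents the set of these attributes.
TAG Embedding. Given a TAG $\mathcal{G}_i$, a standard embedding model aims to learn a graph encoder $f_{\theta_i}$ that maps nodes in $\mathcal{G}_i$ into embedding vectors, preserving both textual ($\mathcal{T}_i$) and structural ($\mathbf{A}_i$) knowledge. Therefore, for $M$ TAGs, traditional methods learn $M$ independent graph encoders, denoted as $\{f_{\theta_i}\}_{i=1}^{M}$.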
Motivation. Learning one graph encoder for each particular TAG is the de facto standard in state-of-the-art graph embedding literature. However, we argue that this setting is suboptimal for two main reasons. i) Deployment inefficiency. As discussed above, this learning procedure requires the development of separate graph encoders for all TAGs, significantly increasing the deployment and maintenance costs in practice. ii) Limited performance. Given the shared textual space across various TAGs, pre-training a language model for a single TAG is inherently less effective because it cannot leverage the text-to-structural knowledge across TAGs. According to the scaling law Kaplan et al. (2020), incorporating more structure-aware textual knowledge for collaborative language model fine-tuning may be advantageous.
Motivated by this, we aim to explore learning from multiple TAGs, as follows.
Learning on Multiple TAGs. Given a set of TAGs $\{\mathcal{G}_i\}_{i=1}^{M}$, our objective is to develop a single unified graph encoder $f_\theta$, such that the text-to-structure knowledge across TAGs is collectively preserved within the embedding space.
3 The Proposed Method
In this section, we present the details of UniGLM, as depicted in Figure 1. First, we introduce the contrastive-based collective language model pre-training pipeline in Section 3.1. Then, in Section 3.2, we elaborate on the methodology to adaptively select positive samples for collaborative training. Finally, in Section 3.3, we introduce a simple but effective optimization strategy (Embedding Table) to accelerate our learning process.
3.1 The Overall Pipeline of Collaborative Language Model Pre-training
Learning from multiple TAGs is challenging due to the heterogeneity of textual attributes and graph structures. Existing methods address this challenge either by employing GNN-nested Transformers, as in GraphFormers Yang et al. (2021), to capture both textual attributes and the correlations among nodes, or by adopting structure-aware objectives to fine-tune LMs Chien et al. (2022). While the former approach is effective, it may encounter efficiency issues due to the combination of GNNs and Transformers. Conversely, the latter approach is computationally efficient, but it predominantly focuses on learning from individual TAGs, leaving joint learning from diverse TAGs relatively under-explored.
To bridge the gap, we pursue the second direction by training a unified language model using structure-aware learning signals from multiple TAGs $\{\mathcal{G}_i\}_{i=1}^{M}$. Specifically, let $v$ represent an arbitrary node across TAGs, and $t_v$ denote its corresponding text attribute. The collaborative pre-training objective for node $v$ is defined as:
$$\mathcal{L}_v = -\frac{1}{|\mathcal{N}_v|} \sum_{u \in \mathcal{N}_v} \log \frac{\exp\left(\mathrm{sim}(f_\theta(t_v), f_\theta(t_u)) / \tau\right)}{\sum_{k \in \mathcal{B}} \exp\left(\mathrm{sim}(f_\theta(t_v), f_\theta(t_k)) / \tau\right)} \tag{1}$$
where $\mathcal{N}_v$ denotes the set of structurally similar nodes to $v$, $\mathcal{B}$ is the sampled batch set in mini-batch training with $v \in \mathcal{B}$, and $\tau$ signifies the temperature parameter. $\mathrm{sim}(\cdot,\cdot)$ is a similarity function such as the inner product, and $f_\theta$ is implemented as BERT by default, enabling the encoding of text attributes into embedding vectors. By optimizing Eq. (1), the language encoder $f_\theta$ is trained to generate similar representations for node $v$ and nodes in $\mathcal{N}_v$, while simultaneously pushing apart the representations of $v$ and other nodes in the mini-batch set $\mathcal{B}$. Notably, as nodes in $\mathcal{B}$ are randomly sampled across TAGs, Eq. (1) provides a simple yet effective way to learn from nodes in various scenarios due to its instance-wise discriminative nature.
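As a minimal sketch of the kind of objective described around Eq. (1), the following standard-library Python computes an InfoNCE-style loss for a single anchor. It assumes inner-product similarity, averages over positives, and normalizes over the mini-batch; the function names `dot` and `contrastive_loss` and the default temperature are illustrative choices, not taken from the released code.

```python
import math

def dot(a, b):
    """Inner-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(anchor, positives, batch, tau=0.5):
    """InfoNCE-style loss for one anchor node (illustrative sketch).

    anchor:    embedding of the anchor node
    positives: embeddings of its structurally similar nodes
    batch:     embeddings of the nodes in the mini-batch (contrast set)
    tau:       temperature (hypothetical default)
    """
    # Log of the normalizer over the mini-batch, via log-sum-exp for stability.
    logits = [dot(anchor, z) / tau for z in batch]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Average negative log-probability over the positive samples.
    losses = [log_denominator - dot(anchor, p) / tau for p in positives]
    return sum(losses) / len(losses)
```

A positive that is aligned with the anchor yields a smaller loss than one that is orthogonal to it, which is the behavior the objective is meant to reward.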
While conceptually simple and feasible, learning through Eq. (1) faces two major challenges in practice. C1: The structurally similar node set $\mathcal{N}_v$ of $v$ is not well-defined. Given the heterogeneity of various TAGs and node-level statistics, random sampling based on neighbors may be suboptimal for capturing the diversity between nodes across different TAGs. C2: It presents a trade-off between model performance and training efficiency. The computational costs of Eq. (1) are determined by the number of positive samples ($|\mathcal{N}_v|$) and the batch size ($|\mathcal{B}|$), as it requires the language model to encode on the order of $|\mathcal{B}| \cdot |\mathcal{N}_v|$ text sequences per iteration. While reducing the number of positive samples can accelerate training, it may degrade performance since sufficient structural information is critical for graph contrastive learning You et al. (2020); Zhu et al. (2021); Zhang et al. (2024). In Section 3.2 and Section 3.3, we introduce two strategies to address these challenges, respectively.
3.2 Adaptive Positive Sample Selection
To extract the structurally similar node set $\mathcal{N}_v$ of $v$ in Eq. (1) (C1), the conventional protocol randomly samples some nodes from $v$’s neighbors, i.e., nodes directly connected to $v$ in the original TAG. However, this seemingly intuitive approach is suboptimal in our learning scenario, given that neighborhood distributions frequently exhibit significant variability both within and across graphs. To effectively consolidate critical structure information across diverse TAGs, we posit that an advanced sampling strategy should account for the following essential factors.
- Local Structure. Leveraging the local neighborhood structure, i.e., directly connected nodes, constitutes a fundamental design principle of GNNs Kipf and Welling (2016a). To achieve a unified alignment of text attributes and graph structures across various TAGs, the local neighbors of nodes are indispensable.
- •
- Graph Statistics. Unlike the standard learning paradigm, learning from multiple TAGs necessitates consideration of the unique characteristics inherent to each graph. For example, node degree distributions vary across TAGs, leading to diverse interpretations of hub nodes. Additionally, node statuses are graph-specific and not directly comparable between TAGs.
Motivated by these observations, we propose a positive sample selection scheme that adaptively samples structurally similar neighbors by considering nodes’ local and high-order neighbors, as well as their unique statuses within each graph. Specifically, we define an Adaptive Positive Sample (APS) selection function as follows:
$$\mathcal{N}_v = \mathrm{APS}(\mathcal{C}_v, \mathbf{w}_v) \tag{2}$$
where the set of positive sample candidates is denoted as $\mathcal{C}_v$ and the corresponding sampling weights are denoted as $\mathbf{w}_v$. In each iteration, we select positive samples from the candidate set $\mathcal{C}_v$, which is adaptively chosen to emphasize nodes with more informative, structure-related attributes. The selection is weighted based on the unique statistical characteristics of each graph. Next, we describe the process of obtaining these candidates and weights.
Given a graph $\mathcal{G}_i$ and a central node $v$, the positive sample candidate set of $v$ is denoted as $\mathcal{C}_v = \mathrm{Cands}(v, \mathcal{G}_i)$. We define the function Cands() as follows:
(3) |
Here, $d_v$ represents the degree of node $v$ and $\bar{d}_{\mathcal{G}_i}$ is the average degree of $\mathcal{G}_i$. $\mathcal{N}^{1}_v$ and $\mathcal{N}^{h}_v$ denote the first-hop and high-order neighbor sets of $v$, respectively. For each central node, we select candidates adaptively by considering both individual node statistics and overall graph metrics. This approach ensures that nodes carrying local or high-order structure information are chosen as candidates for each central node with minimal noise.
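To make the candidate-selection idea concrete, the sketch below implements one plausible thresholding rule. The exact case split of Cands() is not reproduced here, so the rule is an assumption: a node whose degree is at least the graph's average degree keeps only its first-hop neighbors, while a lower-degree node is augmented with neighbors up to `k` hops away. The helper names, the adjacency-dictionary format, and the default `k=2` are all hypothetical.

```python
def average_degree(adj):
    """Mean node degree of a graph given as {node: set(neighbors)}."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def high_order_neighbors(adj, v, k):
    """Nodes reachable from v in 2..k hops, excluding v and its first-hop neighbors."""
    frontier = set(adj[v])
    seen = {v} | frontier
    found = set()
    for _ in range(k - 1):
        # Expand one hop and keep only nodes not visited before.
        frontier = {w for u in frontier for w in adj[u]} - seen
        found |= frontier
        seen |= frontier
    return found

def cands(adj, v, k=2):
    """Hypothetical Cands(): the degree threshold below is an assumed rule."""
    first_hop = set(adj[v])
    if len(first_hop) >= average_degree(adj):
        return first_hop  # well-connected node: local neighbors suffice
    return first_hop | high_order_neighbors(adj, v, k)  # sparse node: add high-order neighbors
```

On a small graph with one hub, the hub keeps its direct neighbors while leaf nodes also receive two-hop candidates, which illustrates how a degree-aware rule can adapt the candidate pool per node and per graph.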
To further take personalized node status into consideration, each node within the candidate set of a given central node is assigned a sampling weight. These weights are calculated by applying the softmax function to the PageRank scores of the candidates as follows.
$$w_u = \frac{\exp\left(\mathrm{PR}(u) / \tau_s\right)}{\sum_{k \in \mathcal{C}_v} \exp\left(\mathrm{PR}(k) / \tau_s\right)}, \quad u \in \mathcal{C}_v \tag{4}$$
where $\mathrm{PR}(u)$ is the PageRank score of node $u$, and $\tau_s$ is the temperature parameter that controls the concentration of the probability distribution. This strategy ensures that the selection probability of each candidate increases with its personalized node status, thus effectively leveraging its structural prominence within the graph.
Remark. Our design incorporates both local and higher-order structural information through the candidate selection function Cands(), enriching the structural signal extracted from TAGs. By further integrating PageRank scores into our adaptive sampling function APS(), the model prioritizes key nodes based on their centrality, adapting to each graph’s unique statistical characteristics. Together, these choices are intended to keep the sampling strategy robust and adaptable across various graph topologies.
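The weighting step can be sketched as a temperature-scaled softmax over precomputed PageRank scores followed by weighted sampling. This is a minimal illustration, not the released implementation: PageRank values are assumed to be given as a dictionary, sampling is assumed to be with replacement, and the function names and default temperature are hypothetical.

```python
import math
import random

def softmax_weights(scores, temperature=1.0):
    """Softmax over {node: score}, returning {node: probability}."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {u: math.exp((s - m) / temperature) for u, s in scores.items()}
    total = sum(exps.values())
    return {u: e / total for u, e in exps.items()}

def sample_positives(candidates, pagerank, k, temperature=1.0, rng=random):
    """Draw k positives (with replacement, an assumption) weighted by PageRank."""
    weights = softmax_weights({u: pagerank[u] for u in candidates}, temperature)
    nodes = sorted(weights)  # fixed order so results are reproducible for a seeded rng
    return rng.choices(nodes, weights=[weights[u] for u in nodes], k=k)
```

Because PageRank scores are typically small and close together, a low temperature is needed for the distribution to favor high-scoring candidates noticeably; with a large temperature the sampling approaches uniform.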
3.3 The Lazy Contrastive Module
Another challenge in fine-tuning LMs on multiple TAGs using Eq. (1) is the efficiency problem (C2). Given the constraints of GPU memory, there is a trade-off between the training batch size and the maximum number of positive samples considered. Increasing the batch size can accelerate training by reducing the number of iterations per epoch, which is important given the large number of training nodes in our learning scenario, but it comes at the expense of fewer positive samples per node, and vice versa. However, as verified in previous studies Chien et al. (2022); Fang et al. (2024), preserving a sufficient number of structurally similar nodes is crucial to the success of graph contrastive learning.
To address this dilemma, inspired by momentum contrast (MoCo) He et al. (2020), we introduce a lazy contrastive module that treats positive sample encoding as a dictionary look-up operation. Specifically, we establish a dynamic dictionary across various TAGs, which preserves and updates the representations of positive samples on-the-fly using the nodes in each batch, thereby avoiding the need to encode the text attributes of positive samples with the LM during training. Formally, we rewrite the standard contrastive loss in Eq. (1) into an efficient version, expressed as:
$$\mathcal{L}_v = -\frac{1}{|\mathcal{N}_v|} \sum_{u \in \mathcal{N}_v} \log \frac{\exp\left(\mathrm{sim}(f_\theta(t_v), \mathrm{lookup}(\mathbf{E}, \mathrm{idx}(u))) / \tau\right)}{\sum_{k \in \mathcal{B}} \exp\left(\mathrm{sim}(f_\theta(t_v), f_\theta(t_k)) / \tau\right)} \tag{5}$$
Here, $f_\theta$ is the LM encoder, $\mathcal{N}_v$ is the positive sample set of $v$, $\mathcal{B}$ is the mini-batch, and $\tau$ is the temperature, as in Eq. (1). $\mathrm{lookup}(\mathbf{E}, \mathrm{idx}(u))$ denotes a simple embedding look-up operation based on the embedding table $\mathbf{E}$ and the index $\mathrm{idx}(u)$ of node $u$, where $\mathbf{E} \in \mathbb{R}^{N \times d}$ with $N = \sum_{i} N_i$, $\mathrm{idx}(u)$ depends on both the graph index and the node index within the graph, and $d$ represents the hidden dimension. Compared to Eq. (1), learning through Eq. (5) is efficient, as the representations of positive samples are obtained via embedding retrieval without the need for explicit text encoding. It is also worth noting that Eq. (5) is memory-efficient since $\mathbf{E}$ is gradient-free, reducing the number of intermediate tensors kept for gradient calculation. Consequently, a larger batch size can be used to further enhance the training speed. We empirically demonstrate the efficacy of these designs in Table 5 (Appendix). Next, we show how to effectively implement the lookup operation.
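The efficiency argument can be illustrated with simple counting. The sketch below assumes that, without the embedding table, every anchor and each of its sampled positives is passed through the LM once per step (no deduplication), whereas with the table only anchors are encoded. The function name and the example numbers are hypothetical.

```python
def encoder_calls_per_step(batch_size, num_positives, lazy):
    """Number of text sequences passed through the LM in one training step."""
    if lazy:
        # Positives are read from the embedding table, so only anchors are encoded.
        return batch_size
    # Every anchor and each of its sampled positives is encoded.
    return batch_size * (1 + num_positives)
```

For instance, under these assumptions a batch of 32 anchors with 4 positives each would need 160 encoder passes per step in the standard setting and 32 in the lazy setting, which is why a larger batch can fit in the same budget.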
Dictionary Update and Retrieval. Given TAGs $\{\mathcal{G}_i\}_{i=1}^{M}$, we construct a dynamic embedding table $\mathbf{E}$ to store the embeddings of all nodes’ text attributes across TAGs. Each node is uniquely identified in $\mathbf{E}$ by combining its own index within the graph and the corresponding graph index. For example, let $v^{i}_{j}$ represent the $j$-th node in graph $\mathcal{G}_i$; its mapped node index in $\mathbf{E}$ is then denoted as $\mathrm{idx}(v^{i}_{j})$. Given a mini-batch $\mathcal{B}$ and the intermediate LM encoder $f_\theta$, we can update $\mathbf{E}$ on-the-fly as follows.
$$\mathbf{E}[\mathrm{idx}(v)] \leftarrow f_\theta(t_v), \quad \forall v \in \mathcal{B} \tag{6}$$
Here, only nodes in $\mathcal{B}$ are used to update the embedding table $\mathbf{E}$, which gives rise to the name "lazy", since the dictionary is updated with representations encoded in previous iterations. In parallel, given the index $\mathrm{idx}(u)$ of a positive sample $u$ of node $v$, we extract its hidden representation via simple indexing, i.e., $\mathbf{E}[\mathrm{idx}(u)]$, which requires no LM encoding and accelerates training.
Remark. In contrast to MoCo He et al. (2020), we do not employ an additional momentum LM encoder to update the embedding table over time. Instead, we directly utilize the encoded central nodes from previous training steps as the latest representations for updating $\mathbf{E}$. This design not only enhances our training speed, as demonstrated in Appendix Table 5, but also results in notable performance improvements, as empirically verified in Table 1, as it encourages the LM encoder to learn from previous experiences.
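A minimal sketch of such an embedding table is given below. It assumes a prefix-sum layout for the global index (graph offset plus node index) and zero initialization, neither of which is specified above; the class name `EmbeddingTable` and its methods are hypothetical. Plain Python lists stand in for detached tensors.

```python
from itertools import accumulate

class EmbeddingTable:
    """Gradient-free store of node embeddings shared across several graphs."""

    def __init__(self, graph_sizes, dim):
        # offsets[i] is the global index of node 0 in graph i (assumed prefix-sum layout).
        self.offsets = [0] + list(accumulate(graph_sizes))[:-1]
        # Zero initialization is an assumption made for this sketch.
        self.rows = [[0.0] * dim for _ in range(sum(graph_sizes))]

    def index(self, graph_id, node_id):
        """Map (graph index, node index within the graph) to a global row index."""
        return self.offsets[graph_id] + node_id

    def update(self, graph_id, node_id, embedding):
        """Store a copy of a freshly encoded anchor embedding."""
        self.rows[self.index(graph_id, node_id)] = list(embedding)

    def lookup(self, graph_id, node_id):
        """Retrieve the most recently stored embedding without running the LM."""
        return self.rows[self.index(graph_id, node_id)]
```

In a training step one would encode the anchors in the batch, read their positives with `lookup`, compute the loss, and then call `update` for each anchor, so that later steps can reuse those embeddings as positives.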
Table 1: Node classification accuracy (%) for different embedding types and backbones. Values in parentheses give the relative improvement of UniGLM over each baseline.
Dataset | Emb Types | MLP | GCN | SAGE | RevGAT |
---|---|---|---|---|---|
Computers | SE | 57.34±0.47 (+30.47%) | 70.22±0.40 (+15.55%) | 71.02±0.27 (+15.28%) | 70.54±0.23 (+17.11%) |
Computers | BERT | 54.04±0.20 (+38.43%) | 66.88±0.24 (+21.32%) | 67.25±0.12 (+21.74%) | 65.69±0.05 (+25.76%) |
Computers | GIANT | 73.05±0.31 (+2.41%) | 80.09±0.16 (+1.31%) | 81.03±0.12 (+1.04%) | 80.99±0.18 (+2.00%) |
Computers | PATTON | 71.60±0.30 (+4.48%) | 78.64±0.14 (+3.18%) | 79.98±0.15 (+2.36%) | 78.97±0.23 (+4.61%) |
Computers | MixGIA | 65.34±0.16 (+14.49%) | 75.13±0.08 (+8.00%) | 75.83±0.20 (+7.97%) | 75.67±0.25 (+9.17%) |
Computers | UniGLM | 74.81±0.14 | 81.14±0.19 | 81.87±0.10 | 82.61±0.13 |
Fitness | SE | 78.07±0.18 (+15.78%) | 83.20±0.19 (+9.18%) | 83.65±0.17 (+8.98%) | 84.26±0.16 (+8.12%) |
Fitness | BERT | 74.92±0.26 (+20.65%) | 80.77±0.23 (+12.47%) | 81.29±0.19 (+12.14%) | 80.96±0.49 (+12.52%) |
Fitness | GIANT | 89.03±0.07 (+1.53%) | 89.63±0.14 (+1.35%) | 90.15±0.06 (+1.12%) | 90.28±0.10 (+0.91%) |
Fitness | PATTON | 89.60±0.22 (+0.88%) | 90.03±0.21 (+0.90%) | 90.61±0.12 (+0.61%) | 90.58±0.22 (+0.57%) |
Fitness | MixGIA | 82.83±0.12 (+9.13%) | 86.05±0.11 (+5.57%) | 86.59±0.16 (+5.28%) | 86.63±0.15 (+5.16%) |
Fitness | UniGLM | 90.39±0.08 | 90.84±0.08 | 91.16±0.11 | 91.10±0.16 |
PubMed | SE | 68.06±2.03 (+20.45%) | 73.41±3.02 (+9.06%) | 74.25±2.48 (+10.07%) | 72.56±1.16 (+11.78%) |
PubMed | BERT | 59.79±2.71 (+37.11%) | 69.96±2.36 (+14.44%) | 63.12±2.43 (+29.48%) | 64.34±3.10 (+26.06%) |
PubMed | GIANT | 73.18±0.97 (+12.03%) | 76.93±0.73 (+4.07%) | 74.82±0.65 (+9.24%) | 75.54±1.09 (+7.37%) |
PubMed | PATTON | 79.64±1.30 (+2.94%) | 82.57±0.71 (-3.04%) | 80.26±0.95 (+1.83%) | 80.58±1.94 (+0.66%) |
PubMed | MixGIA | 71.11±2.51 (+15.29%) | 76.86±0.61 (+4.16%) | 73.71±1.32 (+10.88%) | 73.85±1.01 (+9.83%) |
PubMed | UniGLM | 81.98±1.32 | 80.06±1.83 | 81.73±1.06 | 81.11±0.69 |
Photo | SE | 61.24±0.41 (+25.44%) | 71.70±0.16 (+10.88%) | 72.14±0.33 (+11.88%) | 71.63±0.23 (+12.68%) |
Photo | BERT | 60.03±0.14 (+27.97%) | 69.63±0.42 (+14.17%) | 70.25±0.36 (+14.89%) | 68.79±0.09 (+17.33%) |
Photo | GIANT | 77.43±0.27 (-0.79%) | 79.79±0.14 (-0.36%) | 81.17±0.23 (-0.57%) | 80.69±0.43 (+0.02%) |
Photo | PATTON | 74.97±0.43 (+2.47%) | 78.40±0.18 (+1.40%) | 78.79±0.01 (+2.44%) | 79.79±0.14 (+1.15%) |
Photo | MixGIA | 70.72±0.11 (+8.63%) | 75.88±0.10 (+4.77%) | 77.34±0.09 (+4.36%) | 76.59±0.17 (+5.38%) |
Photo | UniGLM | 76.82±0.29 | 79.50±0.06 | 80.71±0.29 | 80.71±0.02 |
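The percentages in parentheses in the table above appear to be the relative change of UniGLM's mean accuracy with respect to each baseline's mean accuracy. A minimal sketch of that arithmetic, assuming this reading is correct (the function name is hypothetical), is:

```python
def relative_improvement(ours, baseline):
    """Percentage gain of `ours` over `baseline`, rounded to two decimals."""
    return round((ours - baseline) / baseline * 100, 2)
```

Applied to the Computers/MLP column, `relative_improvement(74.81, 57.34)` reproduces the reported +30.47% for SE, and a negative value indicates a baseline that scores higher than UniGLM, as with PATTON on PubMed/GCN.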
4 Experiments
Throughout the experiments, we aim to answer the following research questions. RQ1: How does UniGLM perform against leading graph embedding models in terms of node classification and link prediction tasks? RQ2: How well does UniGLM transfer in cross-domain and in-domain scenarios? RQ3: How does each component of UniGLM, i.e., the sampling strategy and the efficient embedding table, contribute to the performance? RQ4: What is the impact of different pre-trained language model backbones and hyper-parameters on UniGLM?
4.1 Experiments Setup
We evaluate UniGLM on eight TAG datasets (details in Table 9 in the Appendix). UniGLM is compared with multiple embedding models for node classification and link prediction. Node classification is conducted with MLP and GNNs (GCN, GraphSAGE, RevGAT), while link prediction uses MLP and Graph AutoEncoders. Our experiments focus on semi-supervised and transfer learning settings. More details can be found in the Appendix.
4.2 Performance Evaluation
Node Classification. To answer RQ1, we conduct extensive experiments on eight benchmark TAG datasets for node classification under the semi-supervised setting. The results are shown in Table 1 and in Table 8 (Appendix). We observe that: ① UniGLM outperforms existing graph embedding models on the node classification task with state-of-the-art GNNs and MLP in most settings. Table 1 shows that UniGLM achieves superior performance across most datasets and models, often ranking first. Although GIANT and PATTON improve over the SE method by leveraging language models and graph structure, UniGLM excels in most cases. ② Performance improvements grow with the volume of data used in training, which is consistent with the scaling law for TAGs. Specifically, as depicted in Figure 2, training UniGLM across all TAGs consistently outperforms the variant OneGLM trained on only a single TAG, with the largest observed performance gap being 10%. Furthermore, as indicated in Table 1 and Table 8, co-purchase networks exhibit more substantial improvements than citation networks.
Link Prediction. To further address RQ1, we evaluate UniGLM on link prediction using a 5% training edge set under a transfer learning setting. We train UniGLM on citation networks and evaluate on co-purchase networks. The results are illustrated in Figure 3 and Figure 9 (Appendix). We observe that ③ UniGLM exhibits a robust cross-domain transfer capability, generally surpassing other baselines in link prediction. Specifically, UniGLM achieves over on Ogbn-Product dataset, surpassing SE and BERT over .
4.3 Transfer Ability
We discuss RQ2 in this section. From the cross-domain perspective, we train the model on one domain and apply it to another. For instance, we train on co-purchase datasets and test on citation networks, or vice versa. These setups are benchmarked against BERT, GIANT fine-tuned on Computers, and PATTON pre-trained on Ogbn-Arxiv, and are evaluated on the node classification task. ④ We observe that training UniGLM on one domain can enhance performance on another domain. Specifically, as shown in Table 2, on the co-purchase dataset History, UniGLM trained on citation networks achieves the best result among the compared methods, with an accuracy of 80.22%.
Table 2: Cross-domain transfer results on node classification (accuracy, %). The training source of each model is given in parentheses.
Dataset | Emb Type | Accuracy |
---|---|---|
History | BERT | 78.79 ± 0.31 |
History | GIANT(Computers) | 77.35 ± 0.89 |
History | PATTON(Arxiv) | 79.78 ± 0.29 |
History | UniGLM(Citation) | 80.22 ± 0.40 |
Photo | BERT | 60.03 ± 0.14 |
Photo | GIANT(Computers) | 75.37 ± 0.48 |
Photo | PATTON(Arxiv) | 62.53 ± 0.68 |
Photo | UniGLM(Citation) | 63.81 ± 0.36 |
PubMed | BERT | 59.79 ± 2.71 |
PubMed | GIANT(Computers) | 57.88 ± 4.56 |
PubMed | PATTON(Arxiv) | 61.37 ± 1.71 |
PubMed | UniGLM(Purchase) | 68.29 ± 3.63 |
From the in-domain perspective of RQ2, we employ UniGLM, trained on eight TAGs, as the text encoder for the unseen VideoGames dataset and evaluate on both tasks. For link prediction, we use co-viewed edges for training and evaluate on co-purchased edges. Results are shown in Table 3. ⑥ We observe that UniGLM exhibits superior in-domain transfer ability. Specifically, UniGLM significantly outperforms BERT on both tasks, highlighting its robustness and effectiveness in in-domain transfer.
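Table 3: In-domain transfer results on the unseen VideoGames dataset.
Task | Metric | UniGLM | BERT |
---|---|---|---|
Link Prediction | AP | | |
Link Prediction | AUC | | |
Node Classification | ACC | | |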
4.4 Ablation Study
To answer RQ3 from the perspective of sampling strategies, we employ various methods for selecting positive samples, with results shown in Figure 4. ⑦ We observe that it is crucial to consider both node degree and graph density to capture proper structure information. To further answer RQ3 with respect to the embedding table, we compare node classification performance with and without the embedding table in Figure 5 and Figure 8. ⑧ We observe that the embedding table not only decreases the training time but also increases the performance of UniGLM.
To study the effect of different pre-trained LM backbones in RQ4, we use BERT, RoBERTa, and DeBERTa, respectively; the results can be seen in Figure 6 and Figure 10. ⑨ We observe that different encoder backbones do not significantly affect the overall performance of UniGLM, although the preferred encoder may vary across datasets. To test the effect of the number of positive samples, the hyper-parameter considered in RQ4, we experiment with various sample sizes. The results, presented in Figure 7, reveal that ⑩ performance is consistent across sample sizes, i.e., UniGLM is not sensitive to this hyper-parameter.
5 Related Work
Our work is related to the following two directions.
Representation Learning on TAGs. Text-attributed graphs (TAGs) have gained significant attention in both academia and industry. Early methods used shallow embedding techniques, which struggled to integrate textual content with graph structure. With pre-trained language models (PLMs), features are now extracted and fed into graph neural networks (GNNs), but this approach often falls short. Recent studies have developed better methods to integrate PLM features into GNNs, improving model performance Ioannidis et al. (2022); Chien et al. (2022); Zhao et al. (2023a); Jin et al. (2023).
Graph Foundation Models. Graph Foundation Models (GFMs) aim to generalize across various graphs and tasks. Xia et al. (2023) proposed OpenGraph, excelling in zero-shot learning. Self-supervised learning (SSL) is crucial for pre-training GFMs. Zhao et al. (2023b) categorized SSL tasks based on graph-embedded knowledge, enhancing model adaptability. Liu et al. (2023) identified challenges and future directions for GFMs, while Tan et al. (2023) improved GFM generalizability through structure reconstruction with their model.
6 Conclusion
In this study, we introduce UniGLM, a unified framework designed to pre-train a single language model for TAGs across domains. UniGLM features two main innovations: (1) an adaptive sampling strategy that selects positive samples and (2) a dynamic embedding table that efficiently encodes these samples on-the-fly to speed up training. We validate UniGLM across diverse TAGs from different domains, where it generally surpasses existing methods in node classification and link prediction, demonstrating its transfer ability, effectiveness, and efficiency.
Limitation
In this work, we primarily concentrate on employing language models as unified embedding frameworks for text-attributed graphs (TAGs). Looking ahead, several interesting avenues emerge for extending this research. Firstly, there is a compelling need to explore foundation embedding models tailored for multimodal graphs. These models could integrate diverse types of data, such as textual, visual, and auditory information, enhancing the richness of the embeddings and opening up new possibilities for graph analytics. Secondly, the application of generative language models to graph tasks presents a promising frontier, and it is still unknown how UniGLM can be applied in that direction. These directions promise not only to expand the capabilities of graph neural networks but also to bridge the gap between structured graph data and unstructured multimodal data.
References
- Chen et al. (2024) Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2024. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter, 25(2):42–61.
- Chien et al. (2022) Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang, Olgica Milenkovic, and Inderjit S Dhillon. 2022. Node feature extraction by self-supervised multi-scale neighborhood prediction. In International Conference on Learning Representations.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
- Fang et al. (2024) Yi Fang, Dongzhe Fan, Daochen Zha, and Qiaoyu Tan. 2024. Gaugllm: Improving graph contrastive learning for text-attributed graphs with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
- Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pages 1024–1034.
- Harris (1954) Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517. International World Wide Web Conferences Steering Committee.
- He et al. (2024) Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi. 2024. Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning. In Proceedings of the International Conference on Learning Representations (ICLR).
- Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133.
- Ioannidis et al. (2022) Vassilis N. Ioannidis, Xiang Song, Da Zheng, Houyu Zhang, Jun Ma, Yi Xu, Belinda Zeng, Trishul Chilimbi, and George Karypis. 2022. Efficient and effective training of language and graph neural network models. arXiv preprint arXiv:2206.10781.
- Jin et al. (2023) Bowen Jin, Wentao Zhang, Yu Zhang, Yu Meng, Xinyang Zhang, Qi Zhu, and Jiawei Han. 2023. Patton: Language model pretraining on text-rich networks. arXiv preprint arXiv:2305.12268.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Kipf and Welling (2016b) Thomas N. Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Bayesian Deep Learning Workshop (NIPS 2016).
- Li et al. (2021) Guohao Li, Matthias Muller, Bernard Ghanem, and Vladlen Koltun. 2021. Training graph neural networks with 1000 layers. In Proceedings of the International Conference on Machine Learning (ICML).
- Liu et al. (2023) Jiawei Liu, Cheng Yang, Zhiyuan Lu, Junze Chen, Yibo Li, Mengmei Zhang, Ting Bai, Yuan Fang, Lichao Sun, Philip S. Yu, and Chuan Shi. 2023. Towards graph foundation models: A survey and beyond. arXiv preprint arXiv:2310.11829.
- Liu et al. (2020) Meng Liu, Hongyang Gao, and Shuiwang Ji. 2020. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 338–348.
- Liu et al. (2021) Zemin Liu, Trung-Kien Nguyen, and Yuan Fang. 2021. Tail-gnn: Tail-node graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1109–1119.
- McAuley et al. (2015) Julian McAuley, Christopher Targett, Javen Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 188–197.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710.
- Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine, 29(3):93–93.
- Tan et al. (2023) Qiaoyu Tan, Ninghao Liu, Xiao Huang, Soo-Hyun Choi, Li Li, Rui Chen, and Xia Hu. 2023. S2gae: Self-supervised graph autoencoders are generalizable learners with graph masking. WSDM ’23, pages 787–795, New York, NY, USA. Association for Computing Machinery.
- Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24.
- Xia et al. (2023) Lianghao Xia, Ben Kao, and Chao Huang. 2023. Opengraph: Towards open graph foundation models. arXiv preprint arXiv:2403.01121.
- Yang et al. (2021) Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, and Xing Xie. 2021. Graphformers: Gnn-nested transformers for representation learning on textual graph. Advances in Neural Information Processing Systems, 34:28798–28810.
- You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823.
- Zhang et al. (2018) Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2018. Network representation learning: A survey. IEEE transactions on Big Data, 6(1):3–28.
- Zhang et al. (2024) Xin Zhang, Qiaoyu Tan, Xiao Huang, and Bo Li. 2024. Graph contrastive learning with personalized augmentation. IEEE Transactions on Knowledge and Data Engineering.
- Zhao et al. (2023a) Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian Liu, Rui Li, Xing Xie, and Jian Tang. 2023a. Learning on large-scale text-attributed graphs via variational inference. In International Conference on Learning Representations. ICLR.
- Zhao et al. (2023b) Ziwen Zhao, Yuhua Li, Yixiong Zou, Ruixuan Li, and Rui Zhang. 2023b. A survey on self-supervised pre-training of graph foundation models: A knowledge-based perspective. arXiv preprint arXiv:2403.16137.
- Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI open, 1:57–81.
- Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pages 2069–2080.
Appendix A Appendix
Datasets. We evaluate the proposed UniGLM framework on eight publicly available TAG datasets: two citation networks, PubMed Sen et al. (2008) and Ogbn-Arxiv Hu et al. (2020); one co-purchase network, Ogbn-Products (subset), from TAPE He et al. (2024); and five e-commerce datasets extracted from Amazon Ni et al. (2019): Electronics-Computers (Computers), Books-History (History), Books-Children (Children), Sports-Fitness (Fitness), and Electronics-Photography (Photo). Notably, we also use another co-purchase network, VideoGames, from He and McAuley (2016) and McAuley et al. (2015). For node classification, we adhere to the standard data splits used in prior research for PubMed, Ogbn-Arxiv, and Ogbn-Products, and use a 5:20:75 split for the e-commerce datasets. For link prediction, we adopt the widely used 5:5:90 split and sample as many negative links as positive links.
- PubMed Sen et al. (2008). The PubMed dataset consists of 19,717 scientific publications from the PubMed database. The citation network consists of 44,338 links.
- Ogbn-Arxiv Hu et al. (2020). The Ogbn-Arxiv dataset is a directed graph representing the citation network between all Computer Science (CS) arXiv papers.
- Ogbn-Products (subset) He et al. (2024). The Ogbn-Products dataset represents an Amazon product co-purchasing network, with product descriptions as raw text.
- Electronics-Computers (Computers) Ni et al. (2019). The Electronics-Computers dataset is a segment of the Amazon co-purchase graph, where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews by default, and class labels are given by the product category.
- Books-History (History) Ni et al. (2019). Each node represents a book in the history domain, edges indicate that two books are frequently bought together, and class labels are given by the book category.
- Books-Children (Children) Ni et al. (2019). Each node represents a book in the children's domain, edges indicate that two books are frequently bought together, and class labels are given by the book category.
- Sports-Fitness (Fitness) Ni et al. (2019). Each node represents an item in the Sports and Fitness category.
- Electronics-Photography (Photo) Ni et al. (2019). This dataset contains review data from the Amazon platform. Each node represents a review of an item in the camera category.
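The link prediction protocol described above (a 5:5:90 split with one sampled negative link per positive link) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_links`, the reading of 5:5:90 as train:validation:test, the uniform negative sampler, and the undirected treatment of edges are all assumptions made for the example.

```python
import numpy as np

def split_links(edges, num_nodes, ratios=(0.05, 0.05, 0.90), seed=0):
    """Split positive edges into train/val/test parts and pair each part
    with an equal number of sampled negative (non-existent) links.

    Assumption: `ratios` is ordered train:val:test.
    """
    rng = np.random.default_rng(seed)
    edges = np.asarray(edges, dtype=int).reshape(-1, 2)
    perm = rng.permutation(len(edges))
    n_train = int(round(ratios[0] * len(edges)))
    n_val = int(round(ratios[1] * len(edges)))
    parts = {
        "train": edges[perm[:n_train]],
        "val": edges[perm[n_train:n_train + n_val]],
        "test": edges[perm[n_train + n_val:]],
    }
    # Assumption: edges are treated as undirected when rejecting negatives.
    existing = {(min(u, v), max(u, v)) for u, v in edges.tolist()}
    splits = {}
    for name, pos in parts.items():
        neg = []
        # Rejection sampling: draw random pairs until enough non-edges are found.
        while len(neg) < len(pos):
            u, v = (int(x) for x in rng.integers(0, num_nodes, size=2))
            if u != v and (min(u, v), max(u, v)) not in existing:
                neg.append((u, v))
        splits[name] = (pos, np.array(neg, dtype=int).reshape(-1, 2))
    return splits
```

Rejection sampling is adequate here because the benchmark graphs are sparse, so almost every random pair is a valid negative. Negatives are drawn independently per part, so the same negative pair may occasionally appear in more than one part.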
Baselines. We compare UniGLM with five textual feature extraction methods. Four of them, SE, BERT, GIANT, and PATTON, were introduced in Section 1. To compare UniGLM comprehensively with existing methods in the multi-dataset setting, we implement MixGIA as an additional baseline. MixGIA applies the GIANT framework to multiple datasets, leveraging the structural information of each dataset to fine-tune a shared BERT model. Specifically, the original GIANT constructs a clustering tree to classify each node layer by layer. Our implementation of MixGIA sets a fixed number of layers for the clustering tree and changes the tree along with the dataset at each layer.
Experimental Settings. For node classification, we input the node embeddings from UniGLM into an MLP and several state-of-the-art GNN models, including GCN Kipf and Welling (2016a), GraphSAGE Hamilton et al. (2017), and RevGAT Li et al. (2021). We run each experiment 5 times and report the mean and the standard deviation. We explore link prediction ability under the transfer learning setting; details are given in the link prediction evaluation. We use AUC and AP as metrics for link prediction, and test with an MLP and Graph AutoEncoders Kipf and Welling (2016b). For reproducibility, we employ GNN implementations from the PyG (Fey and Lenssen, 2019) package. The default language model backbone is BERT unless otherwise specified. Hyperparameters are shown below.
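As an illustration of the reporting conventions above, the following sketch computes AUC and AP from link scores and formats repeated-run results as mean±std. It uses only NumPy and is not the evaluation code of the paper; the function names are hypothetical, the standard deviation is assumed to be the population form (`ddof=0`), and ties between scores are not given special treatment in the AP computation.

```python
import numpy as np

def auc_score(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.
    Tied pairs count as half. Quadratic in memory, so suited to small inputs."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

def average_precision(y_true, y_score):
    """AP: mean of precision@k over the ranks k at which a positive appears."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def summarize(values):
    """Format results of repeated runs as mean±std (population std, an assumption)."""
    values = np.asarray(values, dtype=float)
    return f"{values.mean():.2f}±{values.std():.2f}"
```

For example, with labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.6, 0.1]`, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.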
Hyperparameters. Our GNN hyperparameters are set the same as in TAPE He et al. (2024). The hyperparameters for UniGLM are listed in Table 4:
Hyperparameter | Value |
---|---|
BATCH SIZE | 64 |
TEMPERATURE | 0.3 |
MAX SEQUENCE LENGTH | 512 |
NUM POS SAMPLES (t) | 6 |
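To show where the temperature and the number of positive samples from Table 4 enter a contrastive objective, here is a generic InfoNCE-style loss for one anchor with t positives. This is a hedged sketch only: the exact UniGLM loss is defined in the main text, not in this appendix, and the function name `multi_positive_info_nce`, the cosine similarity, and the averaging over positives are assumptions chosen for illustration.

```python
import numpy as np

def multi_positive_info_nce(anchor, positives, negatives, temperature=0.3):
    """InfoNCE-style loss for a single anchor embedding.

    anchor: shape (d,); positives: shape (t, d); negatives: shape (n, d).
    Each positive is contrasted against all negatives, and the t per-positive
    losses are averaged (an assumption, not necessarily the paper's choice).
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positives), normalize(negatives)
    pos_logits = p @ a / temperature  # cosine similarity / temperature, shape (t,)
    neg_logits = n @ a / temperature  # shape (n,)
    # Stable evaluation of log(exp(pos_i) + sum_j exp(neg_j)) for each positive i.
    shift = max(pos_logits.max(), neg_logits.max())
    neg_sum = np.exp(neg_logits - shift).sum()
    log_denominator = shift + np.log(np.exp(pos_logits - shift) + neg_sum)
    # Per-positive loss is -log(exp(pos_i) / denominator_i).
    return float(np.mean(log_denominator - pos_logits))
```

A lower temperature sharpens the softmax, so hard negatives contribute more strongly to the loss; with the values from Table 4, `positives` would hold t = 6 embeddings and `temperature` would be 0.3.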
Device and time cost. We compare the training and inference time of GIANT, PATTON, and UniGLM. For GIANT and PATTON, the training time is the sum of the training times on all 8 TAGs listed above. For UniGLM, it is the time to train one model on all 8 TAGs. Inference time is the time each method takes to encode Ogbn-Arxiv. We use an A800 (80 GB) GPU for training and inference.
 | GIANT | PATTON | UniGLM |
---|---|---|---|
Training Time | 14.24h | 5.4d | 13.5h |
Inference Time | 30min | 5h15min | 29min |
Additional Results. In this section, we present further results and details that were omitted from the main text due to space constraints.
Figure 8 shows the results with and without the Embedding Table design.
Figure 9 shows the results of using GAE as the inference model for link prediction.
Figure 10 shows the results of using different LLMs as the backbone.
Table 7 presents additional results demonstrating the effectiveness of our adaptive sampling strategy.
Dataset | Emb Type | Accuracy |
---|---|---|
History | with E | 82.51 ± 0.08 |
History | w/o E | 81.96 ± 0.29 |
Computers | with E | 81.14 ± 0.19 |
Computers | w/o E | 80.14 ± 0.14 |
Children | with E | 50.26 ± 0.15 |
Children | w/o E | 49.07 ± 0.25 |
Fitness | with E | 90.84 ± 0.08 |
Fitness | w/o E | 90.68 ± 0.10 |
Table 6 presents additional results demonstrating the effectiveness of our Embedding Table module.
Dataset | Sampling Variant | Accuracy |
---|---|---|
History | w/o Node Degree | 79.84 ± 0.24 |
History | w/o Graph Density | 83.07 ± 0.31 |
History | with both | 83.42 ± 0.18 |
PubMed | w/o Node Degree | 76.04 ± 1.59 |
PubMed | w/o Graph Density | 82.35 ± 0.83 |
PubMed | with both | 81.98 ± 1.32 |
Children | w/o Node Degree | 47.57 ± 0.29 |
Children | w/o Graph Density | 51.38 ± 0.20 |
Children | with both | 51.86 ± 0.31 |
Fitness | w/o Node Degree | 90.03 ± 0.11 |
Fitness | w/o Graph Density | 88.62 ± 0.07 |
Fitness | with both | 90.39 ± 0.08 |
Ogbn-Products | w/o Node Degree | 74.96 ± 0.57 |
Ogbn-Products | w/o Graph Density | 76.56 ± 1.14 |
Ogbn-Products | with both | 76.46 ± 0.21 |
Table 8 reports the remaining node classification results.
Dataset | Emb Types | MLP | GCN | SAGE | RevGAT |
---|---|---|---|---|---|
Children | SE | 38.84±0.35 (+33.52%) | 43.19±0.31 (+16.37%) | 44.83±0.24 (+16.28%) | 43.00±0.18 (+19.40%) |
Children | BERT | 44.00±0.83 (+17.86%) | 46.88±0.60 (+7.21%) | 47.97±0.55 (+8.67%) | 48.00±0.11 (+6.96%) |
Children | GIANT | 48.95±0.23 (+5.94%) | 48.47±0.35 (+3.69%) | 51.41±0.42 (+1.40%) | 50.63±0.36 (+1.40%) |
Children | PATTON | 49.91±0.13 (+3.91%) | 49.98±0.38 (+0.56%) | 52.01±0.50 (+0.23%) | 51.07±0.21 (+0.53%) |
Children | MixGIA | 47.49±0.19 (+9.20%) | 48.89±0.25 (+2.80%) | 50.60±0.19 (+3.02%) | 49.73±0.37 (+3.24%) |
Children | UniGLM | 51.86±0.31 | 50.26±0.15 | 52.13±0.34 | 51.34±0.22 |
Ogbn-Products | SE | 53.85±0.17 (+41.99%) | 70.52±0.51 (+9.34%) | 69.13±0.26 (+12.98%) | 69.64±0.17 (+13.47%) |
Ogbn-Products | BERT | 67.58±0.28 (+13.14%) | 74.77±0.87 (+3.13%) | 74.09±0.27 (+5.41%) | 74.53±0.26 (+6.02%) |
Ogbn-Products | GIANT | 72.46±0.33 (+5.52%) | 69.77±0.42 (+10.52%) | 68.69±1.19 (+13.70%) | 71.89±0.30 (+9.92%) |
Ogbn-Products | PATTON | 76.42±0.23 (+0.05%) | 77.22±0.34 (-0.14%) | 77.81±0.58 (+0.37%) | 78.48±0.15 (+0.69%) |
Ogbn-Products | MixGIA | 71.04±0.38 (+7.63%) | 76.13±0.82 (+1.29%) | 75.77±0.40 (+3.08%) | 76.24±0.33 (+3.65%) |
Ogbn-Products | UniGLM | 76.46±0.21 | 77.11±0.41 | 78.10±0.22 | 79.02±1.12 |
History | SE | 73.19±0.36 (+13.98%) | 77.03±0.70 (+7.11%) | 77.63±0.22 (+7.29%) | 77.83±0.27 (+6.68%) |
History | BERT | 78.79±0.31 (+5.88%) | 80.33±0.33 (+2.71%) | 80.63±0.33 (+3.30%) | 80.69±0.22 (+2.90%) |
History | GIANT | 81.37±0.32 (+2.52%) | 80.88±0.19 (+2.02%) | 82.25±0.19 (+1.26%) | 81.70±0.26 (+1.63%) |
History | PATTON | 82.88±0.24 (+0.65%) | 82.41±0.20 (+0.12%) | 83.43±0.18 (-0.17%) | 82.94±0.07 (+0.11%) |
History | MixGIA | 81.47±0.32 (+2.39%) | 81.68±0.24 (+1.02%) | 82.55±0.23 (+0.90%) | 82.26±0.26 (+0.94%) |
History | UniGLM | 83.42±0.18 | 82.51±0.08 | 83.29±0.17 | 83.03±0.23 |
Ogbn-Arxiv | SE | 64.30±0.09 (+13.56%) | 71.74±0.29 (+2.48%) | 71.49±0.27 (+4.15%) | 74.02±0.18 (+0.19%) |
Ogbn-Arxiv | BERT | 66.29±0.20 (+10.15%) | 72.63±0.31 (+1.23%) | 73.33±0.33 (+1.54%) | 72.88±0.39 (+1.76%) |
Ogbn-Arxiv | GIANT | 73.08±0.06 (-0.08%) | 73.29±0.10 (-0.31%) | 74.59±0.28 (-0.17%) | 75.96±0.09 (-2.37%) |
Ogbn-Arxiv | PATTON | 73.47±0.11 (-0.61%) | 73.59±0.20 (-0.08%) | 75.00±0.16 (-0.72%) | 74.08±0.12 (+0.11%) |
Ogbn-Arxiv | MixGIA | 67.62±0.24 (+7.99%) | 73.07±0.30 (+0.62%) | 73.70±0.10 (+1.03%) | 73.57±0.39 (+0.80%) |
Ogbn-Arxiv | UniGLM | 73.02±0.11 | 73.52±0.23 | 74.46±0.12 | 74.16±0.51 |
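The percentages in parentheses in the table above appear to be the relative change of the UniGLM accuracy with respect to each baseline. The source does not state this, so the reading is an assumption; it agrees with the two entries checked below. The helper name `relative_gain` is hypothetical.

```python
def relative_gain(uniglm_acc, baseline_acc):
    """Relative change (in percent) of the UniGLM accuracy over a baseline."""
    return 100.0 * (uniglm_acc - baseline_acc) / baseline_acc

# Children, MLP: SE baseline 38.84 vs. UniGLM 51.86
print(f"{relative_gain(51.86, 38.84):+.2f}%")  # +33.52%
# Ogbn-Arxiv, RevGAT: GIANT baseline 75.96 vs. UniGLM 74.16
print(f"{relative_gain(74.16, 75.96):+.2f}%")  # -2.37%
```

Under this reading, a negative value marks a setting in which the baseline outperforms UniGLM.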
Data | # Nodes | # Edges | # Avg. Tokens | # Classes |
---|---|---|---|---|
PubMed | ||||
Ogbn-Arxiv | ||||
Ogbn-Products | ||||
Electronics-Computers | ||||
Books-History | ||||
Books-Children | ||||
Sports-Fitness | ||||
Electronics-Photography |