UniGLM: Training One Unified Language Model for
Text-Attributed Graphs

Yi Fang1, Dongzhe Fan1, Sirui Ding2, Ninghao Liu3, Qiaoyu Tan1
1 New York University Shanghai
2 University of California, San Francisco; 3 University of Georgia
{yf2722, qiaoyu.tan}@nyu.edu
Abstract

Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored to an individual TAG and cannot generalize across graph scenarios. Given the shared textual space, jointly fine-tuning on multiple TAGs, which aligns text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that accelerates training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM’s efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in-domain and out-of-domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.


1 Introduction

Text-attributed graphs (TAGs) have been widely adopted to represent complex relationships between textual entities in real-world textual and relational knowledge systems, including social media, recommendation systems, and knowledge bases. Unlike standard graphs, nodes in TAGs are represented by text attributes. A typical example is an academic citation network, where nodes represent scientific papers and edges indicate citations. To learn from TAGs, graph embedding (GE) Perozzi et al. (2014); Grover and Leskovec (2016); Zhang et al. (2018); Wu et al. (2020), which maps nodes into embedding vectors that preserve both textual and structural information, has recently garnered significant attention.

Prior GE studies He et al. (2024); Chen et al. (2024) on TAGs primarily focus on two stages: text transformation and graph structure modeling. In the first stage, text attributes are transformed into numerical feature vectors via shallow embedding models such as Word2vec Mikolov et al. (2013) and Bag-of-Words (BoW) Harris (1954). Subsequently, the transformed node features, along with graph structure, are often fed into graph neural networks (GNNs) Kipf and Welling (2016a); Zhou et al. (2020) for structural analysis. Although these methods are straightforward, they may be suboptimal in effectively integrating text semantics and structure knowledge.

In recent years, there has been a notable shift in interest from shallow models to pretrained language models (PLMs) such as BERT Devlin et al. (2019). The high-level idea is to jointly learn text knowledge and graph structure within a single encoder, either by developing nested graph-BERT architectures Yang et al. (2021) or by designing structure-aware training signals Chien et al. (2022); ** et al. (2023). Despite their popularity, these methods generalize poorly because they fine-tune the BERT model for a single particular TAG, so the resulting encoder transfers ineffectively to other TAGs for representation learning. Given that text attributes provide a unified semantic space across different TAGs, leveraging multiple TAGs for joint fine-tuning is a promising yet under-explored research direction, supported by the scaling law Kaplan et al. (2020).

However, training a unified BERT model for multiple TAGs presents several challenges. Firstly, extracting effective structural information across various graph scenarios while maintaining their unique statistics for LM fine-tuning is difficult. Given the diversity and variability of TAGs, local structures such as node degrees and global structures within the graph vary from node to node and from graph to graph. Secondly, directly combining multiple TAGs for joint BERT training may suffer from memory and training efficiency issues due to the non-i.i.d. nature of graphs. Unlike pure text-based LM training, textual nodes in TAGs are strongly correlated with one another. Consequently, anchor nodes and their structurally similar neighbors need to be processed by BERT simultaneously, leading to significant trade-offs in computational and memory consumption.

To address the aforementioned challenges, we propose a novel unified graph language model (UniGLM) framework, the first self-supervised language model pre-training method tailored for multiple TAGs. The key idea is to enhance a language model’s (e.g., BERT) graph embedding capability by fine-tuning it on large-scale, diverse, and cross-domain text-to-structure knowledge using contrastive learning. Specifically, to tackle the first challenge, we introduce an adaptive positive sample selection technique that identifies positive samples by considering each node’s local, global, and graph-specific contexts. This sampling strategy is personalized and can effectively align textual nodes with their important neighbors across different TAGs. To address the second challenge, we devise a dynamic memory bank to encode positive samples on-the-fly, thereby accelerating training by avoiding repetitive encoding of positive samples’ text attributes via BERT. Our major contributions are summarized below.

  • We explore the development of a generalist embedding model for TAGs and introduce UniGLM, a novel language model pre-training framework tailored for a set of TAGs. To the best of our knowledge, UniGLM is the first graph embedding foundation model for TAGs.

  • We propose an adaptive positive sample selection method for sampling positive samples of each node for contrastive learning. Unlike standard sampling strategies, our personalized scheme identifies positive samples based on nodes’ local, global, and graph-related contexts, thereby unifying graph structures across various TAGs.

  • We devise a simple yet effective dynamic embedding table scheme to encode sampled positive samples on-the-fly during mini-batch training. By maintaining an external memory bank to update and retrieve embeddings of positive examples, we accelerate the training process using historical embeddings as supervision.

  • We conduct extensive experiments on 9 benchmark TAGs of varying sizes and domains. Empirical results show that UniGLM not only outperforms state-of-the-art graph embedding models across various downstream tasks (node classification and link prediction) and backbones (GNNs and MLPs), but can also generate informative embeddings for unseen TAGs.

Figure 1: The proposed UniGLM framework. The UniGLM framework trains a unified graph encoder across multiple TAGs using contrastive learning, instead of learning separate language models for each TAG. To ensure effective and efficient textual-to-structure alignment, we introduce an adaptive positive sample selection scheme and a lazy contrastive strategy. UniGLM serves as a foundational embedding model for TAGs, consistently delivering strong performance across various downstream tasks and backbones.

2 Preliminaries

In this section, we introduce notations, formulate the research problem, and illustrate the motivation behind learning from multiple TAGs.

Notations and Problem Formulation. We are given $m$ TAGs, denoted as $\{G_i \mid i=1,2,\ldots,m\}$, where $G_i=(\mathcal{V},\mathcal{T},\mathbf{A})$ represents the $i$-th TAG with $n_i$ nodes. $\mathcal{V}$ is the set of nodes and $\mathbf{A}\in\mathbb{R}^{n_i\times n_i}$ is the adjacency matrix. Each node $v\in\mathcal{V}$ is associated with a textual attribute $\mathcal{T}_v$, and $\mathcal{T}=\{\mathcal{T}_v \mid v\in\mathcal{V}\}$ represents the set of these attributes.

TAG Embedding. Given a TAG $G_i=(\mathcal{V},\mathcal{T},\mathbf{A})$, a standard embedding model aims to learn a graph encoder $f_i$ that maps nodes in $G_i$ into embedding vectors, preserving both textual ($\mathcal{T}$) and structural ($\mathbf{A}$) knowledge. Therefore, for $m$ TAGs, traditional methods will learn $m$ independent graph encoders, denoted as $\{f_i\}_{i=1}^{m}$.

Motivation. Learning one graph encoder for each particular TAG is the de facto standard in state-of-the-art graph embedding literature. However, we argue that this setting is suboptimal for two main reasons. i) Deployment inefficiency. As discussed above, this learning procedure requires the development of $m$ separate graph encoders for all TAGs, significantly increasing the deployment and maintenance costs in practice. ii) Limited performance. Given the shared textual space across various TAGs, pre-training a language model for a single TAG is inherently less effective because it cannot leverage the text-to-structure knowledge across TAGs. According to the scaling law Kaplan et al. (2020), incorporating more structure-aware textual knowledge for collaborative language model fine-tuning may be advantageous.

Motivated by this, we aim to explore learning from multiple TAGs, as follows.

Learning on Multiple TAGs. Given a set of TAGs $\{G_i \mid i=1,2,\ldots,m\}$, our objective is to develop a single unified graph encoder $f$, such that the textual-to-structure knowledge across $m$ TAGs is collectively preserved within the embedding space.

3 The Proposed Method

In this section, we present the details of UniGLM, as depicted in Figure 1. First, we introduce the contrastive-based collective language model pre-training pipeline in Section 3.1. Then, in Section 3.2, we elaborate on the methodology to adaptively select positive samples for collaborative training. Finally, in Section 3.3, we introduce a simple but effective optimization strategy (Embedding Table) to accelerate our learning process.

3.1 The Overall Pipeline of Collaborative Language Model Pre-training

Learning from multiple TAGs is challenging due to the heterogeneity of textual attributes and graph structures. Existing methods for learning on TAGs either employ GNN-nested Transformers, as in GraphFormers Yang et al. (2021), to capture both textual attributes and their correlations among nodes, or adopt structure-aware objectives to fine-tune LMs Chien et al. (2022). While the former approach is effective, it may encounter efficiency issues due to the combination of GNNs and Transformers. Conversely, the latter approach is computationally efficient, but it predominantly focuses on learning from individual TAGs, leaving joint learning from diverse TAGs relatively under-explored.

To bridge the gap, we pursue the second direction by training a unified language model $f$ using structure-aware learning signals from multiple TAGs $\{G_i\}_{i=1}^{m}$. Specifically, let $v$ represent an arbitrary node across $m$ TAGs, and $\mathcal{T}_v$ denote its corresponding text attribute. The collaborative pre-training objective for node $v$ is defined as:

$$\mathcal{L}_{v} = -\sum_{u\in\mathcal{S}_{v}} \log \frac{\exp\left(\text{sim}(f(\mathcal{T}_{v}), f(\mathcal{T}_{u}))/\tau\right)}{\sum\limits_{k\in\mathcal{B}} \exp\left(\text{sim}(f(\mathcal{T}_{v}), f(\mathcal{T}_{k}))/\tau\right)}, \tag{1}$$

where $\mathcal{S}_v$ denotes the set of structurally similar nodes to $v$, $\mathcal{B}$ is the sampled batch set in mini-batch training with $v\in\mathcal{B}$, and $\tau$ signifies the temperature parameter. $\text{sim}(\cdot,\cdot)$ is a similarity function such as the inner product, and $f$ is implemented as BERT by default, enabling the encoding of text attributes into embedding vectors. By optimizing Eq. (1), the language encoder $f$ is trained to generate similar representations for node $v$ and nodes in $\mathcal{S}_v$, while simultaneously pushing apart the representations of $v$ and other nodes in the mini-batch set $\mathcal{B}$. Notably, as nodes in $\mathcal{B}$ are randomly sampled across $m$ TAGs, Eq. (1) provides a simple yet effective way to learn from nodes in various scenarios due to its instance-wise discriminative nature.
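As an illustration of Eq. (1), the following is a minimal sketch for a single anchor node, written in plain Python for readability. It assumes the inner product as $\text{sim}(\cdot,\cdot)$ and takes precomputed embedding vectors as input; the function names `dot` and `contrastive_loss` and the default temperature are illustrative choices, not taken from the UniGLM code base.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors (the assumed sim function)."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(anchor, positives, batch, tau=0.5):
    """Eq. (1) for one anchor node v.

    anchor:    embedding f(T_v)
    positives: embeddings f(T_u) for every u in S_v
    batch:     embeddings f(T_k) for every k in B (B contains v itself)
    tau:       temperature (0.5 is an arbitrary illustrative default)
    """
    # Denominator of Eq. (1): sum over all nodes in the mini-batch.
    denom = sum(math.exp(dot(anchor, k) / tau) for k in batch)
    loss = 0.0
    for u in positives:
        # One log-ratio term per positive sample.
        loss -= math.log(math.exp(dot(anchor, u) / tau) / denom)
    return loss
```

A practical implementation would batch this computation with a tensor library and a numerically stable log-sum-exp; the loop form above only mirrors the equation term by term.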

While conceptually simple and feasible, learning through Eq. (1) faces two major challenges in practice. C1: The structurally similar node set $\mathcal{S}_v$ of $v$ is not well-defined. Given the heterogeneity of various TAGs and node-level statistics, random sampling based on neighbors may be suboptimal for capturing the diversity between nodes across different TAGs. C2: It presents a trade-off between model performance and training efficiency. The computational costs of Eq. (1) are determined by the number of positive samples ($|\mathcal{S}_v|$) and batch size ($|\mathcal{B}|$), as it requires the language model to encode $(|\mathcal{S}_v|+1)|\mathcal{B}|$ text sequences per iteration. While reducing the number of positive samples can accelerate training, it may degrade performance since sufficient structural information is critical for graph contrastive learning You et al. (2020); Zhu et al. (2021); Zhang et al. (2024). In Section 3.2 and Section 3.3, we introduce two strategies to address these challenges, respectively.
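To make the trade-off in C2 concrete, the short sketch below counts the sequences encoded per iteration under Eq. (1) and, for a fixed encoding budget, the largest batch size that still fits. The budget of 1024 sequences is a hypothetical figure chosen only for illustration, as are the function names.

```python
def sequences_per_iteration(batch_size, num_positives):
    """Sequences the LM encodes per iteration under Eq. (1): (|S_v| + 1) * |B|."""
    return (num_positives + 1) * batch_size


def max_batch_size(sequence_budget, num_positives):
    """Largest |B| that fits a fixed per-iteration budget of encoded sequences."""
    return sequence_budget // (num_positives + 1)


if __name__ == "__main__":
    budget = 1024  # hypothetical number of sequences that fit in GPU memory
    for s in (1, 3, 7, 15):
        b = max_batch_size(budget, s)
        print(f"|S_v|={s:2d}  max |B|={b:4d}  encoded={sequences_per_iteration(b, s)}")
```

Under this simple model, doubling the number of positives per node roughly halves the usable batch size, which is the tension the two strategies aim to relieve.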

3.2 Adaptive Positive Sample Selection

To extract the structurally similar node set $\mathcal{S}_v$ of $v$ in Eq. (1) (C1), the conventional protocol randomly samples some nodes from $v$’s neighbors, i.e., nodes directly connected in the original TAG. However, this seemingly intuitive approach is suboptimal within our learning scenarios, given that neighborhood distributions frequently exhibit significant variability both within and across graphs. To effectively consolidate critical structure information across diverse TAGs, we posit that an advanced sampling strategy should account for the following essential factors.

  • Local Structure. Leveraging the local neighborhood structure, i.e., directly connected nodes, constitutes a fundamental design principle of GNNs Kipf and Welling (2016a). To achieve a unified alignment of text attributes and graph structures across various TAGs, the local neighbors of nodes are indispensable.

  • High-Order Structure. Beyond the local structure, high-order neighbors are crucial for the efficacy of graph machine learning Liu et al. (2020); Li et al. (2021), particularly for long-tail nodes Liu et al. (2021).

  • Graph Statistics. Unlike the standard learning paradigm, learning from multiple TAGs necessitates consideration of the unique characteristics inherent to each graph. For example, node degree distributions vary across TAGs, leading to diverse interpretations of hub nodes. Additionally, node statuses are graph-specific and not directly comparable between TAGs.

Motivated by these observations, we propose a positive sample selection scheme that adaptively samples structurally similar neighbors by considering nodes’ local and high-order neighbors, as well as their unique statuses within each graph. Specifically, we define an Adaptive Positive Sample selection function $\text{AdaPS}(\cdot)$ as follows:

$$\mathcal{S}_{v} = \text{AdaPS}(C_{v}, W_{v}) \tag{2}$$

where the candidate set of positive samples is denoted as $C_v$ and the corresponding sampling weights are denoted as $W_v$. In each iteration, we select positive samples from the candidate set $C_v$, which is adaptively chosen to emphasize nodes with more informative, structure-related attributes. The selection is weighted based on the unique statistical characteristics of each graph. Next, we describe how these candidates and weights are obtained.

Given graph $G_i$ and a central node $v$, the positive sample candidate set $C_v$ of $v$ is denoted as $C_v=\text{Cands}(v,G_i,t)$. We define the function $\text{Cands}(\cdot)$ as follows:

$$\text{Cands}(v, G_i, t) = \begin{cases} N_1(v) + N_h(v) & \text{if } \deg(v), \deg(\overline{G_i}) < t \\ N_1(v) & \text{otherwise} \end{cases} \tag{3}$$

Here, $\deg(v)$ represents the degree of node $v$ and $\deg(\overline{G_i})$ is the average degree of $G_i$. $N_1(v)$ and $N_h(v)$ denote the first-hop and high-order neighbor sets of $v$, respectively. For each central node, we select candidates adaptively by considering both individual node statistics and overall graph metrics. This approach ensures that nodes carrying local or high-order structure information are chosen as candidates for each central node with minimal noise.
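The candidate rule in Eq. (3) can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the condition means that both $\deg(v)$ and the average degree of $G_i$ fall below $t$, interprets $N_1(v) + N_h(v)$ as a set union, and takes the high-order neighbors to be nodes within `hops` steps beyond the first hop (two hops by default). The graph is a plain adjacency dictionary, and all function names are hypothetical.

```python
def high_order(adj, v, hops=2):
    """Nodes at distance 2..hops from v, i.e. beyond the first hop (BFS by levels)."""
    seen = {v} | set(adj[v])
    frontier = set(adj[v])
    result = set()
    for _ in range(hops - 1):
        # Expand one level and drop anything already visited.
        frontier = {w for u in frontier for w in adj[u]} - seen
        result |= frontier
        seen |= frontier
    return result


def cands(adj, v, t, hops=2):
    """Eq. (3): candidate set C_v for central node v in the graph `adj`."""
    deg_v = len(adj[v])
    avg_deg = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
    n1 = set(adj[v])
    # Assumed reading: add high-order neighbors only when both the node
    # degree and the graph's average degree are below the threshold t.
    if deg_v < t and avg_deg < t:
        return n1 | high_order(adj, v, hops)
    return n1
```

For a path graph `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}` with `t=2`, the end node 0 (degree 1) receives its two-hop neighbor as an extra candidate, while node 1 (degree 2) keeps only its direct neighbors.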

To further take personalized node status into consideration, each node within the candidate set for a given central node $v$ is assigned a sampling weight. These weights are calculated by applying the softmax function to the PageRank scores of the candidates as follows.

$$W_{v} = \left\{ \frac{\exp\left(\frac{\text{PR}(u)}{\tau}\right)}{\sum_{k\in C_{v}} \exp\left(\frac{\text{PR}(k)}{\tau}\right)} \;\middle|\; u\in C_{v} \right\} \tag{4}$$

where $\text{PR}(u)$ is the PageRank score of node $u$, and $\tau$ is the temperature parameter that controls the concentration of the probability distribution. This strategy ensures that the selection probability of each candidate increases with its personalized node status, thus effectively leveraging its structural prominence within the graph.
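A minimal sketch of Eqs. (2) and (4) is given below. It assumes that PageRank scores have already been computed and are supplied as a dictionary, and that positives are drawn with replacement; the source does not state either detail, so both are assumptions, as are the function names `sampling_weights` and `ada_ps`.

```python
import math
import random


def sampling_weights(candidates, pagerank, tau=1.0):
    """Eq. (4): softmax over the candidates' PageRank scores."""
    scores = [pagerank[u] / tau for u in candidates]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ada_ps(candidates, pagerank, num_samples, tau=1.0, rng=random):
    """Eq. (2): draw positives from C_v according to W_v.

    Assumption: sampling is with replacement; `rng` may be the `random`
    module or a seeded `random.Random` instance.
    """
    ordered = sorted(candidates)  # fixed order so weights line up with nodes
    if not ordered:
        return []
    weights = sampling_weights(ordered, pagerank, tau)
    return rng.choices(ordered, weights=weights, k=num_samples)
```

Lowering `tau` concentrates the weights on candidates with high PageRank scores, while raising it moves the distribution toward uniform sampling.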

Remark. Our design incorporates both local and higher-order structural information through the candidate selection function $\text{Cands}(\cdot)$, enriching the structural signal extracted from each TAG. By further integrating PageRank scores in the adaptive sampling function $\text{AdaPS}(\cdot)$, the model prioritizes key nodes based on their centrality and adapts to each graph’s unique statistical characteristics. Together, these choices are intended to keep sampling effective across graphs with different topologies.

3.3 The Lazy Contrastive Module

Another challenge in fine-tuning LMs from multiple TAGs using Eq. (1) is the efficiency problem (C2). Given the constraints of GPU memory, there is a trade-off between the training batch size and the maximum number of positive samples considered. Increasing the batch size accelerates training by reducing the number of iterations per epoch, which matters given the large number of training nodes in our learning scenarios, but it comes at the expense of fewer positive samples per node, and vice versa. However, as verified in previous studies Chien et al. (2022); Fang et al. (2024), preserving a sufficient number of structurally similar nodes is crucial to the success of graph contrastive learning.

To address the dilemma, inspired by momentum contrast (MoCo) He et al. (2020), we introduce a lazy contrastive module that treats positive sample encoding as a dictionary look-up operation. Specifically, we establish a dynamic dictionary across various TAGs, which preserves and updates the representations of positive samples on-the-fly using nodes in the batch, thereby avoiding the need to encode the text attributes of positive samples using LMs during training. Formally, we rewrite the standard contrastive loss in Eq. (1) into an efficient version, expressed as:

$$\begin{split}\mathcal{L}_{v} &= -\sum_{u\in\mathcal{S}_{v}} \log \frac{\exp\left(\text{sim}(f(\mathcal{T}_{v}), \mathbf{y}_{u})/\tau\right)}{\sum\limits_{k\in\mathcal{B}} \exp\left(\text{sim}(f(\mathcal{T}_{v}), f(\mathcal{T}_{k}))/\tau\right)} \\ &\text{s.t.}\quad \mathbf{y}_{u} = \text{LookUp}(\mathbf{E}, \text{Idx}(u)).\end{split} \tag{5}$$

Here, $\text{LookUp}(\cdot,\cdot)$ denotes a simple embedding look-up operation based on the embedding table $\mathbf{E}\in\mathbb{R}^{n\times d}$ and the index of node $u$, where $n=\sum_{i=1}^{m}n_i$, $\text{Idx}(u)$ depends on both the graph index and the node index within the graph, and $d$ represents the hidden dimension. Compared to Eq. (1), learning through Eq. (5) is efficient, as the representations of positive samples $\mathbf{y}_{*}$ are obtained via embedding retrieval without the need for explicit text encoding. It is also worth noting that Eq. (5) is memory-efficient since $\mathbf{y}_{*}$ is gradient-free, reducing the number of intermediate tensors kept for gradient calculation. Consequently, a larger batch size can be used to further enhance the training speed. We empirically demonstrate the efficacy of these designs in Table 5 (Appendix). Next, we will show how to effectively implement the lookup operation.

Dictionary Update and Retrieval. Given $m$ TAGs $\{G_i\}_{i=1}^{m}$, we construct a dynamic embedding table $\mathbf{E}$ to store the embeddings of all nodes’ text attributes in the $m$ TAGs. Each node is uniquely identified in $\mathbf{E}$ by combining its own index within the graph and the corresponding graph index. For example, let $v_{i,j}$ represent the $j$-th node in graph $G_i$; then its mapped node index in $\mathbf{E}$ is denoted as $\text{Idx}(v_{i,j})=j+\sum_{k=1}^{i-1}n_k$. Given $v_{i,j}$ and the intermediate LM encoder $f$, we can update $\mathbf{E}$ on-the-fly as follows.

$$\mathbf{E}(\text{Idx}(v_{i,j})) = f(\mathcal{T}_{v_{i,j}}), \quad v_{i,j}\in\mathcal{B}. \tag{6}$$

Here, only nodes in $\mathcal{B}$ are used to update the embedding table $\mathbf{E}$, which gives rise to the name "lazy": the dictionary is updated using the representations encoded in previous iterations. In parallel, given the index of a positive sample $u_{i,j}$ of node $v_{i,j}$, we extract its hidden representation via simple indexing, i.e., $\mathbf{y}_{u_{i,j}}=\mathbf{E}(\text{Idx}(u_{i,j}))$, which is LM encoding-free and can accelerate the training speed.
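The table update, look-up, and lazy loss of Eqs. (5) and (6) can be sketched together as below. This is a plain-Python illustration under stated assumptions: graph and node indices are 0-based (the text uses 1-based $i$), rows are initialised to zero because the source does not say how $\mathbf{E}$ starts out, and the inner product stands in for $\text{sim}(\cdot,\cdot)$. The class and function names are hypothetical.

```python
import math


class EmbeddingTable:
    """Dynamic table E holding one stored embedding per node across all TAGs."""

    def __init__(self, graph_sizes, dim):
        # offsets[i] = number of nodes in graphs 0..i-1 (0-based graph index).
        self.offsets = [0]
        for n in graph_sizes[:-1]:
            self.offsets.append(self.offsets[-1] + n)
        # Assumption: rows start at zero; the initialisation is not specified.
        self.rows = [[0.0] * dim for _ in range(sum(graph_sizes))]

    def idx(self, graph, node):
        """Idx(v_{i,j}) = j + sum_{k<i} n_k, written with 0-based i and j."""
        return node + self.offsets[graph]

    def update(self, batch_nodes, embeddings):
        """Eq. (6): overwrite the rows of the nodes encoded in this batch."""
        for (graph, node), emb in zip(batch_nodes, embeddings):
            self.rows[self.idx(graph, node)] = list(emb)

    def lookup(self, graph, node):
        """LookUp(E, Idx(u)): return the stored embedding without calling the LM."""
        return self.rows[self.idx(graph, node)]


def dot(a, b):
    """Inner product, used here as the similarity function."""
    return sum(x * y for x, y in zip(a, b))


def lazy_contrastive_loss(anchor, positives, table, batch_embs, tau=0.5):
    """Eq. (5) for one anchor; `positives` is a list of (graph, node) pairs."""
    log_denom = math.log(sum(math.exp(dot(anchor, k) / tau) for k in batch_embs))
    # Each positive contributes sim(f(T_v), y_u) / tau - log(denominator).
    return -sum(dot(anchor, table.lookup(g, j)) / tau - log_denom
                for g, j in positives)
```

In a training loop following this sketch, only the nodes in the batch would be passed through the LM; their fresh embeddings feed both the loss and `update`, while positives are read from the table as constants.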

Remark. In contrast to MoCo He et al. (2020), we do not employ an additional momentum LM encoder to update the embedding table $\mathbf{E}$ over time. Instead, we directly utilize the encoded central nodes from previous training steps as the latest representations for updating $\mathbf{E}$. This design not only enhances our training speed, as demonstrated in Appendix Table 5, but also results in notable performance improvements, as empirically verified in Table 1, as it encourages the LM encoder to learn from previous experiences.

| Dataset | Emb Types | MLP | GCN | SAGE | RevGAT |
|---|---|---|---|---|---|
| Computers | SE | 57.34±0.47 (+30.47%) | 70.22±0.40 (+15.55%) | 71.02±0.27 (+15.28%) | 70.54±0.23 (+17.11%) |
| Computers | BERT | 54.04±0.20 (+38.43%) | 66.88±0.24 (+21.32%) | 67.25±0.12 (+21.74%) | 65.69±0.05 (+25.76%) |
| Computers | GIANT | 73.05±0.31 (+2.41%) | 80.09±0.16 (+1.31%) | 81.03±0.12 (+1.04%) | 80.99±0.18 (+2.00%) |
| Computers | PATTON | 71.60±0.30 (+4.48%) | 78.64±0.14 (+3.18%) | 79.98±0.15 (+2.36%) | 78.97±0.23 (+4.61%) |
| Computers | MixGIA | 65.34±0.16 (+14.49%) | 75.13±0.08 (+8.00%) | 75.83±0.20 (+7.97%) | 75.67±0.25 (+9.17%) |
| Computers | UniGLM | 74.81±0.14 | 81.14±0.19 | 81.87±0.10 | 82.61±0.13 |
| Fitness | SE | 78.07±0.18 (+15.78%) | 83.20±0.19 (+9.18%) | 83.65±0.17 (+8.98%) | 84.26±0.16 (+8.12%) |
| Fitness | BERT | 74.92±0.26 (+20.65%) | 80.77±0.23 (+12.47%) | 81.29±0.19 (+12.14%) | 80.96±0.49 (+12.52%) |
| Fitness | GIANT | 89.03±0.07 (+1.53%) | 89.63±0.14 (+1.35%) | 90.15±0.06 (+1.12%) | 90.28±0.10 (+0.91%) |
| Fitness | PATTON | 89.60±0.22 (+0.88%) | 90.03±0.21 (+0.90%) | 90.61±0.12 (+0.61%) | 90.58±0.22 (+0.57%) |
| Fitness | MixGIA | 82.83±0.12 (+9.13%) | 86.05±0.11 (+5.57%) | 86.59±0.16 (+5.28%) | 86.63±0.15 (+5.16%) |
| Fitness | UniGLM | 90.39±0.08 | 90.84±0.08 | 91.16±0.11 | 91.10±0.16 |
| PubMed | SE | 68.06±2.03 (+20.45%) | 73.41±3.02 (+9.06%) | 74.25±2.48 (+10.07%) | 72.56±1.16 (+11.78%) |
| PubMed | BERT | 59.79±2.71 (+37.11%) | 69.96±2.36 (+14.44%) | 63.12±2.43 (+29.48%) | 64.34±3.10 (+26.06%) |
| PubMed | GIANT | 73.18±0.97 (+12.03%) | 76.93±0.73 (+4.07%) | 74.82±0.65 (+9.24%) | 75.54±1.09 (+7.37%) |
| PubMed | PATTON | 79.64±1.30 (+2.94%) | 82.57±0.71 (-3.04%) | 80.26±0.95 (+1.83%) | 80.58±1.94 (+0.66%) |
| PubMed | MixGIA | 71.11±2.51 (+15.29%) | 76.86±0.61 (+4.16%) | 73.71±1.32 (+10.88%) | 73.85±1.01 (+9.83%) |
| PubMed | UniGLM | 81.98±1.32 | 80.06±1.83 | 81.73±1.06 | 81.11±0.69 |
| Photo | SE | 61.24±0.41 (+25.44%) | 71.70±0.16 (+10.88%) | 72.14±0.33 (+11.88%) | 71.63±0.23 (+12.68%) |
| Photo | BERT | 60.03±0.14 (+27.97%) | 69.63±0.42 (+14.17%) | 70.25±0.36 (+14.89%) | 68.79±0.09 (+17.33%) |
| Photo | GIANT | 77.43±0.27 (-0.79%) | 79.79±0.14 (-0.36%) | 81.17±0.23 (-0.57%) | 80.69±0.43 (+0.02%) |
| Photo | PATTON | 74.97±0.43 (+2.47%) | 78.40±0.18 (+1.40%) | 78.79±0.01 (+2.44%) | 79.79±0.14 (+1.15%) |
| Photo | MixGIA | 70.72±0.11 (+8.63%) | 75.88±0.10 (+4.77%) | 77.34±0.09 (+4.36%) | 76.59±0.17 (+5.38%) |
| Photo | UniGLM | 76.82±0.29 | 79.50±0.06 | 80.71±0.29 | 80.71±0.02 |
Table 1: Semi-supervised accuracy results on MLP and state-of-the-art GNNs with various embeddings for the Computers, Fitness, PubMed, and Photo datasets. Values in parentheses give UniGLM's relative improvement over the corresponding baseline. More results can be found in the Appendix.
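The parenthesized percentages in Table 1 are consistent with the relative gain of UniGLM over each baseline, (UniGLM − baseline) / baseline. This reading is inferred from the reported numbers rather than stated explicitly; the short check below reproduces two entries, and the function name is only illustrative.

```python
def relative_improvement(unified: float, baseline: float) -> float:
    """Percentage gain of the unified model over a baseline accuracy."""
    return (unified - baseline) / baseline * 100.0


# Computers, MLP column: UniGLM 74.81 vs. SE 57.34.
print(f"{relative_improvement(74.81, 57.34):+.2f}%")  # +30.47%
# PubMed, GCN column: UniGLM 80.06 vs. PATTON 82.57 (a negative entry).
print(f"{relative_improvement(80.06, 82.57):+.2f}%")  # -3.04%
```

A negative value therefore marks a setting in which the baseline is ahead of UniGLM.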

4 Experiments

Throughout the experiments, we aim to answer the following research questions. RQ1: How does UniGLM perform against leading graph embedding models on node classification and link prediction tasks? RQ2: How well does UniGLM transfer in cross-domain and in-domain scenarios? RQ3: How does each component of UniGLM, i.e., the sampling strategy and the efficient embedding table, contribute to performance? RQ4: What is the impact of different pre-trained language model backbones and hyper-parameters on UniGLM?

4.1 Experiments Setup

We evaluate UniGLM on eight TAG datasets (details in Table 9 in the Appendix). UniGLM is compared with multiple embedding models on node classification and link prediction. Node classification is conducted with an MLP and GNNs (GCN, GraphSAGE, RevGAT), while link prediction uses an MLP and Graph AutoEncoders. Our experiments focus on semi-supervised and transfer learning settings. More details can be found in the Appendix.

4.2 Performance Evaluation

Node Classification. To answer RQ1, we conduct extensive experiments on eight benchmark TAG datasets for node classification under the semi-supervised setting. The results are shown in Table 1 and Table 8 (Appendix). We observe that: ① UniGLM outperforms existing graph embedding models on node classification in most settings, with both state-of-the-art GNNs and an MLP. Table 1 shows that UniGLM achieves the best accuracy across most datasets and models. Although GIANT and PATTON improve over the SE method by leveraging language models and graph structure, UniGLM excels in most cases. ② Performance improves with the volume of data used in training, which points to a scaling-law effect for TAGs. Specifically, as depicted in Figure 2, training UniGLM across all TAGs consistently outperforms the variant OneGLM trained on only a single TAG, with the largest observed performance gap being 10%. Furthermore, as indicated in Table 1 and Table 8, co-purchase networks exhibit more substantial improvements than citation networks.
Link Prediction. To further address RQ1, we evaluate UniGLM on link prediction using a 5% training edge set under a transfer learning setting. We train UniGLM on citation networks and evaluate on co-purchase networks. The results are illustrated in Figure 3 and Figure 9 (Appendix). We observe that ③ UniGLM exhibits robust cross-domain transfer capability, generally surpassing other baselines in link prediction. Specifically, UniGLM achieves over 90% on the Ogbn-Products dataset, surpassing SE and BERT by over 15%.

Figure 2: Comparison of UniGLM and OneGLM with node classification task on GCN.
Figure 3: Comparison of UniGLM and baselines with link prediction task on MLP.

4.3 Transfer Ability

We discuss RQ2 in this section. From the cross-domain perspective, we train the model on one domain and apply it to another. For instance, we train on co-purchase datasets and test on citation networks, or vice versa. These setups are benchmarked against BERT, GIANT fine-tuned on Computers, and PATTON pre-trained on Ogbn-Arxiv, and are evaluated on the node classification task. ④ We observe that training UniGLM on one domain can enhance performance on another domain. Specifically, on History in Table 2, UniGLM trained on citation networks achieves the highest accuracy among the compared embeddings, 80.22%.

Dataset Emb Type Accuracy
History BERT 78.79 ± 0.31
GIA(Computers) 77.35 ± 0.89
PATTON(Arxiv) 79.78 ± 0.29
UniGLM(Citation) 80.22 ± 0.40
Photo BERT 60.03 ± 0.14
GIA(Computers) 75.37 ± 0.48
PATTON(Arxiv) 62.53 ± 0.68
UniGLM(Citation) 63.81 ± 0.36
PubMed BERT 59.79 ± 2.71
GIA(Computers) 57.88 ± 4.56
PATTON(Arxiv) 61.37 ± 1.71
UniGLM(Purchase) 68.29 ± 3.63
Table 2: Accuracy under MLP with cross-domain embeddings. Embedding types under transfer learning settings (cross-domain) are underlined for comparison. Highest results are bolded. Notably, we do not consider GIA(Computers) on Photo a cross-domain scenario, since both are co-purchase networks.

For the in-domain perspective of RQ2, we employ UniGLM, trained on eight TAGs, as the text encoder for the unseen VideoGames dataset and evaluate on both tasks. For link prediction, we train on co-viewed edges and evaluate on co-purchased edges. Results are shown in Table 3. ⑥ We observe that UniGLM exhibits superior in-domain transfer ability. Specifically, UniGLM outperforms BERT on both tasks, by a wide margin on link prediction (79.36 vs. 58.23 AP) and by a smaller margin on node classification (50.88 vs. 50.21 accuracy).

Task Metric UniGLM BERT
Link Prediction AP 79.36 ± 0.01 58.23 ± 1.27
Link Prediction AUC 78.39 ± 0.01 57.54 ± 1.07
Node Classification ACC 50.88 ± 0.47 50.21 ± 0.28
Table 3: In-domain transfer ability: comparison of UniGLM and BERT on the unseen VideoGames dataset.

4.4 Ablation Study

To answer RQ3 from the perspective of sampling strategies, we employ various methods for selecting positive samples, with results shown in Figure 4. ⑦ We observe that it is crucial to consider both node degree and graph density to capture structural information properly. To further answer RQ3, we examine the effect of the embedding table by comparing node classification performance with and without it in Figure 5 and Figure 8. ⑧ We observe that the embedding table not only decreases training time but also improves the performance of UniGLM.

Figure 4: Ablation Study - Effect of Sample Strategy.
Figure 5: Ablation Study - Effect of Embedding Table.

To examine the effect of different pre-trained LM backbones in RQ4, we use BERT, RoBERTa, and DeBERTa, with results shown in Figure 6 and Figure 10. ⑨ We observe that the choice of encoder backbone does not significantly affect the overall performance of UniGLM, although the preferred encoder may vary across datasets. To test the effect of the number of positive samples, the hyper-parameter considered in RQ4, we experiment with various sample sizes and present the results in Figure 7, which reveal that ⑩ results are consistent across sample sizes, i.e., UniGLM is not sensitive to this hyper-parameter.

Figure 6: Ablation Study - Effect of different LLM Backbones with node classification task on MLP.
Figure 7: Hyper-parameter Analysis - Number of Positive Samples.

5 Related Work

Our work is related to the following two directions.

Representation Learning on TAGs. Text-attributed graphs (TAGs) have gained significant attention in both academia and industry. Early methods used shallow embedding techniques, which struggled to integrate textual content with graph structure. With pre-trained language models (PLMs), features are now extracted and fed into graph neural networks (GNNs), but this approach often falls short. Recent studies have developed better methods to integrate PLM features into GNNs, improving model performance Ioannidis et al. (2022); Chien et al. (2022); Zhao et al. (2023a); Jin et al. (2023).

Graph Foundation Models. Graph Foundation Models (GFMs) aim to generalize across various graphs and tasks. Xia et al. (2023) proposed OpenGraph, which excels in zero-shot learning. Self-supervised learning (SSL) is crucial for pre-training GFMs. Zhao et al. (2023b) categorized SSL tasks based on graph-embedded knowledge, enhancing model adaptability. Liu et al. (2023) identified challenges and future directions for GFMs, while Tan et al. (2023) improved GFM generalizability through structure reconstruction with their S2GAE model.

6 Conclusion

In this study, we introduce UniGLM, a unified framework designed to pre-train a single language model for TAGs across domains. UniGLM features two main innovations: (1) an adaptive sampling strategy that selects positive samples and (2) a dynamic embedding table that supplies the representations of these samples without re-encoding them, which speeds up training. We validate UniGLM across diverse TAGs from different domains, where it generally surpasses existing methods in node classification and link prediction, demonstrating its transfer ability, effectiveness, and efficiency.

Limitation

In this work, we primarily concentrate on employing language models as unified embedding frameworks for text-attributed graphs (TAGs). Looking ahead, several interesting avenues emerge for extending this research. First, there is a compelling need to explore foundation embedding models tailored for multimodal graphs. These models could integrate diverse types of data, such as textual, visual, and auditory information, enriching the embeddings and opening up new possibilities for graph analytics. Second, the application of generative language models to graph tasks presents a promising frontier; how UniGLM can be applied in that direction remains unknown. These directions promise not only to expand the capabilities of graph neural networks but also to bridge the gap between structured graph data and unstructured multimodal data.

References

  • Chen et al. (2024) Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2024. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter, 25(2):42–61.
  • Chien et al. (2022) Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang, Olgica Milenkovic, and Inderjit S Dhillon. 2022. Node feature extraction by self-supervised multi-scale neighborhood prediction. In International Conference on Learning Representations.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Fang et al. (2024) Yi Fang, Dongzhe Fan, Daochen Zha, and Qiaoyu Tan. 2024. Gaugllm: Improving graph contrastive learning for text-attributed graphs with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pages 1024–1034.
  • Harris (1954) Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517. International World Wide Web Conferences Steering Committee.
  • He et al. (2024) Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi. 2024. Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133.
  • Ioannidis et al. (2022) Vassilis N. Ioannidis, Xiang Song, Da Zheng, Houyu Zhang, Jun Ma, Yi Xu, Belinda Zeng, Trishul Chilimbi, and George Karypis. 2022. Efficient and effective training of language and graph neural network models. arXiv preprint arXiv:2206.10781.
  • Jin et al. (2023) Bowen Jin, Wentao Zhang, Yu Zhang, Yu Meng, Xinyang Zhang, Qi Zhu, and Jiawei Han. 2023. Patton: Language model pretraining on text-rich networks. arXiv preprint arXiv:2305.12268.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Kipf and Welling (2016b) Thomas N. Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Bayesian Deep Learning Workshop (NIPS 2016).
  • Li et al. (2021) Guohao Li, Matthias Muller, Bernard Ghanem, and Vladlen Koltun. 2021. Training graph neural networks with 1000 layers. In Proceedings of the International Conference on Machine Learning (ICML).
  • Liu et al. (2023) Jiawei Liu, Cheng Yang, Zhiyuan Lu, Junze Chen, Yibo Li, Mengmei Zhang, Ting Bai, Yuan Fang, Lichao Sun, Philip S. Yu, and Chuan Shi. 2023. Towards graph foundation models: A survey and beyond. arXiv preprint arXiv:2310.11829.
  • Liu et al. (2020) Meng Liu, Hongyang Gao, and Shuiwang Ji. 2020. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 338–348.
  • Liu et al. (2021) Zemin Liu, Trung-Kien Nguyen, and Yuan Fang. 2021. Tail-gnn: Tail-node graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1109–1119.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Javen Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 188–197.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine, 29(3):93–93.
  • Tan et al. (2023) Qiaoyu Tan, Ninghao Liu, Xiao Huang, Soo-Hyun Choi, Li Li, Rui Chen, and Xia Hu. 2023. S2gae: Self-supervised graph autoencoders are generalizable learners with graph masking. WSDM ’23, pages 787–795, New York, NY, USA. Association for Computing Machinery.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24.
  • Xia et al. (2023) Lianghao Xia, Ben Kao, and Chao Huang. 2023. Opengraph: Towards open graph foundation models. arXiv preprint arXiv:2403.01121.
  • Yang et al. (2021) Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, and Xing Xie. 2021. Graphformers: Gnn-nested transformers for representation learning on textual graph. Advances in Neural Information Processing Systems, 34:28798–28810.
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823.
  • Zhang et al. (2018) Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2018. Network representation learning: A survey. IEEE transactions on Big Data, 6(1):3–28.
  • Zhang et al. (2024) Xin Zhang, Qiaoyu Tan, Xiao Huang, and Bo Li. 2024. Graph contrastive learning with personalized augmentation. IEEE Transactions on Knowledge and Data Engineering.
  • Zhao et al. (2023a) Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian Liu, Rui Li, Xing Xie, and Jian Tang. 2023a. Learning on large-scale text-attributed graphs via variational inference. In International Conference on Learning Representations. ICLR.
  • Zhao et al. (2023b) Ziwen Zhao, Yuhua Li, Yixiong Zou, Ruixuan Li, and Rui Zhang. 2023b. A survey on self-supervised pre-training of graph foundation models: A knowledge-based perspective. arXiv preprint arXiv:2403.16137.
  • Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI open, 1:57–81.
  • Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pages 2069–2080.

Appendix A Appendix

Datasets. We evaluate the proposed UniGLM framework using eight publicly available TAG datasets. These datasets include two citation networks, namely PubMed Sen et al. (2008) and Ogbn-Arxiv Hu et al. (2020), one co-purchase network, Ogbn-Products (subset), from TAPE He et al. (2024), and five E-commerce datasets extracted from Amazon Ni et al. (2019): Electronics-Computers (Computers), Books-History (History), Books-Children (Children), Sports-Fitness (Fitness), and Electronics-Photography (Photo). Notably, we also use another co-purchase network, VideoGames, from He and McAuley (2016) and McAuley et al. (2015). For node classification, we adhere to the standard data splits used in prior research for PubMed, Ogbn-Arxiv, and Ogbn-Products, while we use a 5:20:75 split for the E-commerce datasets. For link prediction, we adopt the widely used 5:5:90 split and sample an equal number of negative links as positive links.

  • PubMed Sen et al. (2008). The PubMed dataset consists of 19,717 scientific publications from the PubMed database. The citation network consists of 44,338 links.

  • Ogbn-Arxiv Hu et al. (2020). The Ogbn-Arxiv dataset is a directed graph representing the citation network between all Computer Science (CS) arXiv papers.

  • Ogbn-Products (subset) He et al. (2024). The Ogbn-Products dataset represents an Amazon product co-purchasing network, with product descriptions as raw text.

  • Electronics-Computers (Computers) Ni et al. (2019). The Electronics-Computers dataset is a segment of the Amazon co-purchase graph, where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews by default, and class labels are given by the product category.

  • Books-History (History) Ni et al. (2019). Each node represents a book in the history domain, edges indicate that two books are frequently bought together, and class labels are given by the book category.

  • Books-Children (Children) Ni et al. (2019). Each node represents a book in the children's domain, edges indicate that two books are frequently bought together, and class labels are given by the book category.

  • Sports-Fitness (Fitness) Ni et al. (2019). Each node represents an item in the Sports and Fitness category.

  • Electronics-Photography (Photo) Ni et al. (2019). This dataset contains review data from the Amazon platform. Each node represents a review related to an item in the camera category.
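The 5:5:90 link split and the one-to-one negative sampling described above can be sketched as follows. This is a minimal sketch under assumptions: the ratio is read as train/validation/test, edges are treated as undirected, and the helper names `split_edges` and `sample_negatives` are hypothetical rather than taken from the paper's code.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def split_edges(edges: List[Edge], seed: int = 0):
    """Shuffle edges and cut them 5% / 5% / 90% (train / val / test; order assumed)."""
    rng = random.Random(seed)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 5 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def sample_negatives(num_nodes: int, positives: List[Edge], k: int, seed: int = 0) -> List[Edge]:
    """Draw k distinct node pairs that are not positive edges.

    Rejection sampling; assumes the graph is sparse enough that k such pairs exist.
    """
    rng = random.Random(seed)
    existing: Set[Edge] = {(min(u, v), max(u, v)) for u, v in positives}
    negatives: Set[Edge] = set()
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = (min(u, v), max(u, v))
        if u != v and pair not in existing:
            negatives.add(pair)
    return sorted(negatives)


# Toy ring graph with 100 nodes and 100 edges.
edges = [(i, (i + 1) % 100) for i in range(100)]
train, val, test = split_edges(edges)
test_negatives = sample_negatives(100, edges, k=len(test))
print(len(train), len(val), len(test), len(test_negatives))  # 5 5 90 90
```

Sampling negatives against the full edge set, as done here, avoids labeling a held-out positive edge as a negative.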

Baselines. We compare UniGLM with five textual feature extraction methods. Four of them, SE, BERT, GIANT, and PATTON, were introduced in Section 1. To comprehensively compare UniGLM with existing methods on multiple datasets, we implement MixGIA as an additional baseline. MixGIA applies the GIANT framework to multiple datasets, leveraging the structural information of each dataset to fine-tune a shared BERT model. Specifically, the original GIANT constructs a clustering tree to classify each node layer by layer. Our implementation of MixGIA sets a fixed number of layers for the clustering tree and changes the tree along with the dataset at each layer.

Experimental Settings. For node classification, we feed the node embeddings from UniGLM into an MLP and several state-of-the-art GNN models, including GCN Kipf and Welling (2016a), GraphSAGE Hamilton et al. (2017), and RevGAT Li et al. (2021). We run each experiment 5 times and report the mean and standard deviation. We explore link prediction under the transfer learning setting; details are given in the link prediction evaluation in Section 4.2. We use AUC and AP as metrics for link prediction, and test with an MLP and Graph AutoEncoders Kipf and Welling (2016b). For reproducibility, we employ GNN implementations from the PyG (Fey and Lenssen, 2019) package. The default language model backbone is BERT if not specified. Hyper-parameters are shown below.
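For reference, the two link prediction metrics and the mean ± standard deviation reporting can be written in a few lines. The sketch below uses the textbook definitions of AUC and AP on a toy example; the scores are made up for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
from statistics import mean, stdev
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP: mean of precision@k taken at the rank of every positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


def summarize(runs: Sequence[float]) -> str:
    """Format repeated-run results as mean±std (sample std; an assumption)."""
    return f"{mean(runs):.2f}±{stdev(runs):.2f}"


# Toy edge scores: label 1 marks a true edge, label 0 a sampled negative.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))  # 0.75
print(round(average_precision(labels, scores), 4))  # 0.8333
```

In practice a library routine would normally be used for both metrics; the explicit form only makes clear what is being averaged.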

Hyperparameters. Our GNN hyperparameters are set the same as in TAPE He et al. (2024). For UniGLM, the hyperparameters are listed in Table 4:

Hyperparameter Value
BATCH SIZE 64
TEMPERATURE 0.3
MAX SEQUENCE LENGTH 512
NUM POS SAMPLES (t) 6
Table 4: Hyperparameters of UniGLM
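To show where these values would enter a training script, the sketch below stores Table 4 in a configuration object and applies the temperature in a generic InfoNCE loss for a single anchor. The field names, the cosine similarity, and the InfoNCE form are assumptions made for illustration; they are not taken from the paper's code.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UniGLMConfig:
    """Values copied from Table 4; the field names are illustrative."""
    batch_size: int = 64
    temperature: float = 0.3
    max_sequence_length: int = 512
    num_pos_samples: int = 6


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature: float) -> float:
    """Generic InfoNCE loss for one anchor (assumed form, not the paper's exact loss)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    log_denominator = math.log(sum(math.exp(z) for z in logits))
    return log_denominator - logits[0]


cfg = UniGLMConfig()
loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], cfg.temperature)
print(round(loss, 4))
```

In this form a smaller temperature sharpens the softmax over similarities, so a well-separated positive incurs a smaller loss.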

Device and time cost. We compare the training and inference time cost of GIANT, PATTON, and UniGLM. For GIANT and PATTON, the training time is the sum over training on each of the 8 TAGs listed above. For UniGLM, it is the time to train one model on all 8 TAGs. Inference time is the time each method takes to encode Ogbn-Arxiv. We use an A800 (80G) GPU for training and inference.

Metric GIANT PATTON UniGLM
Training Time 14.24h 5.4d 13.5h
Inference Time 30min 5h15min 29min
Table 5: Time Cost of Different Frameworks
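Because Table 5 mixes hours, days, and minutes, the ratios are easier to read after converting to common units. The short calculation below does this for the reported numbers; the dictionary keys and the `speedup` helper are illustrative, with "unified" standing for the last column of the table.

```python
HOURS_PER_DAY = 24

# Training times from Table 5, converted to hours.
training_hours = {"GIANT": 14.24, "PATTON": 5.4 * HOURS_PER_DAY, "unified": 13.5}
# Inference times from Table 5, converted to minutes.
inference_minutes = {"GIANT": 30, "PATTON": 5 * 60 + 15, "unified": 29}


def speedup(baseline: float, unified: float) -> float:
    """How many times longer the baseline takes than the unified model."""
    return baseline / unified


for name in ("GIANT", "PATTON"):
    train = speedup(training_hours[name], training_hours["unified"])
    infer = speedup(inference_minutes[name], inference_minutes["unified"])
    print(f"{name}: training {train:.2f}x, inference {infer:.2f}x")
# GIANT: training 1.05x, inference 1.03x
# PATTON: training 9.60x, inference 10.86x
```

The unified model is thus roughly on par with GIANT in cost and about an order of magnitude cheaper than PATTON on both measures.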

Additional Results. In this section, we present further results and details omitted from the main text due to space constraints. These additional findings further substantiate our research claims and provide a deeper understanding of our study's implications.

Figure 8 shows the results with and without the embedding table design.

Figure 8: Ablation Study - Effect of Embedding Table.

Figure 9 shows the results of using GAE as the inference model for link prediction.

Figure 9: Average Precision with GAE

Figure 10 shows the results of using different LLMs as the backbone.

Figure 10: Ablation Study - Effect of different LLM Backbones with node classification task on GCN.

Table 6 shows additional results demonstrating the effectiveness of our embedding table module.

Dataset Emb Type Accuracy
History with E 82.51 ± 0.08
w/o E 81.96 ± 0.29
Computers with E 81.14 ± 0.19
w/o E 80.14 ± 0.14
Children with E 50.26 ± 0.15
w/o E 49.07 ± 0.25
Fitness with E 90.84 ± 0.08
w/o E 90.68 ± 0.10
Table 6: More results of Ablation Study - Effect of Embedding Table (GCN).

Table 7 shows additional results demonstrating the effectiveness of our adaptive sampling strategy.

Dataset Emb Type Accuracy
History w/o Node Degree 79.84 ± 0.24
w/o Graph Density 83.07 ± 0.31
with both 83.42 ± 0.18
PubMed w/o Node Degree 76.04 ± 1.59
w/o Graph Density 82.35 ± 0.83
with both 81.98 ± 1.32
Children w/o Node Degree 47.57 ± 0.29
w/o Graph Density 51.38 ± 0.20
with both 51.86 ± 0.31
Fitness w/o Node Degree 90.03 ± 0.11
w/o Graph Density 88.62 ± 0.07
with both 90.39 ± 0.08
Ogbn-Products w/o Node Degree 74.96 ± 0.57
w/o Graph Density 76.56 ± 1.14
with both 76.46 ± 0.21
Table 7: More results of Ablation Study - Effect of Sample Strategy(MLP).

Table 8 reports the remaining node classification results.

Dataset Emb Types MLP GCN SAGE RevGAT
Children SE 38.84±0.35 (+33.52%) 43.19±0.31 (+16.37%) 44.83±0.24 (+16.28%) 43.00±0.18 (+19.40%)
BERT 44.00±0.83 (+17.86%) 46.88±0.60 (+7.21%) 47.97±0.55 (+8.67%) 48.00±0.11 (+6.96%)
GIANT 48.95±0.23 (+5.94%) 48.47±0.35 (+3.69%) 51.41±0.42 (+1.40%) 50.63±0.36 (+1.40%)
PATTON 49.91±0.13 (+3.91%) 49.98±0.38 (+0.56%) 52.01±0.50 (+0.23%) 51.07±0.21 (+0.53%)
MixGIA 47.49±0.19 (+9.20%) 48.89±0.25 (+2.80%) 50.60±0.19 (+3.02%) 49.73±0.37 (+3.24%)
UniGLM 51.86±0.31 50.26±0.15 52.13±0.34 51.34±0.22
Ogbn-Products SE 53.85±0.17 (+41.99%) 70.52±0.51 (+9.34%) 69.13±0.26 (+12.98%) 69.64±0.17 (+13.47%)
BERT 67.58±0.28 (+13.14%) 74.77±0.87 (+3.13%) 74.09±0.27 (+5.41%) 74.53±0.26 (+6.02%)
GIANT 72.46±0.33 (+5.52%) 69.77±0.42 (+10.52%) 68.69±1.19 (+13.70%) 71.89±0.30 (+9.92%)
PATTON 76.42±0.23 (+0.05%) 77.22±0.34 (-0.14%) 77.81±0.58 (+0.37%) 78.48±0.15 (+0.69%)
MixGIA 71.04±0.38 (+7.63%) 76.13±0.82 (+1.29%) 75.77±0.40 (+3.08%) 76.24±0.33 (+3.65%)
UniGLM 76.46±0.21 77.11±0.41 78.10±0.22 79.02±1.12
History SE 73.19±0.36 (+13.98%) 77.03±0.70 (+7.11%) 77.63±0.22 (+7.29%) 77.83±0.27 (+6.68%)
BERT 78.79±0.31 (+5.88%) 80.33±0.33 (+2.71%) 80.63±0.33 (+3.30%) 80.69±0.22 (+2.90%)
GIANT 81.37±0.32 (+2.52%) 80.88±0.19 (+2.02%) 82.25±0.19 (+1.26%) 81.70±0.26 (+1.63%)
PATTON 82.88±0.24 (+0.65%) 82.41±0.20 (+0.12%) 83.43±0.18 (-0.17%) 82.94±0.07 (+0.11%)
MixGIA 81.47±0.32 (+2.39%) 81.68±0.24 (+1.02%) 82.55±0.23 (+0.90%) 82.26±0.26 (+0.94%)
UniGLM 83.42±0.18 82.51±0.08 83.29±0.17 83.03±0.23
Ogbn-Arxiv SE 64.30±0.09 (+13.56%) 71.74±0.29 (+2.48%) 71.49±0.27 (+4.15%) 74.02±0.18 (+0.19%)
BERT 66.29±0.20 (+10.15%) 72.63±0.31 (+1.23%) 73.33±0.33 (+1.54%) 72.88±0.39 (+1.76%)
GIANT 73.08±0.06 (-0.08%) 73.29±0.10 (-0.31%) 74.59±0.28 (-0.17%) 75.96±0.09 (-2.37%)
PATTON 73.47±0.11 (-0.61%) 73.59±0.20 (-0.08%) 75.00±0.16 (-0.72%) 74.08±0.12 (+0.11%)
MixGIA 67.62±0.24 (+7.99%) 73.07±0.30 (+0.62%) 73.70±0.10 (+1.03%) 73.57±0.39 (+0.80%)
UniGLM 73.02±0.11 73.52±0.23 74.46±0.12 74.16±0.51
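Table 8: Semi-supervised accuracy results on MLP and state-of-the-art GNNs with various embeddings for the Children, Ogbn-Products, History, and Ogbn-Arxiv datasets. Values in parentheses give UniGLM's relative improvement over the corresponding baseline.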

Data #Nodes #Edges #Avg Tokens #Classes
PubMed 19,717 44,338 391 3
Ogbn-Arxiv 169,343 1,166,243 235 40
Ogbn-Products 54,025 74,420 163.18 47
Electronics-Computers 87,229 808,310 117 10
Books-History 41,551 400,125 302 12
Books-Children 76,875 1,631,453 280 27
Sports-Fitness 173,055 1,946,555 30 13
Electronics-Photography 48,362 549,290 191 12
Table 9: Dataset statistics of eight text-attributed graphs (TAGs).