HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification

He Zhu1∗, Junran Wu1∗, Ruomei Liu1, Yue Hou1, Ze Yuan1, Shangzhe Li3,
Yicheng Pan1,2†, and Ke Xu1,2
1State Key Lab of Software Development Environment,
Beihang University, Bei**g, 100191, China
2Zhongguancun Laboratory, Bei**g, 100094, China
3School of Statistics and Mathematics,
Central University of Finance and Economics, Bei**g 100081, China
{roy_zh, wu_junran, rmliu, hou_yue, yuanze1024, yichengp, kexu}@buaa.edu.cn,
[email protected]
Abstract

Existing self-supervised methods in natural language processing (NLP), especially hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning, extremely relying on human-designed augmentation rules to generate contrastive samples, which can potentially corrupt or distort the original information. In this paper, we tend to investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately reserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely Hierarchy-aware Information Lossless contrastive Learning (HILL), which consists of a text encoder representing the input document, and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL.

HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification


He Zhu1∗, Junran Wu1∗, Ruomei Liu1, Yue Hou1, Ze Yuan1, Shangzhe Li3, Yicheng Pan1,2†, and Ke Xu1,2 1State Key Lab of Software Development Environment, Beihang University, Bei**g, 100191, China 2Zhongguancun Laboratory, Bei**g, 100094, China 3School of Statistics and Mathematics, Central University of Finance and Economics, Bei**g 100081, China {roy_zh, wu_junran, rmliu, hou_yue, yuanze1024, yichengp, kexu}@buaa.edu.cn, [email protected]


footnotetext: Equal Contribution.footnotetext: Correspondence to: Yicheng Pan.

1 Introduction

Self-supervised learning (SSL) has exhibited remarkable success across various domains in deep learning, empowering models with potent representation capabilities. Based on these achievements, researchers have incorporated contrastive learning into hierarchical text classification (Wang et al., 2022a), a challenging sub-task within the realm of text multi-label classification. Beyond processing text samples, hierarchical text classification methods should handle a predefined directed acyclic graph in the corpus, referred to as the label hierarchy. While language models such as BERT (Devlin et al., 2019) are pretrained on textual data, their efficacy in handling structural input is limited. Consequently, researchers have introduced a Graph Neural Network (GNN)-based encoder to establish a dual-encoder framework for HTC (Zhou et al., 2020; Deng et al., 2021; Chen et al., 2021).

Refer to caption
(a) Previous Methods
Refer to caption
(b) Ours

Figure 1: Comparison between HILL and previous methods. (a) Previous work use structure encoder in data augmentation. (b) Our method extracting syntactic information in information lossless learning paradigm.

Although the structure encoder contributes to representing the label hierarchy, the dual-encoder framework simply blends the outputs of encoders. In an effort to integrate the label hierarchy into BERT, Wang et al. (2022a) propose a contrastive learning framework, in which BERT functions as a siamese network (Bromley et al., 1993) accepting both the raw text and the masked text, where the mask is generated by the structure encoder. However, their contrastive learning process essentially relies on data augmentation, even with the involvement of the structure encoder in the masking process, as depicted in Figure 1(a). According to the data processing inequality (Cover and Thomas, 2006), applying data augmentation to the raw text may potentially erase sufficient semantic information relevant to downstream prediction.

To maximally preserve the semantic information in the text and effectively leverage the structural encoder in the contrastive learning process for HTC, we tend to inject the essential information inherent in the label hierarchy into text representations rather than augmenting the input document. As shown in Figure 1(b), our structure encoder directly generates feature vectors by fusing textual and structural information, in contrast to masking the text as illustrated in Figure 1(a). Following the insights of Li and Pan (2016), where structural entropy encodes and decodes the essential structure of the original system to support its semantic analysis. Since hierarchical text classification holds both semantic and syntactic information, we aim to design our model with the guidance of structural entropy. Specifically, we implement a suite of algorithms to minimize the structural entropy of label hierarchies by constructing their coding trees. Subsequently, we design a structure encoder to perform representation learning on coding trees, in which the leaf-node embeddings are initialized by the text encoder while the non-leaf node embeddings are iteratively obtained from bottom to top. Afterward, the structure encoder generates an overall representation of the coding tree, which serves as a contrastive sample for the text encoder. Additionally, we provide a definition of information lossless learning and prove that the information retained by our approach is the upper bound of any augmented data. In comparison with other contrastive and supervised learning methods, our model achieves significant performance gains on three common datasets. Overall, the contributions of our work can be summarized as follows:

  • To realize information lossless learning, we decode the essential structure of label hierarchies through the proposed algorithms under the guidance of structural entropy, supporting the semantic analysis for HTC.

  • We propose a contrastive learning framework, namely HILL, which fuses the structural information from the label hierarchies into the given document embeddings, while the semantic information from the input document is maximally preserved.

  • We define information lossless learning for HTC and prove that the information retained by HILL is the upper bound of any other augmentation-based methods.

  • Experiments conducted on three common datasets demonstrate the effectiveness and efficiency of HILL. For reproducibility, source code is available at https://github.com/Rooooyy/HILL.

2 Related Works

Hierarchical Text Classification.

Existing works for HTC could be categorized into local and global approaches (Zhou et al., 2020). Local approaches build multiple models for labels in different levels in the hierarchy, conveying the information from models in the upper levels to those in the bottom (Kowsari et al., 2017; Shimura et al., 2018; Banerjee et al., 2019; Huang et al., 2019). On the contrary, global studies treat HTC as a flat multi-label classification problem(Gopal and Yang, 2013; You et al., 2019; Aly et al., 2019; Mao et al., 2019; Wu et al., 2019; Rojas et al., 2020).

Recently, Zhou et al. (2020) introduce a dual-encoder framework consisting of a text and a graph encoder to separately handle the text and the label hierarchy. Based on HiAGM (Zhou et al., 2020), Chen et al. (2020a) jointly model the text and labels in the hyperbolic space. Chen et al. (2021) formulate HTC as a semantic matching problem. Deng et al. (2021) introduce information maximization to capture the interaction between text and label while erasing irrelevant information.

Given the success of Pretrained Language Models (PLMs), researchers attempt to utilize their powerful abilities in HTC. Wang et al. (2022a) propose a contrastive learning framework for HTC to make BERT learn from the structure-encoder-controlled text augmentation. Wang et al. (2022b) introduce prompt tuning and construct dynamic templates for HTC. Jiang et al. (2022) encode the global hierarchies with BERT, extract the local hierarchies, and feed them into BERT in a prompt-tuning-like schema. Despite their success, neither existing contrastive learning nor prompt tuning methods try to improve the structure encoder.

Contrastive Learning.

Inspired by the pretext tasks in GPT (Radford and Narasimhan, 2018) and BERT (Devlin et al., 2019), researchers originally proposed contrastive learning in computer vision (Chen et al., 2020b; He et al., 2020), addressing the limitations of previous methods in training with massive unlabeled visual data. Numerous studies have shown that the key to contrastive learning lies in the construction of positive pairs (Tian et al., 2020; Caron et al., 2020; Jaiswal et al., 2020; Grill et al., 2020), especially in natural language processing (Wu et al., 2020; Yan et al., 2021; Meng et al., 2021; Pan et al., 2022).

Structural Entropy.

Structural entropy (Li and Pan, 2016) is a natural extension of Shannon entropy (Shannon, 1948) on a structural system, which could measure the structure complexity of the system. Non-Euclidean data, especially graph data, is a typical structured system. The structural entropy of a graph is defined as the average length of the codewords obtained by a random walk under a specific coding scheme, namely coding tree (Li and Pan, 2016). In the past few years, structural entropy has been successfully applied in community security (Liu et al., 2019) graph classification (Wu et al., 2022b; Yang et al., 2023), text classification (Zhang et al., 2022), graph pooling (Wu et al., 2022a), graph contrastive learning (Wu et al., 2023), and node clustering (Wang et al., 2023).

Refer to caption
Figure 2: An example of our model with K=3𝐾3K=3italic_K = 3. We first feed the document D𝐷Ditalic_D into the text encoder to extract the semantic information. Next, the structure encoder takes label hierarchy GLsubscript𝐺𝐿G_{L}italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT as input and constructs the optimal coding tree TLsubscript𝑇𝐿T_{L}italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT with Algorithm 1 under the guidance of structural entropy. In the hierarchical representation learning module, the leaf node embeddings are initialized by the document embeddings, and the representations of non-leaf nodes are learned from bottom to top. The structure encoder finally generates an information lossless positive view for the text encoder, which is formulated in Section 8 and proved in Appendix C.

3 Methodology

In this section, we first give a problem definition of hierarchical text classification. Next, we elaborate on the working process of the text encoder (in Section 3.2) and the structure encoder (in Section 3.3) of the proposed HILL. Theoretical analysis is further given to reveal the information lossless property of HILL for HTC in Section 8. Overall, the framework of HILL is shown in Figure 2.

3.1 Problem Definition

In hierarchical text classification, the label set is predefined and represented as a directed acyclic graph, namely the label hierarchy. Every label that appears in the corpus corresponds to a unique node on the hierarchy. Each non-root node is pointed by only one node in the upper levels, i.e. its parent node. In the ground-truth label set 𝒴𝒴\mathcal{Y}caligraphic_Y of any sample, a non-root label yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT always co-occurs with its parent nodes, put differently, for any yi𝒴subscript𝑦𝑖𝒴y_{i}\in\mathcal{Y}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_Y, the parent node of yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is also in 𝒴𝒴\mathcal{Y}caligraphic_Y. Given a document D𝐷Ditalic_D to be classified, where D={w1,w2,,wN}𝐷subscript𝑤1subscript𝑤2subscript𝑤𝑁D=\{w_{1},w_{2},\dots,w_{N}\}italic_D = { italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT } is commonly treated as a sequence with N𝑁Nitalic_N tokens, an HTC model should predict a subset 𝒴^^𝒴\hat{\mathcal{Y}}over^ start_ARG caligraphic_Y end_ARG of the complete label set 𝕐𝕐\mathbb{Y}blackboard_Y.

3.2 Text Encoder

Our framework is compatible with multiple document representation models. To maintain consistency with previous works, we utilize BERT (Devlin et al., 2019) as the text encoder.

First, the input document D𝐷Ditalic_D is tokenized into a sequence with N𝑁Nitalic_N tokens and then padded with two special tokens:

D~={[CLS],w1,w2,,wN,[SEP]},~𝐷delimited-[]𝐶𝐿𝑆subscript𝑤1subscript𝑤2subscript𝑤𝑁delimited-[]𝑆𝐸𝑃\tilde{D}=\{[CLS],w_{1},w_{2},\dots,w_{N},[SEP]\},over~ start_ARG italic_D end_ARG = { [ italic_C italic_L italic_S ] , italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , [ italic_S italic_E italic_P ] } , (1)

where [CLS]delimited-[]𝐶𝐿𝑆[CLS][ italic_C italic_L italic_S ] and [SEP]delimited-[]𝑆𝐸𝑃[SEP][ italic_S italic_E italic_P ] are respectively recognized as the beginning and the end of the document.

Next, the BERT encoder takes the padded document D~~𝐷\tilde{D}over~ start_ARG italic_D end_ARG as input and generates hidden representations of each token, formally, HBERT=BERT(D~),subscript𝐻𝐵𝐸𝑅𝑇subscript𝐵𝐸𝑅𝑇~𝐷H_{BERT}=\mathcal{F}_{BERT}(\tilde{D}),italic_H start_POSTSUBSCRIPT italic_B italic_E italic_R italic_T end_POSTSUBSCRIPT = caligraphic_F start_POSTSUBSCRIPT italic_B italic_E italic_R italic_T end_POSTSUBSCRIPT ( over~ start_ARG italic_D end_ARG ) , where HBERT(N+2)×dBsubscript𝐻𝐵𝐸𝑅𝑇superscript𝑁2subscript𝑑𝐵H_{BERT}\in\mathbb{R}^{(N+2)\times d_{B}}italic_H start_POSTSUBSCRIPT italic_B italic_E italic_R italic_T end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_N + 2 ) × italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the token embedding matrix while BERT()subscript𝐵𝐸𝑅𝑇\mathcal{F}_{BERT}(\cdot)caligraphic_F start_POSTSUBSCRIPT italic_B italic_E italic_R italic_T end_POSTSUBSCRIPT ( ⋅ ) denotes the holistic BERT model. Afterward, the [CLS]delimited-[]𝐶𝐿𝑆[CLS][ italic_C italic_L italic_S ] embedding is taken as the representation of the entire document. That is, hD=HBERT[CLS]=HBERT0,subscript𝐷superscriptsubscript𝐻𝐵𝐸𝑅𝑇delimited-[]𝐶𝐿𝑆superscriptsubscript𝐻𝐵𝐸𝑅𝑇0h_{D}=H_{BERT}^{[CLS]}=H_{BERT}^{0},italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT = italic_H start_POSTSUBSCRIPT italic_B italic_E italic_R italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT [ italic_C italic_L italic_S ] end_POSTSUPERSCRIPT = italic_H start_POSTSUBSCRIPT italic_B italic_E italic_R italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , where hDdBsubscript𝐷superscriptsubscript𝑑𝐵h_{D}\in\mathbb{R}^{d_{B}}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the document embedding, and dBsubscript𝑑𝐵d_{B}italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT is the hidden size of BERT.

3.3 Structure Encoder

To implement information lossless contrastive learning in the structure encoder, we propose an algorithm to extract structural information from the label hierarchy via structural entropy (Li and Pan, 2016) minimization and a hierarchical representation learning mechanism to inject the structural information into text embeddings. Thereafter, the structure encoder generates a positive view of the document that retains both semantic and structural information losslessly.

Structural Entropy.

In Li and Pan (2016), the structural entropy of a graph G=(VG,EG)𝐺subscript𝑉𝐺subscript𝐸𝐺G=(V_{G},E_{G})italic_G = ( italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ) is defined as the average length of the codewords obtained by a random walk under a specific coding pattern named coding tree, formally,

HT(G)=vTgvvol(G)logvol(v)vol(v+),superscript𝐻𝑇𝐺subscript𝑣𝑇subscript𝑔𝑣𝑣𝑜𝑙𝐺𝑣𝑜𝑙𝑣𝑣𝑜𝑙superscript𝑣H^{T}(G)=-\sum_{v\in T}\frac{g_{v}}{vol(G)}\log{\frac{vol(v)}{vol(v^{+})}},italic_H start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_G ) = - ∑ start_POSTSUBSCRIPT italic_v ∈ italic_T end_POSTSUBSCRIPT divide start_ARG italic_g start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_ARG start_ARG italic_v italic_o italic_l ( italic_G ) end_ARG roman_log divide start_ARG italic_v italic_o italic_l ( italic_v ) end_ARG start_ARG italic_v italic_o italic_l ( italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) end_ARG , (2)

where v𝑣vitalic_v is a non-root node of coding tree T𝑇Titalic_T which represents a subset of VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT, v+superscript𝑣v^{+}italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT is the parent node of v𝑣vitalic_v on the coding tree. gvsubscript𝑔𝑣g_{v}italic_g start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT represents the number of v𝑣vitalic_v’s cut edges on G𝐺Gitalic_G. vol(G)𝑣𝑜𝑙𝐺vol(G)italic_v italic_o italic_l ( italic_G ) denotes the volume of graph G𝐺Gitalic_G while vol(v)𝑣𝑜𝑙𝑣vol(v)italic_v italic_o italic_l ( italic_v ) and vol(v+)𝑣𝑜𝑙superscript𝑣vol(v^{+})italic_v italic_o italic_l ( italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) is the sum of the degree of nodes partitioned by v𝑣vitalic_v and v+superscript𝑣v^{+}italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT.

The height of the coding tree should be fixed to formulate a certain coding scheme. Therefore, the K𝐾Kitalic_K-dimensional structural entropy of the graph G𝐺Gitalic_G determined by the coding tree T𝑇Titalic_T with a certain height K𝐾Kitalic_K is defined as:

HK(G)=min{T|height(T)K}HT(G).subscript𝐻𝐾𝐺subscriptconditional-set𝑇𝑒𝑖𝑔𝑡𝑇𝐾superscript𝐻𝑇𝐺H_{K}(G)=\min_{\{T|height(T)\leq K\}}H^{T}(G).italic_H start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ( italic_G ) = roman_min start_POSTSUBSCRIPT { italic_T | italic_h italic_e italic_i italic_g italic_h italic_t ( italic_T ) ≤ italic_K } end_POSTSUBSCRIPT italic_H start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_G ) . (3)

More details about structural entropy and coding trees are provided in Appendix A.

Structural Entropy Minimization.

To minimize the structural entropy is to construct the optimal coding tree of graph G𝐺Gitalic_G. Thus, we design two algorithms to heuristically construct a coding tree T𝑇Titalic_T with height K𝐾Kitalic_K. In Algorithm 1, we take VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT as the leaf nodes, connect them directly to the root node vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, and call Algorithm 2 to construct an initial coding tree T𝑇Titalic_T. Due to different scopes, we use T𝑇Titalic_T and 𝐓𝐓\mathbf{T}bold_T to distinguish the coding trees in Algorithm 1 and Algorithm 2. Algorithm 2 creates a new coding tree 𝐓𝐓\mathbf{T}bold_T of height 1 with 𝐯𝐯\mathbf{v}bold_v and iteratively compresses two child nodes from the children set C(𝐯)𝐶𝐯C(\mathbf{v})italic_C ( bold_v ) of root node 𝐯𝐯\mathbf{v}bold_v in the first while𝑤𝑖𝑙𝑒whileitalic_w italic_h italic_i italic_l italic_e loop (lines 3-6), prioritizing the nodes that result in the largest reduction in structural entropy. Since tree 𝐓𝐓\mathbf{T}bold_T’s height may exceed K𝐾Kitalic_K, in the second while𝑤𝑖𝑙𝑒whileitalic_w italic_h italic_i italic_l italic_e loop (lines 7-10), we iteratively remove non-leaf nodes of 𝐓𝐓\mathbf{T}bold_T, prioritizing nodes with the smallest entropy increase upon deletion. Afterward, leaf nodes of 𝐓𝐓\mathbf{T}bold_T might have different heights, contradicting the definition of coding trees. Thus, we adopt the operation in line 11 to align leaf nodes. Algorithm 2 always returns a coding tree 𝐓𝐓\mathbf{T}bold_T with height 2. Algorithm 1 will iteratively invoke Algorithm 2 until the height of T𝑇Titalic_T reaches K𝐾Kitalic_K. More precisely, each iteration within the while𝑤𝑖𝑙𝑒whileitalic_w italic_h italic_i italic_l italic_e loop will increment the height of T𝑇Titalic_T by 1 by calling Algorithm 2 on the root node vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT or on all nodes in VT1superscriptsubscript𝑉𝑇1V_{T}^{1}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT, depending on the reduction in structural entropyIn the implementation, we apply pruning strategies to improve the efficiency, but they are omitted here for clarity.. More details about the proposed algorithms can be found in Appendix  B.

Algorithm 1 Greedy Coding Tree Construction

Input: A graph G=(VG,EG)𝐺subscript𝑉𝐺subscript𝐸𝐺G=(V_{G},E_{G})italic_G = ( italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ) and a positive integer K𝐾Kitalic_K.
Output: Coding tree T=(VT,ET)𝑇subscript𝑉𝑇subscript𝐸𝑇T=(V_{T},E_{T})italic_T = ( italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) of the graph G𝐺Gitalic_G with height K𝐾Kitalic_K.

1:  VT1:={vrT},VT0:=C(vrT):=VGformulae-sequenceassignsuperscriptsubscript𝑉𝑇1superscriptsubscript𝑣𝑟𝑇assignsuperscriptsubscript𝑉𝑇0𝐶superscriptsubscript𝑣𝑟𝑇assignsubscript𝑉𝐺V_{T}^{1}:=\{v_{r}^{T}\},V_{T}^{0}:=C(v_{r}^{T}):=V_{G}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT := { italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT } , italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT := italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) := italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT;
2:  T=Algo.2(𝐯:=vrT)formulae-sequence𝑇𝐴𝑙𝑔𝑜2assign𝐯superscriptsubscript𝑣𝑟𝑇T=Algo.~{}\ref{alg:2}(\mathbf{v}:=v_{r}^{T})italic_T = italic_A italic_l italic_g italic_o . ( bold_v := italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT );
3:  while T.height<Kformulae-sequence𝑇𝑒𝑖𝑔𝑡𝐾T.height<Kitalic_T . italic_h italic_e italic_i italic_g italic_h italic_t < italic_K do
4:     T1=T.merge(Algo.2(𝐯:=vrT))T_{1}=T.merge(Algo.~{}\ref{alg:2}(\mathbf{v}:=v_{r}^{T}))italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_T . italic_m italic_e italic_r italic_g italic_e ( italic_A italic_l italic_g italic_o . ( bold_v := italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) );
5:     T2=T.merge({𝐓=Algo.2(𝐯:=v^)|v^VT1})T_{2}=T.merge(\{\mathbf{T}=Algo.~{}\ref{alg:2}(\mathbf{v}:=\hat{v})|\forall% \hat{v}\in V_{T}^{1}\})italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_T . italic_m italic_e italic_r italic_g italic_e ( { bold_T = italic_A italic_l italic_g italic_o . ( bold_v := over^ start_ARG italic_v end_ARG ) | ∀ over^ start_ARG italic_v end_ARG ∈ italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT } );
6:     T=(HT1(G)<HT2(G))?T1:T2:𝑇superscript𝐻subscript𝑇1𝐺superscript𝐻subscript𝑇2𝐺?subscript𝑇1subscript𝑇2T=(H^{T_{1}}(G)<H^{T_{2}}(G))\ ?\ T_{1}:T_{2}italic_T = ( italic_H start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_G ) < italic_H start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_G ) ) ? italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT : italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT;
7:  end while
8:  return T𝑇Titalic_T;
Algorithm 2 2-level sub-coding tree construction.

Input: A node 𝐯𝐯\mathbf{v}bold_v.
Output: Coding tree 𝐓=(V𝐓,E𝐓)𝐓subscript𝑉𝐓subscript𝐸𝐓\mathbf{T}=(V_{\mathbf{T}},E_{\mathbf{T}})bold_T = ( italic_V start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT ) with height 2.

1:  vr𝐓:=𝐯,V𝐓0:=C(𝐯)formulae-sequenceassignsuperscriptsubscript𝑣𝑟𝐓𝐯assignsuperscriptsubscript𝑉𝐓0𝐶𝐯v_{r}^{\mathbf{T}}:=\mathbf{v},V_{\mathbf{T}}^{0}:=C(\mathbf{v})italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT := bold_v , italic_V start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT := italic_C ( bold_v );
2:  vV𝐓0,v.parent:=vr𝐓,C(vr𝐓):=vC(vr𝐓)formulae-sequencefor-all𝑣superscriptsubscript𝑉𝐓0𝑣formulae-sequenceassign𝑝𝑎𝑟𝑒𝑛𝑡superscriptsubscript𝑣𝑟𝐓assign𝐶superscriptsubscript𝑣𝑟𝐓𝑣𝐶superscriptsubscript𝑣𝑟𝐓\forall v\in V_{\mathbf{T}}^{0},v.parent:=v_{r}^{\mathbf{T}},C(v_{r}^{\mathbf{% T}}):={v}\cup C(v_{r}^{\mathbf{T}})∀ italic_v ∈ italic_V start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_v . italic_p italic_a italic_r italic_e italic_n italic_t := italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT , italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT ) := italic_v ∪ italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT )
3:  while |C(vr𝐓)|>2𝐶superscriptsubscript𝑣𝑟𝐓2|C(v_{r}^{\mathbf{T}})|>2| italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT ) | > 2 do
4:     (vα,vβ)=argmax(v,v){H𝐓(G)H𝐓.compress(v,v)(G)}subscript𝑣𝛼subscript𝑣𝛽𝑣superscript𝑣𝑎𝑟𝑔𝑚𝑎𝑥superscript𝐻𝐓𝐺superscript𝐻formulae-sequence𝐓𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠𝑣superscript𝑣𝐺(v_{\alpha},v_{\beta})=\underset{(v,v^{\prime})}{argmax}\{H^{\mathbf{T}}(G)-H^% {{\mathbf{T}}.compress(v,v^{\prime})}(G)\}( italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ) = start_UNDERACCENT ( italic_v , italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_UNDERACCENT start_ARG italic_a italic_r italic_g italic_m italic_a italic_x end_ARG { italic_H start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT ( italic_G ) - italic_H start_POSTSUPERSCRIPT bold_T . italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( italic_v , italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT ( italic_G ) }
5:     𝐓.compress(vα,vβ)formulae-sequence𝐓𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠subscript𝑣𝛼subscript𝑣𝛽\mathbf{T}.compress(v_{\alpha},v_{\beta})bold_T . italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT )
6:  end while
7:  while 𝐓.height>2formulae-sequence𝐓𝑒𝑖𝑔𝑡2\mathbf{T}.height>2bold_T . italic_h italic_e italic_i italic_g italic_h italic_t > 2 do
8:     vi=argmin𝑣{H𝐓.remove(v)(G)H𝐓(G)}subscript𝑣𝑖𝑣𝑎𝑟𝑔𝑚𝑖𝑛superscript𝐻formulae-sequence𝐓𝑟𝑒𝑚𝑜𝑣𝑒𝑣𝐺superscript𝐻𝐓𝐺v_{i}=\underset{v}{argmin}\{H^{\mathbf{T}.remove(v)}(G)-H^{\mathbf{T}}(G)\}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = underitalic_v start_ARG italic_a italic_r italic_g italic_m italic_i italic_n end_ARG { italic_H start_POSTSUPERSCRIPT bold_T . italic_r italic_e italic_m italic_o italic_v italic_e ( italic_v ) end_POSTSUPERSCRIPT ( italic_G ) - italic_H start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT ( italic_G ) }
9:     𝐓.remove(vi)formulae-sequence𝐓𝑟𝑒𝑚𝑜𝑣𝑒subscript𝑣𝑖\mathbf{T}.remove(v_{i})bold_T . italic_r italic_e italic_m italic_o italic_v italic_e ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
10:  end while
11:  𝐓.align()formulae-sequence𝐓𝑎𝑙𝑖𝑔𝑛\mathbf{T}.align()bold_T . italic_a italic_l italic_i italic_g italic_n ( )
12:  return 𝐓𝐓\mathbf{T}bold_T

Hierarchical Representation Learning.

After calling Algorithm 1(GL=(VG,EG),Ksubscript𝐺𝐿subscript𝑉𝐺subscript𝐸𝐺𝐾G_{L}=(V_{G},E_{G}),Kitalic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = ( italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ) , italic_K), we get a coding tree TL=(VTL,ETL)subscript𝑇𝐿subscript𝑉subscript𝑇𝐿subscript𝐸subscript𝑇𝐿T_{L}=(V_{T_{L}},E_{T_{L}})italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = ( italic_V start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) of label hierarchy GLsubscript𝐺𝐿G_{L}italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT with TL.height=Kformulae-sequencesubscript𝑇𝐿𝑒𝑖𝑔𝑡𝐾T_{L}.height=Kitalic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT . italic_h italic_e italic_i italic_g italic_h italic_t = italic_K . For representation learning, reformulate the label hierarchy and its coding tree as triplets: GL=(VGL,EGL,XGL)subscript𝐺𝐿subscript𝑉subscript𝐺𝐿subscript𝐸subscript𝐺𝐿subscript𝑋subscript𝐺𝐿G_{L}=(V_{G_{L}},E_{G_{L}},X_{G_{L}})italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = ( italic_V start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT ), TL=(VTL,ETL,XTL)subscript𝑇𝐿subscript𝑉subscript𝑇𝐿subscript𝐸subscript𝑇𝐿subscript𝑋subscript𝑇𝐿T_{L}=(V_{T_{L}},E_{T_{L}},X_{T_{L}})italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = ( italic_V start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) where XGL𝕐×dVsubscript𝑋subscript𝐺𝐿superscript𝕐subscript𝑑𝑉X_{G_{L}}\in\mathbb{R}^{\mathbb{Y}\times d_{V}}italic_X start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT blackboard_Y × italic_d start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is derived from document embedding hDsubscript𝐷h_{D}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT via two-dimensional projector, formally, XGL=ϕproj(hD)subscript𝑋subscript𝐺𝐿subscriptitalic-ϕ𝑝𝑟𝑜𝑗subscript𝐷X_{G_{L}}=\phi_{proj}(h_{D})italic_X start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_ϕ start_POSTSUBSCRIPT italic_p italic_r italic_o italic_j end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ). ϕproj()subscriptitalic-ϕ𝑝𝑟𝑜𝑗\phi_{proj}(\cdot)italic_ϕ start_POSTSUBSCRIPT italic_p italic_r italic_o italic_j end_POSTSUBSCRIPT ( ⋅ ) consists of a (𝕐×1)𝕐1(\mathbb{Y}\times 1)( blackboard_Y × 1 ) and a (dB×dV)subscript𝑑𝐵subscript𝑑𝑉(d_{B}\times d_{V})( italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ) feed-forward network, where dVsubscript𝑑𝑉d_{V}italic_d start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT is the hidden size for vertices. Meanwhile, XTL={XTL0,XTL1,,XTLK}subscript𝑋subscript𝑇𝐿superscriptsubscript𝑋subscript𝑇𝐿0superscriptsubscript𝑋subscript𝑇𝐿1superscriptsubscript𝑋subscript𝑇𝐿𝐾X_{T_{L}}=\{X_{T_{L}}^{0},X_{T_{L}}^{1},\dots,X_{T_{L}}^{K}\}italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT = { italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT } represents the node embeddings of VTLisuperscriptsubscript𝑉subscript𝑇𝐿𝑖V_{T_{L}}^{i}italic_V start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, i[0,K]𝑖0𝐾i\in[0,K]italic_i ∈ [ 0 , italic_K ].

In Algorithm 1, the leaf nodes VTL0superscriptsubscript𝑉subscript𝑇𝐿0V_{T_{L}}^{0}italic_V start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT are initialized with VGLsubscript𝑉subscript𝐺𝐿V_{G_{L}}italic_V start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT, thus XTL0:=XGLassignsuperscriptsubscript𝑋subscript𝑇𝐿0subscript𝑋subscript𝐺𝐿X_{T_{L}}^{0}:=X_{G_{L}}italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT := italic_X start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT. Furthermore, {VTLk|k[1,K]}conditional-setsuperscriptsubscript𝑉subscript𝑇𝐿𝑘𝑘1𝐾\{V_{T_{L}}^{k}|k\in[1,K]\}{ italic_V start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | italic_k ∈ [ 1 , italic_K ] } is given by Algorithm 1 while their node embeddings {XTLk|k[1,K]}conditional-setsuperscriptsubscript𝑋subscript𝑇𝐿𝑘𝑘1𝐾\{X_{T_{L}}^{k}|k\in[1,K]\}{ italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | italic_k ∈ [ 1 , italic_K ] } need to be fetched. Based on the structure of coding trees, we design a hierarchical representation learning module. For xvkXTLksuperscriptsubscript𝑥𝑣𝑘superscriptsubscript𝑋subscript𝑇𝐿𝑘x_{v}^{k}\in X_{T_{L}}^{k}italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT in the k𝑘kitalic_k-th layer,

xvk=ϕFFNk(nC(v)xnk1),superscriptsubscript𝑥𝑣𝑘superscriptsubscriptitalic-ϕ𝐹𝐹𝑁𝑘subscript𝑛𝐶𝑣superscriptsubscript𝑥𝑛𝑘1x_{v}^{k}=\phi_{FFN}^{k}(\sum\nolimits_{n\in C(v)}x_{n}^{k-1}),italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT = italic_ϕ start_POSTSUBSCRIPT italic_F italic_F italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_n ∈ italic_C ( italic_v ) end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ) , (4)

where vVTk𝑣superscriptsubscript𝑉𝑇𝑘v\in V_{T}^{k}italic_v ∈ italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT, xvkdVsuperscriptsubscript𝑥𝑣𝑘superscriptsubscript𝑑𝑉x_{v}^{k}\in\mathbb{R}^{d_{V}}italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the feature vector of node v𝑣vitalic_v, and C(v)𝐶𝑣C(v)italic_C ( italic_v ) represents the child nodes of v𝑣vitalic_v in coding tree TLsubscript𝑇𝐿T_{L}italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT. ϕFFNi()superscriptsubscriptitalic-ϕ𝐹𝐹𝑁𝑖\phi_{FFN}^{i}(\cdot)italic_ϕ start_POSTSUBSCRIPT italic_F italic_F italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( ⋅ ) denotes a feed-forward network. The information from leaf nodes propagates layer by layer until it reaches the root node. Finally, to capture the multi-granular information provided by nodes at different levels, we utilize Equation 5 to integrate information from each layer of TLsubscript𝑇𝐿T_{L}italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT:

hT=k=1Kη({xvk|vVTLk}),subscript𝑇superscriptsubscriptsquare-union𝑘1𝐾𝜂conditional-setsuperscriptsubscript𝑥𝑣𝑘𝑣superscriptsubscript𝑉subscript𝑇𝐿𝑘\begin{split}h_{T}=\bigsqcup\limits_{k=1}^{K}{\eta(\{x_{v}^{k}|v\in V_{T_{L}}^% {k}\})},\end{split}start_ROW start_CELL italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = ⨆ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_η ( { italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | italic_v ∈ italic_V start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT } ) , end_CELL end_ROW (5)

where ()square-union\bigsqcup(\cdot)⨆ ( ⋅ ) indicates the concatenation operation. η()𝜂\eta(\cdot)italic_η ( ⋅ ) could be a feature-wise summation or averaging function. hTdTsubscript𝑇superscriptsubscript𝑑𝑇h_{T}\in\mathbb{R}^{d_{T}}italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the final output of the structure encoder.

3.4 Contrastive Learning Module

Positive Pairs and Contrastive Loss.

We expect the text encoder and the structure encoder to learn from each other. Thus, the document embedding hDsubscript𝐷h_{D}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT and structural embedding hTsubscript𝑇h_{T}italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT of the same sample form the positive pair in our model. Considering hDsubscript𝐷h_{D}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT and hTsubscript𝑇h_{T}italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT might be in different distributions, we first project them into an embedding space via independent projectors, formally,

h=WD2ReLU(WD1hD+bD)superscriptsubscript𝑊𝐷2𝑅𝑒𝐿𝑈superscriptsubscript𝑊𝐷1subscript𝐷subscript𝑏𝐷\displaystyle h=W_{D}^{2}ReLU(W_{D}^{1}\cdot h_{D}+b_{D})italic_h = italic_W start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_R italic_e italic_L italic_U ( italic_W start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ⋅ italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) (6)
h^=WT2ReLU(WT1hT+bT),^superscriptsubscript𝑊𝑇2𝑅𝑒𝐿𝑈superscriptsubscript𝑊𝑇1subscript𝑇subscript𝑏𝑇\displaystyle\hat{h}=W_{T}^{2}ReLU(W_{T}^{1}\cdot h_{T}+b_{T}),over^ start_ARG italic_h end_ARG = italic_W start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_R italic_e italic_L italic_U ( italic_W start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ⋅ italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) , (7)

where hhitalic_h and h^^\hat{h}over^ start_ARG italic_h end_ARG are the projected vectors of hDsubscript𝐷h_{D}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT and hTsubscript𝑇h_{T}italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT. WD1superscriptsubscript𝑊𝐷1W_{D}^{1}italic_W start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT, WD2superscriptsubscript𝑊𝐷2W_{D}^{2}italic_W start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, bDsubscript𝑏𝐷b_{D}italic_b start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT, WT1superscriptsubscript𝑊𝑇1W_{T}^{1}italic_W start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT, WT2superscriptsubscript𝑊𝑇2W_{T}^{2}italic_W start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, and bTsubscript𝑏𝑇b_{T}italic_b start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT are weights of projectors.

Next, we utilize NT-Xent loss (Chen et al., 2020b) to achieve contrastive learning. Let hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and hi^^subscript𝑖\hat{h_{i}}over^ start_ARG italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG denote the projected positive pair of the i𝑖iitalic_i-th document in a batch, the contrastive learning loss Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT is formulated as:

Lclr=logeε(hi,hi^)/τj=1,ji|𝐁|eε(hj,hj^)/τ,subscript𝐿𝑐𝑙𝑟𝑙𝑜𝑔superscript𝑒𝜀subscript𝑖^subscript𝑖𝜏superscriptsubscriptformulae-sequence𝑗1𝑗𝑖𝐁superscript𝑒𝜀subscript𝑗^subscript𝑗𝜏L_{clr}=-log\frac{e^{\varepsilon(h_{i},\hat{h_{i}})}/\tau}{\sum_{j=1,j\neq i}^% {|\mathbf{B}|}e^{\varepsilon(h_{j},\hat{h_{j}})/\tau}},italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT = - italic_l italic_o italic_g divide start_ARG italic_e start_POSTSUPERSCRIPT italic_ε ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) end_POSTSUPERSCRIPT / italic_τ end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 , italic_j ≠ italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_B | end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT italic_ε ( italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , over^ start_ARG italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG ) / italic_τ end_POSTSUPERSCRIPT end_ARG , (8)

where |𝐁|𝐁|\mathbf{B}|| bold_B | denotes the batch size, τ𝜏\tauitalic_τ is the temperature parameter, and ε(h,h^)𝜀^\varepsilon(h,\hat{h})italic_ε ( italic_h , over^ start_ARG italic_h end_ARG ) is a distance metric implemented by cosine similarity hh^hh^top^normnorm^\frac{h\top\hat{h}}{||h||\cdot||\hat{h}||}divide start_ARG italic_h ⊤ over^ start_ARG italic_h end_ARG end_ARG start_ARG | | italic_h | | ⋅ | | over^ start_ARG italic_h end_ARG | | end_ARG.

Information Lossless Contrastive Learning for HTC.

Information lossless learning dose not imply retaining any input data. It is preserving the minimal but sufficient information, i.e., the mutual information required by the downstream task. This is how we define information lossless learning for hierarchical text classification:

Definition 1

In HTC, the mutual information between inputs and targets can be written as:

I((𝒢𝒟);Y),𝐼subscript𝒢𝒟𝑌I((\mathcal{G_{L}}\circ\mathcal{D});Y),italic_I ( ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) , (9)

where 𝒢𝔾,𝒟𝔻formulae-sequencesubscript𝒢𝔾𝒟𝔻\mathcal{G_{L}}\in\mathbb{G},\mathcal{D}\in\mathbb{D}caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∈ blackboard_G , caligraphic_D ∈ blackboard_D are random variables for the label hierarchy GLsubscript𝐺𝐿G_{L}italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT and the document D𝐷Ditalic_D. I(X1;X2)𝐼subscript𝑋1subscript𝑋2I(X_{1};X_{2})italic_I ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) denotes the mutual information between random variables X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. \circ indicates any input combination of 𝒢subscript𝒢\mathcal{G_{L}}caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT and 𝒟𝒟\mathcal{D}caligraphic_D

Definition 2

Given a function 𝔽𝕋×𝔽𝔾subscript𝔽𝕋subscript𝔽𝔾\mathcal{F}\subseteq\mathbb{F_{T}}\times\mathbb{F_{G}}caligraphic_F ⊆ blackboard_F start_POSTSUBSCRIPT blackboard_T end_POSTSUBSCRIPT × blackboard_F start_POSTSUBSCRIPT blackboard_G end_POSTSUBSCRIPT, which is an arbitrary fusion of a text and a structure encoder. Define the optimal function superscript\mathcal{F}^{*}caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT if and only if superscript\mathcal{F}^{*}caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT satisfies:

=argmax𝔽𝕋×𝔽𝔾I((𝒢𝒟);(𝒢𝒟)).superscriptsubscript𝔽𝕋subscript𝔽𝔾𝑎𝑟𝑔𝑚𝑎𝑥𝐼subscript𝒢𝒟subscript𝒢𝒟\mathcal{F}^{*}=\underset{\mathcal{F}\subseteq\mathbb{F_{T}}\times\mathbb{F_{G% }}}{argmax}I(\mathcal{F}(\mathcal{G_{L}}\circ\mathcal{D});(\mathcal{G_{L}}% \circ\mathcal{D})).caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_UNDERACCENT caligraphic_F ⊆ blackboard_F start_POSTSUBSCRIPT blackboard_T end_POSTSUBSCRIPT × blackboard_F start_POSTSUBSCRIPT blackboard_G end_POSTSUBSCRIPT end_UNDERACCENT start_ARG italic_a italic_r italic_g italic_m italic_a italic_x end_ARG italic_I ( caligraphic_F ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ) . (10)

That is, superscript\mathcal{F}^{*}caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTretains the most mutual information between the input random variables and the encoded random variables. Apparently, superscript\mathcal{F}^{*}caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is a deterministic map** as the embedding (GLD)superscriptsubscript𝐺𝐿𝐷\mathcal{F}^{*}(G_{L}\circ D)caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ∘ italic_D ) is fixed for downstream prediction when given (GLD)subscript𝐺𝐿𝐷(G_{L}\circ D)( italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ∘ italic_D ). Thus, for any random variable ξ𝜉\xiitalic_ξ,

I((𝒢𝒟);ξ)=I((𝒢𝒟);ξ).𝐼superscriptsubscript𝒢𝒟𝜉𝐼subscript𝒢𝒟𝜉\displaystyle I(\mathcal{F}^{*}(\mathcal{G_{L}}\circ\mathcal{D});\xi)=I((% \mathcal{G_{L}}\circ\mathcal{D});\xi).italic_I ( caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_ξ ) = italic_I ( ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_ξ ) . (11)

When ξ=Y𝜉𝑌\xi=Yitalic_ξ = italic_Y, we could have,

I((𝒢𝒟);Y)=I((𝒢𝒟);Y).𝐼superscriptsubscript𝒢𝒟𝑌𝐼subscript𝒢𝒟𝑌\displaystyle I(\mathcal{F}^{*}(\mathcal{G_{L}}\circ\mathcal{D});Y)=I((% \mathcal{G_{L}}\circ\mathcal{D});Y).italic_I ( caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) = italic_I ( ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) . (12)
Theorem 1

Given a document D𝐷Ditalic_D and the coding tree TLsubscript𝑇𝐿T_{L}italic_T start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT of the label hierarchy GLsubscript𝐺𝐿G_{L}italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT. Denote their random variable as 𝒟𝒟\mathcal{D}caligraphic_D, 𝒯subscript𝒯\mathcal{T_{L}}caligraphic_T start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT, and 𝒢subscript𝒢\mathcal{G_{L}}caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT. For any augmentation function θ𝜃\thetaitalic_θ, we have,

I((𝒯𝒟);Y)I(θ(𝒢,𝒟);Y).𝐼subscript𝒯𝒟𝑌𝐼𝜃subscript𝒢𝒟𝑌I((\mathcal{T_{L}}\circ\mathcal{D});Y)\geq I(\theta(\mathcal{G_{L}},\mathcal{D% });Y).italic_I ( ( caligraphic_T start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) ≥ italic_I ( italic_θ ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT , caligraphic_D ) ; italic_Y ) . (13)

The proof for Theorem 1 is given in Appendix C.

Supervised Contrastive Learning.

Following the training strategy of previous contrastive learning methods for HTC, we train HILL with classification loss and contrastive loss simultaneously. After executing the structure encoder, the document embedding hDsubscript𝐷h_{D}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT and its positive view hTsubscript𝑇h_{T}italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT are concatenated and fed into the classifier along with the sigmoid function:

P=Sigmoid([hD;hT]Wc+bc),𝑃𝑆𝑖𝑔𝑚𝑜𝑖𝑑subscript𝐷subscript𝑇subscript𝑊𝑐subscript𝑏𝑐P=Sigmoid([h_{D};h_{T}]\cdot W_{c}+b_{c}),italic_P = italic_S italic_i italic_g italic_m italic_o italic_i italic_d ( [ italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ; italic_h start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ] ⋅ italic_W start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) , (14)

where Wc(dB+dT)×|𝕐|subscript𝑊𝑐superscriptsubscript𝑑𝐵subscript𝑑𝑇𝕐W_{c}\in\mathbb{R}^{(d_{B}+d_{T})\times|\mathbb{Y}|}italic_W start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT + italic_d start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) × | blackboard_Y | end_POSTSUPERSCRIPT and bc|𝕐|subscript𝑏𝑐superscript𝕐b_{c}\in\mathbb{R}^{|\mathbb{Y}|}italic_b start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT | blackboard_Y | end_POSTSUPERSCRIPT are weights and bias of the classifier while |𝕐|𝕐|\mathbb{Y}|| blackboard_Y | is the volume of the label set. For multi-label classification, we adopt the Binary Cross-Entropy Loss as the classification loss:

LC=1|𝕐|j|𝕐|yjlog(pj)+(1yj)log(1pj),subscript𝐿𝐶1𝕐superscriptsubscript𝑗𝕐subscript𝑦𝑗𝑙𝑜𝑔subscript𝑝𝑗1subscript𝑦𝑗𝑙𝑜𝑔1subscript𝑝𝑗{L_{C}=-\frac{1}{|\mathbb{Y}|}\sum_{j}^{|\mathbb{Y}|}y_{j}log(p_{j})+(1-y_{j})% log(1-p_{j}),}italic_L start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT = - divide start_ARG 1 end_ARG start_ARG | blackboard_Y | end_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | blackboard_Y | end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_l italic_o italic_g ( italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) + ( 1 - italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_l italic_o italic_g ( 1 - italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , (15)

where yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the ground truth of the j𝑗jitalic_j-th label while pjsubscript𝑝𝑗p_{j}italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the j𝑗jitalic_j-th element of P𝑃Pitalic_P.

Overall, the final loss function can be formulated as:

L=LC+λclrLclr.𝐿subscript𝐿𝐶subscript𝜆𝑐𝑙𝑟subscript𝐿𝑐𝑙𝑟L=L_{C}+\lambda_{clr}\cdot L_{clr}.italic_L = italic_L start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT ⋅ italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT . (16)

where λclrsubscript𝜆𝑐𝑙𝑟\lambda_{clr}italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT is the weight of Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT.

Dataset |𝕐|𝕐|\mathbb{Y}|| blackboard_Y | Avg(|𝒴)|)Avg(|\mathcal{Y})|)italic_A italic_v italic_g ( | caligraphic_Y ) | ) Depth # Train # Dev # Test
WOS 141 2.0 2 30,070 7,518 9,397
RCV1-v2 103 3.24 4 20,833 2,316 781,265
NYTimes 166 7.6 8 23,345 5,834 7,292
Table 1: Summary statistics of the three datasets.
Model WOS RCV1-v2 NYTimes Average
Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1
Supervised Learning Models
TextRCNN (Zhou et al., 2020) 83.55 76.99 81.57 59.25 70.83 56.18 78.65 64.14
HiAGM (Zhou et al., 2020) 85.82 80.28 83.96 63.35 74.97 60.83 81.58 68.15
HTCInfoMax (Deng et al., 2021) 85.58 80.05 83.51 62.71 - - - -
HiMatch (Chen et al., 2021) 86.20 80.53 84.73 64.11 - - - -
Supervised Learning Models (BERT-based)
BERT \dagger 85.63 79.07 85.65 67.02 78.24 65.62 83.17 70.57
BERT (Chen et al., 2021) 86.26 80.58 86.26 67.35 - - - -
BERT+HiAGM \dagger 86.04 80.19 85.58 67.93 78.64 66.76 83.42 71.67
BERT+HTCInfoMax \dagger 86.30 79.97 85.53 67.09 78.75 67.31 83.53 71.46
BERT+HiMatch \dagger 86.70 81.06 86.33 68.66 - - - -
Contrastive Learning Models
HGCLR (Wang et al., 2022a) 87.11 81.20 86.49 68.31 78.86 67.96 84.15 72.49
HILL(Ours) 87.28 81.77 87.31 70.12 80.47 69.96 85.02 73.95
Table 2: Experimental results of our proposed model on three datasets and their average performance. The supervised learning models (in the upper part) originally take TextRCNN (Lai et al., 2015) as the text encoder. For fairness, we compared with their BERT-based versions implemented by Wang et al. (2022a) (in the middle part). The best results are marked in bold. “-” means not reported or not applicable.

4 Experiment

4.1 Experiment Setup

Datasets and Evaluation Metrics.

Experiments are conducted on three popular datasets in HTC. RCV1-v2 (Lewis et al., 2004) and NYTimes (Sandhaus, Evan, 2008) consists of news articles, while WOS (Kowsari et al., 2017) includes abstracts of academic papers. Each of these datasets is annotated with ground-truth labels existing in a predefined hierarchy. We split and preprocess these datasets following (Wang et al., 2022a). The statistics of these datasets are shown in Table 1. The experimental results are measured with Micro-F1 and Macro-F1 (Gopal and Yang, 2013). Micro-F1 is the harmonic mean of the overall precision and recall of all the test instances, while Macro-F1 is the average F1-score of each category. Thus, Micro-F1 reflects the performance on more frequent labels, while Macro-F1 treats labels equally.

Implementation Details.

For the text encoder, we use the BertModel of bert-base-uncased and update its parameters with Adam (Kingma and Ba, 2015) as the initial learning rate is 3e-5. For all three datasets, the hidden sizes dB,dV,dTsubscript𝑑𝐵subscript𝑑𝑉subscript𝑑𝑇d_{B},d_{V},d_{T}italic_d start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT are all set to 768. ϕFFNksuperscriptsubscriptitalic-ϕ𝐹𝐹𝑁𝑘\phi_{FFN}^{k}italic_ϕ start_POSTSUBSCRIPT italic_F italic_F italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT in Equation 4 is implemented by K𝐾Kitalic_K independent multi-layer perceptions which consists of 2-layer linear transformations and non-linear functions. The batch sizes are set to 24 for WOS and NYTimes while 16 for RCV1-v2. The η()𝜂\eta(\cdot)italic_η ( ⋅ ) in Equation 5 is implemented by a summation for WOS and NYTimes while averaging for RCV1-v2. The learning rate of the structure encoder is set to 1e-3 for WOS, 1e-4 for RCV1 and NYTimes. The weight of contrastive loss λclrsubscript𝜆𝑐𝑙𝑟\lambda_{clr}italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT is respectively set to 0.001, 0.1, 0.3 for WOS, RCV1, and NYTimes. The optimal height K𝐾Kitalic_K of coding trees goes to 3, 2, and 3.

Baselines.

We compare HILL with self-supervised only models including HiAGM(Zhou et al., 2020), HTCInfoMax (Deng et al., 2021), HiMatch (Chen et al., 2021) and their BERT-based version and a contrastive learning model HGCLR (Wang et al., 2022a). HiAGM, HTCInfoMax, and HiMatch use different fusion strategies to model text-hierarchy correlations. Specifically, HiAGM proposes a multi-label attention (HiAGM-LA) and a text feature propagation technique (HiAGM-TP) to get hierarchy-aware representations. HTCInfoMax enhances HiAGM-LA with information maximization to model the interaction between text and hierarchy. HiMatch treats HTC as a matching problem by map** text and labels into a joint embedding space. HGCLR makes BERT learn from the structure encoder with a controlled document augmentation.

4.2 Results and Analysis

The main results are presented in Table 2. In the supervised-only models, HTCInfoMax enhances HiAGM through the maximization of mutual information principles. HiMatch treats HTC as a matching problem, and the reported results stand out as the best among these models. The replacement of TextRCNN with BERT has minimal impact on the relative ranking of their outcomes. This implies that the text encoder primarily influences the overall effectiveness of their models, while their specific merits are determined by their efforts beyond the text encoder. To some extent, it also indicates that their text encoder and structure encoder operate independently.

HGCLR is the inaugural contrastive learning approach for HTC. Despite the involvement of the structure encoder, the positive sample construction in HGCLR still relies on data augmentation. Conversely, our model effectively utilizes the syntactic information extracted by the structure encoder, enabling cooperation between the text encoder and the structure encoder. Specifically, the proposed HILL surpasses all supervised learning models and the contrastive learning model across all three datasets. Our model demonstrates average improvements of 1.85% and 3.38% on Micro-F1 and Macro-F1 compared to vanilla BERT. In comparison with HGCLR, our model achieves nearly a 2% performance boost on both RCV1-v2 and NYTimes. Additionally, the improvement in WOS is notable, though slightly less than that observed in the other two datasets. A likely reason is that BERT is pretrained on news corpus, and the domain knowledge acquired may differ from that of paper abstracts. However, this could also indicate effective collaboration between the text encoder and the structural encoder in our framework. As the text encoder learns robust features, the structural encoder becomes increasingly powerful, and vice versa.

Ablation Models RCV1-v2
Micro-F1 Macro-F1
HILL 87.31 70.12
r.p. GIN (Xu et al., 2019) 86.48 69.30
r.p. GAT (Velickovic et al., 2018) 86.51 68.12
r.p. GCN (Kipf and Welling, 2017) 86.24 68.71
r.m. Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT 86.51 68.60
r.m. Algorithm 1 86.67 67.92
Table 3: Performance when replacing or removing some components of HILL on the test set of RCV1-v2. r.p. stands for the replacement and r.m. stands for remove. The results of r.m. Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT and r.m. Algorithm 1 are both obtained from 5 runs under different random seeds, each of which is distinguished from HILL’s at a significant level of 95% under a one-sample t-test.
Refer to caption
(a) WOS
Refer to caption
(b) RCV1-v2
Refer to caption
(c) NYTimes
Figure 3: Test performance of HILL with different height K𝐾Kitalic_K of the coding tree on three datasets.

4.3 Ablation Studies

The necessity of proposed methods.

We conduct ablation studies by removing one component of our model at a time while kee** other conditions consistent whenever possible. The results of the ablation studies are presented in Table 3. To demonstrate the effectiveness of proposed hierarchical representation learning, we replaced the structure encoder with three commonly used graph neural networks including GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018), and GIN (Xu et al., 2019). All of them are fed with the label hierarchy GLsubscript𝐺𝐿G_{L}italic_G start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT and the document embedding hDsubscript𝐷h_{D}italic_h start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT to initialize node embeddings. For GCN and GAT, the number of layers is set to 2 while other parameters are the default settings in PyTorch-Geometric (Fey and Lenssen, 2019). Regarding GIN, the combine function is a 2-layer multi-layer perception with ϵ=0italic-ϵ0\epsilon=0italic_ϵ = 0, and the iteration is set to 3. We find that the hierarchical learning module outperforms all graph encoders on RCV1-v2, which empirically proves that syntactic information extraction is successful in HILL. Moreover, the performance of the supervised-only model (r.m. Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT) declines by 0.92% and 2.17%, underscoring the necessity of contrastive learning. Additionally, we directly feed the initial coding tree, i.e., a coding tree within the root node vrsubscript𝑣𝑟v_{r}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT connecting to the leaf nodes, into the structure encoder. The model (r.m. Algorithm 1) exhibits performance decreases of 0.73% and 3.14%, emphasizing the effectiveness of structural entropy minimization. Results and analysis on the other two datasets can be found in Appendix D.

The Height K𝐾Kitalic_K of Coding Trees.

The height of the coding tree affects the performance of HILL. Higher coding trees may involve more explosive gradients. To investigate the impact of K𝐾Kitalic_K, we run HILL with different heights K𝐾Kitalic_K of the coding tree while kee** other settings the same. Figure 3 shows the test performance of different height coding trees on WOS, RCV1-v2, and NYTimes. As K𝐾Kitalic_K grows, the performance of HILL sharply degrades. The optimal K𝐾Kitalic_K seems to be unrelated to the heights of label hierarchies, since the heights of the three datasets are 2, 4, and 8, while the optimal K𝐾Kitalic_K is 3, 2, and 3. On the contrary, the optimal K𝐾Kitalic_K is more likely to positively correlate with the volumes of label set 𝕐𝕐\mathbb{Y}blackboard_Y, 141, 103, and 166.

Refer to caption
(a) Trainable parameters
Refer to caption
(b) Training time per epoch
Figure 4: The number of trainable parameters (M) and the average training time (s) of our model and HGCLR on WOS, RCV1-v2, and NYTimes.

Time-and-Memory-Saving Contrastive Learning.

The structure encoder of HILL is considerably smaller than that of HGCLR, as the kernel of hierarchical representation learning consists of only K𝐾Kitalic_K multi-layer perceptions, while Graphormer (Ying et al., 2021) is built upon multi-head attention. In this comparison, we assess the number of learnable parameters and the training speed of HILL in contrast to HGCLR. Specifically, we set the hidden state sizes of both HILL and HGCLR to 768, with batch sizes set to 16. The count of trainable parameters is determined by the numel()𝑛𝑢𝑚𝑒𝑙numel(\cdot)italic_n italic_u italic_m italic_e italic_l ( ⋅ ) function in PyTorch (Paszke et al., 2019), excluding those related to BERT. As indicated in Figure 4, our model exhibits significantly fewer parameters than HGCLR, averaging 7.34M compared to 19.04M. Additionally, the training speed is evaluated after 20 epochs of trainingBoth of them converge around the 20th epoch.. The average training time for HILL is 789.2s, which is half the time taken by HGCLR (1504.7s). Overall, this analysis suggests that the efficient architecture of HILL contributes to its status as a time- and memory-saving model.

5 Conclusion

In this paper, we design a suite of methods to address the limitations of existing contrastive learning models for HTC. In particular, we propose HILL, in which the syntactic information is sufficiently fused with the semantic information after structural entropy minimization and hierarchical representation learning. Theoretically, we give the definition on information lossless learning for HTC and the information extracted by HILL is proved to be the upper bound of other contrastive learning based methods. Experimental results demonstrate the effectiveness and efficiency of our model against state-of-the-arts.

Limitations

In Table 2, we only provide the results of HILL and HGCLR when using BERT as the text encoder. Due to the focus on designing the structure encoder, we do not report results on a smaller model, for instance, TextRCNN, or a larger language model as the text encoder. On the other hand, we adopt supervised contrastive learning in accordance with the settings of HGCLR. The performance of HILL under contrastive-supervised two-stage training remains to be explored.

References

  • Aly et al. (2019) Rami Aly, Steffen Remus, and Chris Biemann. 2019. Hierarchical multi-label classification of text with capsule networks. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop, pages 323–330. Association for Computational Linguistics.
  • Banerjee et al. (2019) Siddhartha Banerjee, Cem Akkaya, Francisco Perez-Sorrosal, and Kostas Tsioutsiouliklis. 2019. Hierarchical transfer learning for multi-label text classification. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 6295–6300. Association for Computational Linguistics.
  • Bromley et al. (1993) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a siamese time delay neural network. In Advances in Neural Information Processing Systems 6, [7th NIPS Conference, Denver, Colorado, USA, 1993], pages 737–744. Morgan Kaufmann.
  • Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Chen et al. (2020a) Boli Chen, ** **g. 2020a. Hyperbolic interaction model for hierarchical multi-label classification. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7496–7503. AAAI Press.
  • Chen et al. (2021) Haibin Chen, Qianli Ma, Zhenxi Lin, and Jiangyue Yan. 2021. Hierarchy-aware label semantics matching network for hierarchical text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4370–4379. Association for Computational Linguistics.
  • Chen et al. (2020b) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020b. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR.
  • Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. 2006. Elements of information theory (2. ed.). Wiley.
  • Deng et al. (2021) Zhongfen Deng, Hao Peng, Dongxiao He, Jianxin Li, and Philip S. Yu. 2021. Htcinfomax: A global model for hierarchical text classification via information maximization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 3259–3265. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Fey and Lenssen (2019) Matthias Fey and Jan E. Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
  • Gopal and Yang (2013) Siddharth Gopal and Yiming Yang. 2013. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. 2020. Bootstrap your own latent - A new approach to self-supervised learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 9726–9735. Computer Vision Foundation / IEEE.
  • Huang et al. (2019) Wei Huang, Enhong Chen, Qi Liu, Yuying Chen, Zai Huang, Yang Liu, Zhou Zhao, Dandan Zhang, and Shi** Wang. 2019. Hierarchical multi-label text classification: An attention-based recurrent network approach. Proceedings of the 28th ACM International Conference on Information and Knowledge Management.
  • Jaiswal et al. (2020) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. 2020. A survey on contrastive self-supervised learning. ArXiv, abs/2011.00362.
  • Jiang et al. (2022) Ting Jiang, Deqing Wang, Leilei Sun, Zhongzhi Chen, Fuzhen Zhuang, and Qinghong Yang. 2022. Exploiting global and local hierarchies for hierarchical text classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4030–4039. Association for Computational Linguistics.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
  • Kowsari et al. (2017) Kamran Kowsari, Donald E. Brown, Mojtaba Heidarysafa, K. Meimandi, Matthew S. Gerber, and Laura E. Barnes. 2017. Hdltex: Hierarchical deep learning for text classification. 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 364–371.
  • Lai et al. (2015) Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 2267–2273. AAAI Press.
  • Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397.
  • Li and Pan (2016) Angsheng Li and Yicheng Pan. 2016. Structural information and dynamical complexity of networks. IEEE Transactions on Information Theory, 62:3290–3339.
  • Liu et al. (2019) Yiwei Liu, Jiamou Liu, Zijian Zhang, Liehuang Zhu, and Angsheng Li. 2019. REM: from structural entropy to community structure deception. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12918–12928.
  • Mao et al. (2019) Yuning Mao, **g**g Tian, Jiawei Han, and Xiang Ren. 2019. Hierarchical text classification with reinforced label assignment. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 445–455. Association for Computational Linguistics.
  • Meng et al. (2021) Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: correcting and contrasting text sequences for language model pretraining. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 23102–23114.
  • Pan et al. (2022) Lin Pan, Chung-Wei Hang, Avirup Sil, and Saloni Potdar. 2022. Improved text classification via contrastive adversarial training. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 11130–11138. AAAI Press.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
  • Radford and Narasimhan (2018) Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training.
  • Rojas et al. (2020) Kervy Rivas Rojas, Gina Bustamante, Arturo Oncevay, and Marco Antonio Sobrevilla Cabezudo. 2020. Efficient strategies for hierarchical text classification: External knowledge and auxiliary tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 2252–2257. Association for Computational Linguistics.
  • Sandhaus, Evan (2008) Sandhaus, Evan. 2008. The new york times annotated corpus.
  • Shannon (1948) Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423.
  • Shimura et al. (2018) Kazuya Shimura, Jiyi Li, and Fumiyo Fukumoto. 2018. HFT-CNN: learning hierarchical category structure for multi-label short text categorization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 811–816. Association for Computational Linguistics.
  • Tian et al. (2020) Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020. What makes for good views for contrastive learning? In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Wang et al. (2023) Yifei Wang, Yupan Wang, Zeyu Zhang, Song Yang, Kaiqi Zhao, and Jiamou Liu. 2023. USER: unsupervised structural entropy-based robust graph neural network. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 10235–10243. AAAI Press.
  • Wang et al. (2022a) Zihan Wang, Peiyi Wang, Lianzhe Huang, Xin Sun, and Houfeng Wang. 2022a. Incorporating hierarchy into text encoder: a contrastive learning approach for hierarchical text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7109–7119. Association for Computational Linguistics.
  • Wang et al. (2022b) Zihan Wang, Peiyi Wang, Tianyu Liu, Binghuai Lin, Yunbo Cao, Zhifang Sui, and Houfeng Wang. 2022b. HPT: hierarchy-aware prompt tuning for hierarchical text classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 3740–3751. Association for Computational Linguistics.
  • Wu et al. (2019) Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A meta-learning approach for multi-label classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4353–4363. Association for Computational Linguistics.
  • Wu et al. (2023) Junran Wu, Xueyuan Chen, Bowen Shi, Shangzhe Li, and Ke Xu. 2023. Sega: Structural entropy guided anchor view for graph contrastive learning. In International Conference on Machine Learning. PMLR.
  • Wu et al. (2022a) Junran Wu, Xueyuan Chen, Ke Xu, and Shangzhe Li. 2022a. Structural entropy guided graph hierarchical pooling. In International Conference on Machine Learning, pages 24017–24030. PMLR.
  • Wu et al. (2022b) Junran Wu, Shangzhe Li, Jianhao Li, Yicheng Pan, and Ke Xu. 2022b. A simple yet effective method for graph classification. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 3580–3586. ijcai.org.
  • Wu et al. (2020) Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: contrastive learning for sentence representation. CoRR, abs/2012.15466.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Yan et al. (2021) Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. Consert: A contrastive framework for self-supervised sentence representation transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 5065–5075. Association for Computational Linguistics.
  • Yang et al. (2023) Zhenyu Yang, Ge Zhang, Jia Wu, Jian Yang, Quan Z. Sheng, Hao Peng, Angsheng Li, Shan Xue, and Jianlin Su. 2023. Minimum entropy principle guided graph neural networks. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February 2023 - 3 March 2023, pages 114–122. ACM.
  • Ying et al. (2021) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 28877–28888.
  • You et al. (2019) Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. 2019. Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5812–5822.
  • Zhang et al. (2022) Chong Zhang, He Zhu, Xingyu Peng, Junran Wu, and Ke Xu. 2022. Hierarchical information matters: Text classification via tree based graph neural network. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 950–959. International Committee on Computational Linguistics.
  • Zhou et al. (2020) Jie Zhou, Chun** Ma, Dingkun Long, Guangwei Xu, Ning Ding, Haoyu Zhang, Pengjun Xie, and Gongshen Liu. 2020. Hierarchy-aware global model for hierarchical text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 1106–1117. Association for Computational Linguistics.
Refer to caption
Figure 5: An illustration of coding trees and structural entropy. The coding tree T𝑇Titalic_T provides us with multi-granularity partitions of the original graph G𝐺Gitalic_G, as shown by the three partitions in the example. Structural entropy is defined as the average amount of information of a random walk between two nodes in VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT, considering all nodes partitioned (encoded and decoded) by coding tree T𝑇Titalic_T. Under the guidance of structural entropy, coding tree T𝑇Titalic_T could reveal the essential structure of graph G𝐺Gitalic_G.

Appendix A A Brief Introduction to Structural Entropy and Coding Trees

The exhaustive definitions and theorems of coding trees and structural entropy are originated by Li and Pan (2016). Here, we just briefly introduce some key concepts which are crucial to this paper. An illustration of coding trees and structural entropy is shown in Figure 5.

Coding tree.

A coding tree T=(VT,ET)𝑇subscript𝑉𝑇subscript𝐸𝑇T=(V_{T},E_{T})italic_T = ( italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) of graph G=(VG,EG)𝐺subscript𝑉𝐺subscript𝐸𝐺G=(V_{G},E_{G})italic_G = ( italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ) is a tree that satisfies:

  1. I.

    Denote vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT as the unique root node of T𝑇Titalic_T, Vζsubscript𝑉𝜁V_{\zeta}italic_V start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT as the leaf-node set of T𝑇Titalic_T, and T.heightformulae-sequence𝑇𝑒𝑖𝑔𝑡T.heightitalic_T . italic_h italic_e italic_i italic_g italic_h italic_t the height of T𝑇Titalic_T. The node set of T𝑇Titalic_T is VTsubscript𝑉𝑇V_{T}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, which consists of a few subsets, that is, VT={VT0,VT1,,VTT.height}subscript𝑉𝑇superscriptsubscript𝑉𝑇0superscriptsubscript𝑉𝑇1superscriptsubscript𝑉𝑇formulae-sequence𝑇𝑒𝑖𝑔𝑡V_{T}=\{V_{T}^{0},V_{T}^{1},\dots,V_{T}^{T.height}\}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = { italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T . italic_h italic_e italic_i italic_g italic_h italic_t end_POSTSUPERSCRIPT }. VTT.height:=vrT,VT0:=Vζformulae-sequenceassignsuperscriptsubscript𝑉𝑇formulae-sequence𝑇𝑒𝑖𝑔𝑡superscriptsubscript𝑣𝑟𝑇assignsuperscriptsubscript𝑉𝑇0subscript𝑉𝜁V_{T}^{T.height}:={v_{r}^{T}},V_{T}^{0}:=V_{\zeta}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T . italic_h italic_e italic_i italic_g italic_h italic_t end_POSTSUPERSCRIPT := italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT := italic_V start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT.

  2. II.

    Each node vVT𝑣subscript𝑉𝑇v\in V_{T}italic_v ∈ italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is the marker (Li and Pan, 2016) of a subset of VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT. For instance, vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is the marker VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT, while each vζVζsubscript𝑣𝜁subscript𝑉𝜁v_{\zeta}\in V_{\zeta}italic_v start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT ∈ italic_V start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT marks a single node in VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT.

  3. III.

    For any V{VT0,VT1,,VTT.height}𝑉superscriptsubscript𝑉𝑇0superscriptsubscript𝑉𝑇1superscriptsubscript𝑉𝑇formulae-sequence𝑇𝑒𝑖𝑔𝑡V\in\{V_{T}^{0},V_{T}^{1},\dots,V_{T}^{T.height}\}italic_V ∈ { italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T . italic_h italic_e italic_i italic_g italic_h italic_t end_POSTSUPERSCRIPT }. All nodes in V𝑉Vitalic_V has the same height (depth) and any subset of V𝑉Vitalic_V represents a partition of VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT. Specifically, VT0=Vζsuperscriptsubscript𝑉𝑇0subscript𝑉𝜁V_{T}^{0}=V_{\zeta}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT = italic_V start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT is an element-wise partition for VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT as there exists a one-to-one correspondence for nodes in Vζsubscript𝑉𝜁V_{\zeta}italic_V start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT and VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT. While VTT.height=vrTsuperscriptsubscript𝑉𝑇formulae-sequence𝑇𝑒𝑖𝑔𝑡superscriptsubscript𝑣𝑟𝑇V_{T}^{T.height}={v_{r}^{T}}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T . italic_h italic_e italic_i italic_g italic_h italic_t end_POSTSUPERSCRIPT = italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is the overall partition for VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT as vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT marks the integrity of VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT.

Structural Entropy.

As coding trees can be regarded as coding patterns for graphs, the structural entropy of a graph is defined as the average amount of information under a determined coding tree. Specifically, as depicted in Figure 5, when a random walk on graph G𝐺Gitalic_G progresses from node v1subscript𝑣1v_{1}italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT to node v3subscript𝑣3v_{3}italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, a portion of the information is encoded by their parent node v+superscript𝑣v^{+}italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT, and at this point, only the information from v+limit-from𝑣v+italic_v + to v3subscript𝑣3v_{3}italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT remains to be determined. Meanwhile, the conditional entropy H(v3|v1)𝐻conditionalsubscript𝑣3subscript𝑣1H(v_{3}|v_{1})italic_H ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) is then reduced to H(v3|v+)𝐻conditionalsubscript𝑣3superscript𝑣H(v_{3}|v^{+})italic_H ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ). Conditional entropy should be defined with random variables, but we omit them here for simplicity. Thus, we have:

H(v3|v1)=H(v3|v+)=logvol(v3)vol(v+),𝐻conditionalsubscript𝑣3subscript𝑣1𝐻conditionalsubscript𝑣3superscript𝑣𝑙𝑜𝑔𝑣𝑜𝑙subscript𝑣3𝑣𝑜𝑙superscript𝑣H(v_{3}|v_{1})=H(v_{3}|v^{+})=-log\frac{vol\left(v_{3}\right)}{vol\left(v^{+}% \right)},italic_H ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = italic_H ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) = - italic_l italic_o italic_g divide start_ARG italic_v italic_o italic_l ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_v italic_o italic_l ( italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) end_ARG , (17)

where vol(v3)𝑣𝑜𝑙subscript𝑣3vol(v_{3})italic_v italic_o italic_l ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) denotes the degree of v3subscript𝑣3v_{3}italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT while (v+)superscript𝑣(v^{+})( italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) is the total degree of its children, i.e. vol(v+)=vol(v1)+vol(v3)+vol(v5)𝑣𝑜𝑙superscript𝑣𝑣𝑜𝑙subscript𝑣1𝑣𝑜𝑙subscript𝑣3𝑣𝑜𝑙subscript𝑣5vol(v^{+})=vol(v_{1})+vol(v_{3})+vol(v_{5})italic_v italic_o italic_l ( italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) = italic_v italic_o italic_l ( italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + italic_v italic_o italic_l ( italic_v start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) + italic_v italic_o italic_l ( italic_v start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT ). Structural entropy of a graph G𝐺Gitalic_G is defined as the average amount of information required to determine during a random walk between two accessible nodes. According to the derivation procedure provided by Li and Pan (2016), we have:

HT(G)=vTgvvol(G)logvol(v)vol(v+),superscript𝐻𝑇𝐺subscript𝑣𝑇subscript𝑔𝑣𝑣𝑜𝑙𝐺𝑣𝑜𝑙𝑣𝑣𝑜𝑙superscript𝑣H^{T}(G)=-\sum_{v\in T}\frac{g_{v}}{vol(G)}\log{\frac{vol(v)}{vol(v^{+})}},italic_H start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_G ) = - ∑ start_POSTSUBSCRIPT italic_v ∈ italic_T end_POSTSUBSCRIPT divide start_ARG italic_g start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_ARG start_ARG italic_v italic_o italic_l ( italic_G ) end_ARG roman_log divide start_ARG italic_v italic_o italic_l ( italic_v ) end_ARG start_ARG italic_v italic_o italic_l ( italic_v start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) end_ARG , (18)

where gvsubscript𝑔𝑣g_{v}italic_g start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT represents the number of v𝑣vitalic_v’s cut edges on G𝐺Gitalic_G, vol(G)𝑣𝑜𝑙𝐺vol(G)italic_v italic_o italic_l ( italic_G ) denotes the volume of graph G𝐺Gitalic_G, and gvvol(G)subscript𝑔𝑣𝑣𝑜𝑙𝐺\frac{g_{v}}{vol(G)}divide start_ARG italic_g start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_ARG start_ARG italic_v italic_o italic_l ( italic_G ) end_ARG indicates the probability of a random walk on G𝐺Gitalic_G involving the leaf nodes marked by v𝑣vitalic_v.

Refer to caption
Figure 6: The initialization stage of Algorithm 1, in which a 1-height coding tree is constructed.
Refer to caption
(a) An example of compressing leaf node vαsubscript𝑣𝛼v_{\alpha}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT and vβsubscript𝑣𝛽v_{\beta}italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT.
Refer to caption
(b) The final result after executing lines 3-6.
Figure 7: An illustration of 𝐓.compress()formulae-sequence𝐓𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠\mathbf{T}.compress(\cdot)bold_T . italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( ⋅ ) operation. (a) A single execution of 𝐓.compress(vα,vβ)formulae-sequence𝐓𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠subscript𝑣𝛼subscript𝑣𝛽\mathbf{T}.compress(v_{\alpha},v_{\beta})bold_T . italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ). (b) The final state after executing lines 3-6 in Algorithm 2.
Refer to caption
(a) An example of deleting an intermediate node v𝑣vitalic_v.
Refer to caption
(b) The final result after executing lines 7-10.
Figure 8: An illustration of 𝐓.delete()formulae-sequence𝐓𝑑𝑒𝑙𝑒𝑡𝑒\mathbf{T}.delete(\cdot)bold_T . italic_d italic_e italic_l italic_e italic_t italic_e ( ⋅ ) operation. (a) A single execution of 𝐓.delete(v)formulae-sequence𝐓𝑑𝑒𝑙𝑒𝑡𝑒𝑣\mathbf{T}.delete(v)bold_T . italic_d italic_e italic_l italic_e italic_t italic_e ( italic_v ). (b) The final state after executing lines 7-10 in Algorithm 2.
Refer to caption
Figure 9: An illustration of 𝐓.align()formulae-sequence𝐓𝑎𝑙𝑖𝑔𝑛\mathbf{T}.align()bold_T . italic_a italic_l italic_i italic_g italic_n ( ) operation. 𝐓.align()formulae-sequence𝐓𝑎𝑙𝑖𝑔𝑛\mathbf{T}.align()bold_T . italic_a italic_l italic_i italic_g italic_n ( ) will align the height (depth) of all the leaf nodes to satisfy the definition of coding trees.
Refer to caption
Figure 10: An illustration of lines 3-7 in Algorithm 1.

Appendix B Explanations for The Proposed Algorithms

Definitions for Algorithm 1 and Algorithm 2.

Given a coding tree T=(VT,ET)𝑇subscript𝑉𝑇subscript𝐸𝑇T=(V_{T},E_{T})italic_T = ( italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ), we define some attributes and member functions of T𝑇Titalic_T as follows,

  1. A.

    Given any two nodes vα,vβsubscript𝑣𝛼subscript𝑣𝛽v_{\alpha},v_{\beta}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT VTabsentsubscript𝑉𝑇\in V_{T}∈ italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT. If (α,β)ET𝛼𝛽subscript𝐸𝑇(\alpha,\beta)\in E_{T}( italic_α , italic_β ) ∈ italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, call vαsubscript𝑣𝛼v_{\alpha}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT the parent of vβsubscript𝑣𝛽v_{\beta}italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT, and vβsubscript𝑣𝛽v_{\beta}italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT a child of vαsubscript𝑣𝛼v_{\alpha}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT, which is denoted as vβC(vα)subscript𝑣𝛽𝐶subscript𝑣𝛼v_{\beta}\in C(v_{\alpha})italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∈ italic_C ( italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ), vβ.parent=vαformulae-sequencesubscript𝑣𝛽𝑝𝑎𝑟𝑒𝑛𝑡subscript𝑣𝛼v_{\beta}.parent=v_{\alpha}italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT . italic_p italic_a italic_r italic_e italic_n italic_t = italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT.

  2. B.

    Function compress(,)𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠compress(\cdot,\cdot)italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( ⋅ , ⋅ ). As illustrated in Figure 7(a), given any vα,vβsubscript𝑣𝛼subscript𝑣𝛽v_{\alpha},v_{\beta}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT C(vrT)absent𝐶superscriptsubscript𝑣𝑟𝑇\in C(v_{r}^{T})∈ italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ). compress(vα,vβ)𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠subscript𝑣𝛼subscript𝑣𝛽compress(v_{\alpha},v_{\beta})italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ) will spawn a new node vγsubscript𝑣𝛾v_{\gamma}italic_v start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT, remove vα,vβsubscript𝑣𝛼subscript𝑣𝛽v_{\alpha},v_{\beta}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT from C(vrT)𝐶superscriptsubscript𝑣𝑟𝑇C(v_{r}^{T})italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ), make vγsubscript𝑣𝛾v_{\gamma}italic_v start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT be their parent, and add vγsubscript𝑣𝛾v_{\gamma}italic_v start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT to C(vrT)𝐶superscriptsubscript𝑣𝑟𝑇C(v_{r}^{T})italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ). After that, vα.parent=vβ.parent=vγformulae-sequencesubscript𝑣𝛼𝑝𝑎𝑟𝑒𝑛𝑡subscript𝑣𝛽𝑝𝑎𝑟𝑒𝑛𝑡subscript𝑣𝛾v_{\alpha}.parent=v_{\beta}.parent=v_{\gamma}italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT . italic_p italic_a italic_r italic_e italic_n italic_t = italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT . italic_p italic_a italic_r italic_e italic_n italic_t = italic_v start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT, and vγC(vrT)subscript𝑣𝛾𝐶superscriptsubscript𝑣𝑟𝑇v_{\gamma}\in C(v_{r}^{T})italic_v start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT ∈ italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) while vα,vβC(vrT)subscript𝑣𝛼subscript𝑣𝛽𝐶superscriptsubscript𝑣𝑟𝑇v_{\alpha},v_{\beta}\notin C(v_{r}^{T})italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∉ italic_C ( italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ).

  3. C.

    Function delete()𝑑𝑒𝑙𝑒𝑡𝑒delete(\cdot)italic_d italic_e italic_l italic_e italic_t italic_e ( ⋅ ). As depicted in Figure 8(a), given any vVT,vvrformulae-sequence𝑣subscript𝑉𝑇𝑣subscript𝑣𝑟v\in V_{T},v\neq v_{r}italic_v ∈ italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_v ≠ italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT. delete(v)𝑑𝑒𝑙𝑒𝑡𝑒𝑣delete(v)italic_d italic_e italic_l italic_e italic_t italic_e ( italic_v ) could remove v𝑣vitalic_v from VTsubscript𝑉𝑇V_{T}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and attach all its children to the parent of v𝑣vitalic_v. That is, vμC(v),vμ.parent:=v.parentformulae-sequencefor-allsubscript𝑣𝜇𝐶𝑣subscript𝑣𝜇assign𝑝𝑎𝑟𝑒𝑛𝑡𝑣𝑝𝑎𝑟𝑒𝑛𝑡\forall v_{\mu}\in C(v),v_{\mu}.parent:=v.parent∀ italic_v start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ∈ italic_C ( italic_v ) , italic_v start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT . italic_p italic_a italic_r italic_e italic_n italic_t := italic_v . italic_p italic_a italic_r italic_e italic_n italic_t.

  4. D.

    Function align()𝑎𝑙𝑖𝑔𝑛align()italic_a italic_l italic_i italic_g italic_n ( ). For any leaf node vζVζsubscript𝑣𝜁subscript𝑉𝜁v_{\zeta}\in V_{\zeta}italic_v start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT ∈ italic_V start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT, align()𝑎𝑙𝑖𝑔𝑛align()italic_a italic_l italic_i italic_g italic_n ( ) will insert a new node between vζsubscript𝑣𝜁v_{\zeta}italic_v start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT and vζ.parentformulae-sequencesubscript𝑣𝜁𝑝𝑎𝑟𝑒𝑛𝑡v_{\zeta}.parentitalic_v start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT . italic_p italic_a italic_r italic_e italic_n italic_t until the depth of vζsubscript𝑣𝜁v_{\zeta}italic_v start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT reaches K𝐾Kitalic_K. align()𝑎𝑙𝑖𝑔𝑛align()italic_a italic_l italic_i italic_g italic_n ( ) ensures that all leaf nodes of T𝑇Titalic_T reach the same height (depth) K𝐾Kitalic_K thereby satisfying the definition of a coding tree. Figure 9 shows an example of align()𝑎𝑙𝑖𝑔𝑛align()italic_a italic_l italic_i italic_g italic_n ( ) operation.

  5. E.

    Function merge()𝑚𝑒𝑟𝑔𝑒merge(\cdot)italic_m italic_e italic_r italic_g italic_e ( ⋅ ). For any node v^VT^𝑣subscript𝑉𝑇\hat{v}\in V_{T}over^ start_ARG italic_v end_ARG ∈ italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, Algorithm1(v^)𝐴𝑙𝑔𝑜𝑟𝑖𝑡𝑚1^𝑣Algorithm~{}\ref{alg:1}(\hat{v})italic_A italic_l italic_g italic_o italic_r italic_i italic_t italic_h italic_m ( over^ start_ARG italic_v end_ARG ) returns a new coding tree 𝐓𝐓\mathbf{T}bold_T with height equals to 2. T.merg(𝐓)formulae-sequence𝑇𝑚𝑒𝑟𝑔𝐓T.merg(\mathbf{T})italic_T . italic_m italic_e italic_r italic_g ( bold_T ) will replace the sub-tree of T𝑇Titalic_T rooted by v^^𝑣\hat{v}over^ start_ARG italic_v end_ARG with 𝐓𝐓\mathbf{T}bold_T.

Note that all the operations above update ETsubscript𝐸𝑇E_{T}italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT accordingly. In no case does ETsubscript𝐸𝑇E_{T}italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT contain self-loops or skip connections. That is, for any (vα,vβ)ETsubscript𝑣𝛼subscript𝑣𝛽subscript𝐸𝑇(v_{\alpha},v_{\beta})\in E_{T}( italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ) ∈ italic_E start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, |vα.heightvβ.height|1|v_{\alpha}.height-v_{\beta}.height|\equiv 1| italic_v start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT . italic_h italic_e italic_i italic_g italic_h italic_t - italic_v start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT . italic_h italic_e italic_i italic_g italic_h italic_t | ≡ 1.

Illustrations for Algorithm 1 and Algorithm  2.

Here, we present several diagrams to deliver a running example of proposed algorithms.

As shown in Figure 6, the original graph G𝐺Gitalic_G is fed into Algorithm 1 and initialized as a 1-height coding tree T𝑇Titalic_T, in which all nodes in VGsubscript𝑉𝐺V_{G}italic_V start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT are treated as leaf node and directly connect to a new root node vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. Thereafter, Algorithm 1 will call Algorithm 2 several times to construct a coding tree of the specified height K𝐾Kitalic_K.

Each invocation of Algorithm 2 takes a non-leaf node 𝐯𝐯\mathbf{v}bold_v in tree T𝑇Titalic_T as input and yields a (sub-)coding tree 𝐓𝐓\mathbf{T}bold_T with a height of 2 wherein 𝐯𝐯\mathbf{v}bold_v acting as the root. Algorithm 2 first initializes 𝐓𝐓\mathbf{T}bold_T in a similar procedure to that illustrated in Figure 6. Subsequently, in the first while𝑤𝑖𝑙𝑒whileitalic_w italic_h italic_i italic_l italic_e loop (lines 3-6), we systematically compress the structural entropy by iteratively combining two children of root node 𝐯𝐯\mathbf{v}bold_v with 𝐓.compress(,)formulae-sequence𝐓𝑐𝑜𝑚𝑝𝑟𝑒𝑠𝑠\mathbf{T}.compress(\cdot,\cdot)bold_T . italic_c italic_o italic_m italic_p italic_r italic_e italic_s italic_s ( ⋅ , ⋅ ), prioritizing those nodes resulting in the largest structural entropy reduction. Ultimately, the maximal reduction in structural entropy is achieved, resulting in the creation of a full-height binary tree, as depicted in Figure 7(b). Given that the full-height binary tree may exceed the specified height of 2, we rectify this by condensing the tree through the invocation of 𝐓.delete()formulae-sequence𝐓𝑑𝑒𝑙𝑒𝑡𝑒\mathbf{T}.delete(\cdot)bold_T . italic_d italic_e italic_l italic_e italic_t italic_e ( ⋅ ) in the second while𝑤𝑖𝑙𝑒whileitalic_w italic_h italic_i italic_l italic_e loop (lines 7-10). The complete deletion process is illustrated in Figure 8(b). It is important to note that after condensation, tree 𝐓𝐓\mathbf{T}bold_T comprises leaf nodes with varying heights, which violates the definition of coding trees. To address this, in line 11, we employ 𝐓.align()formulae-sequence𝐓𝑎𝑙𝑖𝑔𝑛\mathbf{T}.align()bold_T . italic_a italic_l italic_i italic_g italic_n ( ) to introduce inter-nodes. An example of 𝐓.align()formulae-sequence𝐓𝑎𝑙𝑖𝑔𝑛\mathbf{T}.align()bold_T . italic_a italic_l italic_i italic_g italic_n ( ) is illustrated in Figure 9.

Once 𝐓𝐓\mathbf{T}bold_T returned, Algorithm 1 merges 𝐓𝐓\mathbf{T}bold_T into T𝑇Titalic_T at the original position of 𝐯𝐯\mathbf{v}bold_v. Since 𝐯𝐯\mathbf{v}bold_v is selected from either vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT or VT1superscriptsubscript𝑉𝑇1V_{T}^{1}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT, both of which derive sub-tree(s) with height 1, merging 𝐓𝐓\mathbf{T}bold_T of height 2 will increase the height of T𝑇Titalic_T by 1. As depicted in Figure 10, Algorithm 1 aims to iteratively invoke Algorithm 2 and integrate the returned 𝐓𝐓\mathbf{T}bold_T until T𝑇Titalic_T reaches height K𝐾Kitalic_K. The decision of whether Algorithm 2 is invoked on vrTsuperscriptsubscript𝑣𝑟𝑇v_{r}^{T}italic_v start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT or VT1superscriptsubscript𝑉𝑇1V_{T}^{1}italic_V start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT depends on which of the two selections results in less structural entropy for the merged coding tree T𝑇Titalic_T.

Appendix C Proof for Theorem 1

Proof. According to Li and Pan (2016), structural entropy decodes the essential structure of the original system while measuring the structural information to support the semantic modeling of the system. Thus, we have,

I((𝒢𝒟);Y)=𝐼superscriptsubscript𝒢𝒟𝑌absent\displaystyle I(\mathcal{F}^{*}(\mathcal{G_{L}}\circ\mathcal{D});Y)=italic_I ( caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) = I((𝒯𝒟);Y).𝐼superscriptsubscript𝒯𝒟𝑌\displaystyle I(\mathcal{F}^{*}(\mathcal{T_{L}}\circ\mathcal{D});Y).italic_I ( caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_T start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) . (19)

Considering the data processing inequality (Cover and Thomas, 2006) for data augmentation, we would have,

I((𝒢,𝒟);Y)I(θ(𝒢,𝒟);Y),𝐼subscript𝒢𝒟𝑌𝐼𝜃subscript𝒢𝒟𝑌\displaystyle I((\mathcal{G_{L}},\mathcal{D});Y)\geq I(\theta(\mathcal{G_{L}},% \mathcal{D});Y),italic_I ( ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT , caligraphic_D ) ; italic_Y ) ≥ italic_I ( italic_θ ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT , caligraphic_D ) ; italic_Y ) , (20)

where θ𝜃\thetaitalic_θ is a general data augmentation function acting on (𝒢𝒟)subscript𝒢𝒟(\mathcal{G_{L}}\circ\mathcal{D})( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ). Integrating the above equations, we have,

I((𝒯𝒟);Y)=(a)𝐼subscript𝒯𝒟𝑌𝑎\displaystyle I((\mathcal{T_{L}}\circ\mathcal{D});Y)\overset{(a)}{=}italic_I ( ( caligraphic_T start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) start_OVERACCENT ( italic_a ) end_OVERACCENT start_ARG = end_ARG I((𝒯𝒟);Y)𝐼superscriptsubscript𝒯𝒟𝑌\displaystyle I(\mathcal{F}^{*}(\mathcal{T_{L}}\circ\mathcal{D});Y)italic_I ( caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_T start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) (21)
=(b)𝑏\displaystyle\overset{(b)}{=}start_OVERACCENT ( italic_b ) end_OVERACCENT start_ARG = end_ARG I((𝒢𝒟);Y)𝐼superscriptsubscript𝒢𝒟𝑌\displaystyle I(\mathcal{F}^{*}(\mathcal{G_{L}}\circ\mathcal{D});Y)italic_I ( caligraphic_F start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) (22)
=(a)𝑎\displaystyle\overset{(a)}{=}start_OVERACCENT ( italic_a ) end_OVERACCENT start_ARG = end_ARG I((𝒢𝒟);Y)𝐼subscript𝒢𝒟𝑌\displaystyle I((\mathcal{G_{L}}\circ\mathcal{D});Y)italic_I ( ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ; italic_Y ) (23)
(c)𝑐\displaystyle\overset{(c)}{\geq}start_OVERACCENT ( italic_c ) end_OVERACCENT start_ARG ≥ end_ARG I(θ(𝒢𝒟));Y).\displaystyle I(\theta(\mathcal{G_{L}}\circ\mathcal{D}));Y).italic_I ( italic_θ ( caligraphic_G start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ∘ caligraphic_D ) ) ; italic_Y ) . (24)

where (a)𝑎(a)( italic_a ), (b)𝑏(b)( italic_b ), and (c)𝑐(c)( italic_c ) could be referred to Equation 12, Equation 19, and Equation 20, respectively. Here, we have concluded the proof that the information encoded by HILL is lossless, which is the upper bound of any other augmentation-based methods. \square

Ablation Models WOS RCV1-v2 NYTimes
Micro-F1 (ΔΔ\Deltaroman_Δ) Macro-F1 (ΔΔ\Deltaroman_Δ) Micro-F1 (ΔΔ\Deltaroman_Δ) Macro-F1 (ΔΔ\Deltaroman_Δ) Micro-F1 (ΔΔ\Deltaroman_Δ) Macro-F1 (ΔΔ\Deltaroman_Δ)
HILL 87.28 81.77 87.31 70.12 80.47 69.96
r.p. GIN (Xu et al., 2019) 86.67 (-0.61) 80.53 (-1.24) 86.48 (-0.83) 69.30 (-0.82) 79.43 (-1.04) 68.53 (-1.43)
r.p. GAT (Velickovic et al., 2018) 86.58 (-0.70) 80.41 (-1.36) 86.51 (-0.80) 68.12 (-2.00) 79.46 (-1.01) 69.11 (-0.85)
r.p. GCN (Kipf and Welling, 2017) 86.40 (-0.88) 80.47 (-1.30) 86.24 (-1.07) 68.71 (-1.41) 79.42 (-1.05) 69.33 (-0.63)
r.m. Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT 86.96 (-0.32) 81.37 (-0.40) 86.51 (-0.80) 68.60 (-1.52) 79.64 (-0.83) 68.90 (-1.06)
r.m. Algorithm  1 86.21 (-1.07) 79.88 (-1.89) 86.67 (-0.64) 67.92 (-2.20) 79.63 (-0.84) 68.74 (-1.22)
Table 4: Performance when replacing or removing some components of HILL on the test set of WOS, RCV1-v2, and NYTimes. r.p. stands for the replacement and r.m. stands for remove. The results of r.m. Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT and r.m. Algorithm 1 are both obtained from 5 runs under different random seeds, each of which is distinguished from HILL’s at a significant level of 95% under a one-sample t-test. ΔΔ\Deltaroman_Δ denotes the decrements.

Appendix D Ablation Studies on WOS and NYTimes

Table 4 present the results of the ablation experiments on WOS and NYTimes, respectively. The experimental setups are consistent with Section 4.3. From the results, we observe that the hierarchical representation learning module proposed in this paper outperforms all other structure encoder variants on WOS (Kowsari et al., 2017) and NYTimes (Sandhaus, Evan, 2008). Additionally, when HILL skips Algorithm 1 and completes the representation learning directly on the initial encoding tree, a significant performance drop is observed on all three datasets.

Although the results of the r.m. Lclrsubscript𝐿𝑐𝑙𝑟L_{clr}italic_L start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT for all three datasets fall short of the original HILL results, the performance gap on WOS is less pronounced compared to RCV1-v2 and NYTimes. Notably, as mentioned in Section 4.1, the contrastive learning weight λclrsubscript𝜆𝑐𝑙𝑟\lambda_{clr}italic_λ start_POSTSUBSCRIPT italic_c italic_l italic_r end_POSTSUBSCRIPT employed on WOS is also the smallest among the three datasets (0.001 vs. 0.1 and 0.3). However, this does not imply that introducing contrastive learning to WOS is a failure, as contrastive learning still results in substantial improvements. A more appropriate statement is that WOS reaps lower benefits due to contrastive learning. We attribute this to two factors. Firstly, there exists a substantial distributional gap between the WOS data and the pre-training data of BERT (Devlin et al., 2019). As depicted in Table 2, the performance gains achieved by replacing TextRCNN (Lai et al., 2015) in each baseline with BERT on WOS are comparatively lower than those on the other datasets. HTCInfoMAX (Deng et al., 2021) even exhibits a performance degradation. Secondly, the maximum label depth of WOS is 2, significantly lower than that of RCV1-v2 (4) and NYTimes (8). This suggests that the label hierarchy of WOS contains less essential structural information, resulting in a comparatively smaller gain in the hierarchy-aware contrastive learning process. Nevertheless, the experimental results withstand the t-test, adequately demonstrating the advantages of contrastive learning.