License: CC BY 4.0
arXiv:2312.10917v1 [cs.LG] 18 Dec 2023

Semi-Supervised Clustering via Structural Entropy with Different Constraints

Guangjie Zeng (SKLSDE, Beihang University), Hao Peng (SKLSDE, Beihang University), Angsheng Li (SKLSDE, Beihang University), Zhiwei Liu (Salesforce AI Research; [email protected]), Runze Yang (SKLSDE, Beihang University), Chunyang Liu (Didi Chuxing; [email protected]), Lifang He (Lehigh University; [email protected]). Beihang University emails: {zengguangjie, penghao, angsheng, yangrunze}@buaa.edu.cn
Abstract

Semi-supervised clustering techniques have emerged as valuable tools for leveraging prior information in the form of constraints to improve the quality of clustering outcomes. Despite the proliferation of such methods, the ability to seamlessly integrate various types of constraints remains limited. While structural entropy has proven to be a powerful clustering approach with wide-ranging applications, it has lacked a variant capable of accommodating these constraints. In this work, we present Semi-supervised clustering via Structural Entropy (SSE), a novel method that can incorporate different types of constraints from diverse sources to perform both partitioning and hierarchical clustering. Specifically, we formulate a uniform view for the commonly used pairwise and label constraints for both types of clustering. Then, we design objectives that incorporate these constraints into structural entropy and develop tailored algorithms for their optimization. We evaluate SSE on nine clustering datasets and compare it with eleven semi-supervised partitioning and hierarchical clustering methods. Experimental results demonstrate the superiority of SSE on clustering accuracy with different types of constraints. Additionally, the functionality of SSE for biological data analysis is demonstrated by cell clustering experiments conducted on four single-cell RNA-seq datasets.

Keywords: Semi-Supervised Clustering, Structural Entropy, Biological Data Analysis.

1 Introduction

Clustering is a key technique in machine learning that aims to group instances according to their similarity [8]. Yet, unsupervised clustering alone often fails to provide the desired level of accuracy and may not meet the diverse requirements of various users. In contrast, semi-supervised clustering harnesses the power of prior information in the form of constraints, significantly boosting clustering accuracy and aligning more effectively with user preferences [18].

Numerous semi-supervised clustering methods based on different classical unsupervised clustering methods have been proposed in recent years. The challenges of semi-supervised clustering are 1) to design an objective function that integrates constraints into clustering methods and 2) to optimize the objective effectively and efficiently. The most widely used way to utilize the prior information is to add a regularization term to the original clustering objective [18, 12]. Alternatively, some methods propagate this information to augment the dataset itself [16, 14]. The provided prior information can take various constraint forms, such as pairwise constraints [24], label constraints [17], and triplet constraints [31]. Many existing semi-supervised clustering methods are tailored to handle a single type of constraint. Yet, it is common for prior information to come in diverse forms from multiple sources. The inability to deal with different types of constraints limits the generalization ability of these methods.

Concerning the integration of different types of constraints into the semi-supervised clustering methods, earlier methods [7, 26] discuss them case by case with different algorithms. However, these methods lack a unified view of constraints and are unable to deal with mixed types of constraints. Bai et al. resolved this issue via a unified formulation of pairwise constraints and label constraints [1] and proposed the SC-MPI algorithm to optimize them simultaneously. However, SC-MPI, which is designed for partitioning clustering, cannot perform hierarchical clustering and thus has limited generalization ability. Hierarchical clustering does not require specifying the number of clusters in advance, and it produces a dendrogram that shows the nested structure of the data. This is useful for many applications, such as finding cell subtypes in biological data analysis [5].

To address the aforementioned issues, we propose a more general Semi-supervised clustering method via Structural Entropy with different constraints, namely SSE, for both partitioning clustering and hierarchical clustering. First, we construct a data graph $G$ and a relation graph $G'$ sharing the same set of vertices to represent the input data and the prior information in constraints, respectively. The vertices and edge weights of $G$ are the data points and the similarities between them, respectively. Different types of constraints are formulated in a uniform view and stored in $G'$, with positive edge weights representing must-link relationships and negative edge weights representing cannot-link relationships between data points. Second, we devise the objective of two-dimensional (2-d) SSE for semi-supervised partitioning clustering by adding a penalty term to the objective of 2-d structural entropy, and then optimize it through two modified operators, merging and moving. Third, we devise the objective of high-dimensional (high-d) SSE for semi-supervised hierarchical clustering by extending the objective of 2-d SSE, and then optimize it through two modified operators, stretching and compressing. Stretching yields a binary encoding tree, and compressing yields an encoding tree of a specified height. The source code is available on GitHub (https://github.com/SELGroup/SSE).

We comprehensively evaluate SSE against semi-supervised clustering methods under two types of constraints. The results show that SSE performs better under both types. We also conduct cell clustering experiments on four single-cell RNA-seq datasets, demonstrating the functionality of SSE for biological data analysis. The main contributions of this paper are summarized as follows: (1) We devise a uniform formulation for pairwise constraints and label constraints and use them in a penalty term to form the objective of SSE. (2) We design efficient algorithms to optimize the objective of SSE to enable semi-supervised partitioning clustering and hierarchical clustering. (3) Extensive experiments on nine clustering datasets and four single-cell RNA-seq datasets indicate that SSE achieves the best performance among the compared semi-supervised clustering methods and is effective for biological data analysis.

2 Structural Entropy

We provide a brief introduction to structural entropy [15] before presenting our model. Intuitively, structural entropy methods encode tree structures by characterizing the uncertainty of the hierarchical topology. The structural entropy of a graph $G$ is defined as the minimum total number of bits required to determine the codewords of nodes in $G$. Structural entropy has been applied successfully to traffic forecasting [32], social event detection [3], and reinforcement learning [29, 30]. By minimizing the structural entropy of a given graph $G$, the hierarchical clustering result of the vertices in $G$ is retained by the associated encoding tree.

Encoding tree. Let $G=(V,E,\mathbf{W})$ be an undirected weighted graph, where $V=\{v_1,\dots,v_n\}$ is the vertex set, $E$ is the edge set, and $\mathbf{W}\in\mathbb{R}^{n\times n}$ is the edge weight matrix. The encoding tree $\mathcal{T}$ of $G$ is a hierarchical rooted tree defined as follows: (1) Each tree node $\alpha\in\mathcal{T}$ is associated with a vertex subset $T_{\alpha}\subseteq V$. (2) The root node $\lambda$ of $\mathcal{T}$ is associated with the vertex set $V$, i.e., $T_{\lambda}=V$. (3) For each $\alpha\in\mathcal{T}$, its immediate successors are labeled $\alpha^{\wedge}\langle i\rangle$, ordered from left to right as $i$ increases, and its immediate predecessor is written as $\alpha^{-}$. (4) For each $\alpha\in\mathcal{T}$ with $L$ immediate successors, the vertex subsets $T_{\alpha^{\wedge}\langle i\rangle}$ are disjoint and $T_{\alpha}=\cup_{i=1}^{L}T_{\alpha^{\wedge}\langle i\rangle}$. (5) For each leaf node $\nu\in\mathcal{T}$, $T_{\nu}$ contains only one vertex in $V$.
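
As an illustration only, the five conditions above can be turned into a small validity check. The sketch below is not taken from the SSE code base; the names `TreeNode` and `is_valid_encoding_tree` are hypothetical, and vertices are assumed to be represented by integer indices.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """A tree node alpha; `vertices` plays the role of the subset T_alpha."""
    vertices: frozenset
    children: list = field(default_factory=list)


def is_valid_encoding_tree(root, all_vertices):
    """Check conditions (2), (4) and (5) of the encoding tree definition."""
    if root.vertices != frozenset(all_vertices):
        return False  # (2): the root must cover the whole vertex set V
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            if len(node.vertices) != 1:
                return False  # (5): a leaf holds exactly one vertex
            continue
        union = frozenset().union(*(c.vertices for c in node.children))
        size_sum = sum(len(c.vertices) for c in node.children)
        # (4): children are disjoint (sizes add up) and their union is T_alpha
        if union != node.vertices or size_sum != len(node.vertices):
            return False
        stack.extend(node.children)
    return True


# Height-2 tree over V = {0, 1, 2, 3} with modules {0, 1} and {2, 3}.
leaf = [TreeNode(frozenset([i])) for i in range(4)]
root = TreeNode(frozenset(range(4)), [
    TreeNode(frozenset([0, 1]), leaf[:2]),
    TreeNode(frozenset([2, 3]), leaf[2:]),
])
print(is_valid_encoding_tree(root, range(4)))
```

Conditions (1) and (3) are structural (each node carries a subset, and children are ordered), so they are reflected in the data layout rather than checked explicitly.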

$K$-D Structural Entropy. Given an arbitrary rooted encoding tree $\mathcal{T}$ of a graph $G$, the structural entropy of $G$ on $\mathcal{T}$ measures the amount of complexity remaining in $G$ after it is reduced by $\mathcal{T}$. For each non-root node $\alpha\in\mathcal{T}$, its assigned structural entropy is defined as:

(2.1) $$\mathcal{H}^{\mathcal{T}}(G;\alpha)=-\frac{g_{\alpha}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{\alpha}}{\mathcal{V}_{\alpha^{-}}},$$

where $g_{\alpha}$ is the cut, i.e., the weight sum of edges between nodes in and not in $T_{\alpha}$, and $\mathcal{V}_{\alpha}$ and $\mathcal{V}_{G}$ are the volumes, i.e., the sums of node degrees in $T_{\alpha}$ and $G$, respectively. The structural entropy of $G$ given by $\mathcal{T}$ is defined as:

(2.2) $$\mathcal{H}^{\mathcal{T}}(G)=\sum_{\alpha\in\mathcal{T},\,\alpha\neq\lambda}\mathcal{H}^{\mathcal{T}}(G;\alpha).$$

To meet the requirements of downstream applications, the $K$-dimensional structural entropy of $G$ is defined as:

(2.3) $$\mathcal{H}^{K}(G)=\min_{\mathcal{T}}\{\mathcal{H}^{\mathcal{T}}(G)\},$$

where $\mathcal{T}$ ranges over all encoding trees whose heights are at most $K$.
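
To make Eqs. (2.1) and (2.2) concrete, here is a minimal sketch that evaluates the structural entropy of a graph on a given encoding tree. It is an illustration rather than the authors' implementation: the nested-list tree representation (an `int` is a leaf, a list is an internal node) and the function names are assumptions, and the graph is assumed to be symmetric, free of self-loops, and without isolated vertices.

```python
import math


def leaves(tree):
    """Vertex indices under a nested-list tree node (an int is a leaf)."""
    if isinstance(tree, int):
        return [tree]
    return [v for child in tree for v in leaves(child)]


def structural_entropy(W, tree):
    """H^T(G) from Eqs. (2.1)-(2.2) for a symmetric weight matrix W."""
    n = len(W)
    degree = [sum(row) for row in W]
    vol_G = sum(degree)

    def visit(node, parent_vol):
        members = set(leaves(node))
        vol = sum(degree[v] for v in members)
        # Cut g_alpha: total weight of edges leaving the node's vertex subset.
        cut = sum(W[u][v] for u in members for v in range(n) if v not in members)
        h = -(cut / vol_G) * math.log2(vol / parent_vol)
        if not isinstance(node, int):
            h += sum(visit(child, vol) for child in node)
        return h

    # The root contributes nothing itself; sum over all of its descendants.
    return sum(visit(child, vol_G) for child in tree)


# Two unit-weight triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    W[u][v] = W[v][u] = 1.0

flat = structural_entropy(W, [0, 1, 2, 3, 4, 5])           # height-1 tree
nested = structural_entropy(W, [[0, 1, 2], [3, 4, 5]])     # height-2 tree
print(flat, nested)
```

On this toy graph the height-2 tree that groups each triangle into one module yields a lower value than the flat tree, which is the behavior that the minimization in Eq. (2.3) exploits.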

2-D Structural Entropy. One special case of $K$-d structural entropy is 2-d structural entropy, where the encoding tree represents a graph partitioning. A 2-d encoding tree $\mathcal{T}$ can be formulated as a graph partitioning $\mathcal{P}=\{X_{1},X_{2},\dots,X_{L}\}$ of $V$, where $X_{i}$ is a vertex subset, called a module, associated with the $i$-th child of the root $\lambda$. The structural entropy of $G$ given by $\mathcal{P}$ is defined as:

(2.4) $$\begin{aligned}\mathcal{H}^{\mathcal{P}}(G)={}&-\sum_{X\in\mathcal{P}}\sum_{v_{i}\in X}\frac{g_{i}}{\mathcal{V}_{G}}\log_{2}\frac{d_{i}}{\mathcal{V}_{X}}\\&-\sum_{X\in\mathcal{P}}\frac{g_{X}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{X}}{\mathcal{V}_{G}},\end{aligned}$$

where $d_{i}$ is the degree of vertex $v_{i}$, $g_{i}$ is the cut, i.e., the weight sum of edges connecting $v_{i}$ and other vertices, $\mathcal{V}_{X}$ and $\mathcal{V}_{G}$ are the volumes, i.e., the sums of node degrees in module $X$ and graph $G$, respectively, and $g_{X}$ is the cut, i.e., the weight sum of edges between vertices in and not in module $X$.
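
Eq. (2.4) can be evaluated directly for any candidate partition. The following sketch is an illustration under stated assumptions, not the paper's code: the function name `entropy_2d` is hypothetical, a partition is given as a list of lists of vertex indices, and the graph has no self-loops, so the vertex cut $g_i$ coincides with the degree $d_i$.

```python
import math


def entropy_2d(W, partition):
    """H^P(G) from Eq. (2.4); `partition` is a list of lists of vertex indices."""
    n = len(W)
    degree = [sum(row) for row in W]
    vol_G = sum(degree)
    total = 0.0
    for module in partition:
        inside = set(module)
        vol_X = sum(degree[v] for v in module)
        g_X = sum(W[u][v] for u in module for v in range(n) if v not in inside)
        # Vertex-level term; without self-loops the cut g_i equals the degree d_i.
        for v in module:
            total -= (degree[v] / vol_G) * math.log2(degree[v] / vol_X)
        # Module-level term.
        total -= (g_X / vol_G) * math.log2(vol_X / vol_G)
    return total


# Two unit-weight triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    W[u][v] = W[v][u] = 1.0

aligned = entropy_2d(W, [[0, 1, 2], [3, 4, 5]])   # modules follow the triangles
mixed = entropy_2d(W, [[0, 1, 3], [2, 4, 5]])     # modules cut through both triangles
print(aligned, mixed)
```

The partition aligned with the two triangles has the smaller entropy, since fewer edge weights cross module boundaries.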

3 Methodology

In this section, we present the proposed SSE algorithm for semi-supervised clustering. The framework of SSE is depicted in Figure 1. SSE has three components: graph construction, semi-supervised partitioning clustering, and semi-supervised hierarchical clustering. Input data and constraints are transformed into two different graphs sharing the same vertex set and then used to perform semi-supervised partitioning clustering and semi-supervised hierarchical clustering through 2-d SSE and high-d SSE minimization, respectively.

Figure 1: Overview of SSE. (I) Two graphs $G$ and $G'$ are constructed from input data and constraints, respectively. (II) Semi-supervised partitioning clustering is performed through two operators, merging and moving. (III) Semi-supervised hierarchical clustering is performed through two operators, stretching and compressing.

3.1 Uniform Formulation of Constraints.

Consider a graph $G=(V,E,\mathbf{W})$ associated with a given dataset $\mathcal{X}=\{x_{1},x_{2},\dots,x_{n}\}$, where each $x_{i}$ is a data point, the vertices in $V=\{v_{1},v_{2},\dots,v_{n}\}$ correspond to the data points in $\mathcal{X}$, the edges in $E$ connect similar data points, and the edge weights in $\mathbf{W}$ represent the similarity of data points. We aim to partition the vertices of $G$ using prior information given in the form of constraints, thereby achieving semi-supervised data clustering. The pairwise constraints and label constraints are formulated as follows.

Pairwise constraints reveal the relationship between a pair of vertices in $G$. They consist of a set of must-link constraints $M=\{(v_{i},v_{j}):\,l_{i}=l_{j}\}$, indicating that the vertex pair $(v_{i},v_{j})$ must belong to the same cluster, and a set of cannot-link constraints $C=\{(v_{i},v_{j}):\,l_{i}\neq l_{j}\}$, indicating that the vertex pair $(v_{i},v_{j})$ must belong to different clusters, where $l_{i}$ is the cluster indicator of $v_{i}$. Pairwise constraints can be stored in a relation graph $G'=(V,E',\mathbf{W}')$, which shares the same vertex set with $G$. If there exists a vertex pair $(v_{i},v_{j})\in M$, an edge exists in $E'$ with a positive value $\gamma_{M}$ added to the edge weight $\mathbf{W}'_{ij}$. If there exists a vertex pair $(v_{i},v_{j})\in C$, an edge exists in $E'$ with a negative value $\gamma_{C}$ added to the edge weight $\mathbf{W}'_{ij}$. The values of $\mathbf{W}$ and $\mathbf{W}'$ are set in Implementation Details in Section 4.

Label constraints reveal the relationship between vertices in $G$ and ground truth class labels. They include a set of positive-label constraints $P=\{(v_{i},y_{m}):\,v_{i}\in y_{m}\}$, indicating that the true class label of $v_{i}$ is $y_{m}$, and a set of negative-label constraints $N=\{(v_{i},y_{m}):\,v_{i}\notin y_{m}\}$, indicating that the true class label of $v_{i}$ is not $y_{m}$. To form a uniform representation of constraints, we convert label constraints into pairwise constraints, which are more compatible with structural entropy. For two vertices $v_{i}$ and $v_{j}$, the conversion rules are as follows: (1) If both have positive constraints with the same label, an edge exists in $E'$ with a positive value $\gamma_{M}$ added to the edge weight $\mathbf{W}'_{ij}$. (2) If both have positive constraints with different labels, an edge exists in $E'$ with a negative value $\gamma_{C}$ added to the edge weight $\mathbf{W}'_{ij}$. (3) If one has a positive constraint and the other has a negative constraint with the same label, an edge exists in $E'$ with a negative value $\gamma_{C}$ added to the edge weight $\mathbf{W}'_{ij}$.

The constraints are stored in the relation graph $G'$ after construction, where a positive value indicates a must-link relationship and a negative value indicates a cannot-link relationship. However, this relation graph can be further improved by exploiting constraint transitivity and entailment [23]. We apply them sequentially on $G'$ after constructing it.
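
The construction of the relation graph described above can be sketched as follows. This is a hedged illustration, not the released implementation: the function name `build_relation_weights`, its argument layout, and the default values `gamma_m = 1.0` and `gamma_c = -1.0` are assumptions (the paper sets the actual values in Section 4), and the transitivity and entailment step of [23] is deliberately left out.

```python
def build_relation_weights(n, must_link=(), cannot_link=(),
                           positive=(), negative=(),
                           gamma_m=1.0, gamma_c=-1.0):
    """Return the symmetric weight matrix W' of the relation graph G'.

    must_link / cannot_link: iterables of vertex index pairs (i, j).
    positive / negative: iterables of (vertex index, label) pairs meaning
    "vertex has this label" / "vertex does not have this label".
    gamma_m and gamma_c are placeholder values chosen for illustration.
    """
    W_rel = [[0.0] * n for _ in range(n)]

    def add(i, j, value):
        # Values are accumulated, matching "added to the edge weight".
        W_rel[i][j] += value
        W_rel[j][i] += value

    for i, j in must_link:
        add(i, j, gamma_m)
    for i, j in cannot_link:
        add(i, j, gamma_c)

    pos = list(positive)
    for a in range(len(pos)):
        for b in range(a + 1, len(pos)):
            (i, yi), (j, yj) = pos[a], pos[b]
            if i != j:
                # Rule (1): same label -> gamma_m; rule (2): different -> gamma_c.
                add(i, j, gamma_m if yi == yj else gamma_c)
    for i, yi in pos:
        for j, yj in negative:
            if i != j and yi == yj:
                add(i, j, gamma_c)  # Rule (3): "is y" versus "is not y".
    return W_rel


W_rel = build_relation_weights(
    4,
    must_link=[(0, 1)],
    cannot_link=[(1, 2)],
    positive=[(0, "a"), (3, "a"), (2, "b")],
    negative=[(1, "b")],
)
for row in W_rel:
    print(row)
```

In this toy call the pair (1, 2) receives a cannot-link value twice, once from the explicit pairwise constraint and once from rule (3), so its weight accumulates.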

3.2 2-D SSE.

In this subsection, we present 2-d SSE, modified from 2-d structural entropy, to perform semi-supervised partitioning clustering. For a graph $G$ associated with a dataset $\mathcal{X}$ with different types of constraints, we transform all types of constraints into a uniform formulation and store them in a relation graph $G'$. We aim to find a graph partitioning $\mathcal{P}$ of $G$ that minimizes the structural entropy of $G$ while also minimizing the number of violated constraints. The optimization objective of 2-d SSE is defined as follows:

(3.5) $$\mathcal{L}^{\mathcal{P}}(G,G')=\mathcal{H}^{\mathcal{P}}(G)+\phi\,\mathcal{E}^{\mathcal{P}}(G,G'),$$

where $\mathcal{E}^{\mathcal{P}}(G,G')$ is a penalty term for constraint violation, defined as:

(3.6) $$\mathcal{E}^{\mathcal{P}}(G,G')=-\sum_{X\in\mathcal{P}}\frac{g'_{X}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{X}}{\mathcal{V}_{G}},$$

where $g'_{X}$ is the weight sum of edges in $G'$ between vertices in and not in module $X$, and the other notations have the same meaning as in Eq. (2.4).

The intuition of the penalty term is that we modify $g_{X}$, i.e., the cut of module $X$ in Eq. (2.4), according to the constraints: it is increased when must-link constraints are violated and decreased when cannot-link constraints are satisfied. A positive value of $\mathbf{W}'_{ij}$ in $G'$ means $v_{i}$ and $v_{j}$ should belong to the same module, and $\mathcal{E}^{\mathcal{P}}>0$ if they do not, leading to a penalty added to $\mathcal{L}^{\mathcal{P}}$. A negative value of $\mathbf{W}'_{ij}$ in $G'$ means $v_{i}$ and $v_{j}$ should belong to different modules, and $\mathcal{E}^{\mathcal{P}}<0$ if they do, leading to a reward added to $\mathcal{L}^{\mathcal{P}}$. When no constraint exists, i.e., $\mathcal{E}^{\mathcal{P}}=0$, we only minimize unsupervised 2-d structural entropy. In all, $\mathcal{E}^{\mathcal{P}}$ penalizes modules that violate must-link constraints and rewards modules that satisfy cannot-link constraints.
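
The sign behavior of the penalty term can be checked numerically. The sketch below evaluates Eqs. (3.5) and (3.6) on a toy graph; it is an illustration under stated assumptions rather than the authors' code. The function names, the default `phi = 1.0`, and the constraint weights of $+1$ and $-1$ are all hypothetical choices, and the data graph is assumed to have no self-loops.

```python
import math


def cut(W, module):
    """Weight sum of edges between vertices in and not in `module`."""
    inside = set(module)
    return sum(W[u][v] for u in module for v in range(len(W)) if v not in inside)


def sse_objective_2d(W, W_rel, partition, phi=1.0):
    """Return (L^P, H^P, E^P) following Eqs. (3.5), (2.4) and (3.6)."""
    degree = [sum(row) for row in W]
    vol_G = sum(degree)
    entropy = 0.0
    penalty = 0.0
    for module in partition:
        vol_X = sum(degree[v] for v in module)
        log_ratio = math.log2(vol_X / vol_G)
        for v in module:
            entropy -= (degree[v] / vol_G) * math.log2(degree[v] / vol_X)
        entropy -= (cut(W, module) / vol_G) * log_ratio
        # Same form as the module term, but with the cut taken in G'.
        penalty -= (cut(W_rel, module) / vol_G) * log_ratio
    return entropy + phi * penalty, entropy, penalty


# Data graph: unit-weight 4-cycle 0-1-2-3-0, so both splits below are symmetric.
W = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
W_rel = [[0.0] * 4 for _ in range(4)]
W_rel[0][1] = W_rel[1][0] = 1.0    # must-link (v0, v1), weight +1 assumed
W_rel[1][2] = W_rel[2][1] = -1.0   # cannot-link (v1, v2), weight -1 assumed

good = sse_objective_2d(W, W_rel, [[0, 1], [2, 3]])  # satisfies both constraints
bad = sse_objective_2d(W, W_rel, [[0, 3], [1, 2]])   # violates both constraints
print(good, bad)
```

Because the 4-cycle is symmetric, the two partitions have the same unsupervised entropy, and only the penalty term separates them: it is negative (a reward) for the partition that respects the constraints and positive for the one that violates them.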

Minimizing 2-D SSE. We minimize 2-d SSE via two operators, merging [15] and moving, on the encoding tree $\mathcal{T}$. For two sister nodes $\alpha,\beta\in\mathcal{T}$ with associated vertex subsets $X$ and $Y$, node merging is defined as: (1) set $X=X\cup Y$, (2) delete $\beta$. The resulting decrease in $\mathcal{L}^{\mathcal{P}}(G,G')$ is given by:

(3.7) $$\begin{aligned}\Delta\mathcal{L}^{\mathcal{M}}_{X,Y}=\frac{1}{\mathcal{V}_{G}}\big[&\left(\mathcal{V}_{X}-g_{X}-g'_{X}\right)\log_{2}\mathcal{V}_{X}\\&+\left(\mathcal{V}_{Y}-g_{Y}-g'_{Y}\right)\log_{2}\mathcal{V}_{Y}\\&-\left(\mathcal{V}_{X\cup Y}-g_{X\cup Y}-g'_{X\cup Y}\right)\log_{2}\mathcal{V}_{X\cup Y}\\&+\left(g_{X}+g_{Y}-g_{X\cup Y}+g'_{X}+g'_{Y}-g'_{X\cup Y}\right)\log_{2}\mathcal{V}_{G}\big],\end{aligned}$$

where $\mathcal{M}$ denotes the $\mathcal{M}$erging operator, $\mathcal{V}_{X}$ is the volume of $X$ in $G$, $\mathcal{V}_{G}$ is the volume of $G$, and $g_{X}$ and $g^{\prime}_{X}$ are the cuts of $X$ in $G$ and $G^{\prime}$, respectively. For a node $\alpha\in\mathcal{T}$ with associated module $X$ and a vertex $v_{i}\in X$, the moving operator seeks a new node $\beta\in\mathcal{T}$ with associated module $Y$ and moves $v_{i}$ from $X$ to $Y$. The decrease in $\mathcal{L}^{\mathcal{P}}(G,G^{\prime})$ from removing $v_{i}$ from $X$ is given by:

$$
\begin{split}
\Delta\mathcal{L}^{\mathcal{R}}_{X,v_{i}}={}&\frac{\mathcal{V}_{X}-g_{X}-g^{\prime}_{X}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{X}}{\mathcal{V}_{G}}\\
&-\frac{\mathcal{V}_{X\backslash\{v_{i}\}}-g_{X\backslash\{v_{i}\}}-g^{\prime}_{X\backslash\{v_{i}\}}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{X\backslash\{v_{i}\}}}{\mathcal{V}_{G}},
\end{split}
\tag{3.8}
$$

where $\mathcal{R}$ denotes vertex $\mathcal{R}$emoving and $X\backslash\{v_{i}\}$ denotes removing vertex $v_{i}$ from $X$. The increase in $\mathcal{L}^{\mathcal{P}}(G,G^{\prime})$ from inserting $v_{i}$ into $Y$ is given by:

$$
\begin{split}
\Delta\mathcal{L}^{\mathcal{I}}_{Y,v_{i}}={}&-\frac{\mathcal{V}_{Y}-g_{Y}-g^{\prime}_{Y}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{Y}}{\mathcal{V}_{G}}\\
&+\frac{\mathcal{V}_{Y\cup\{v_{i}\}}-g_{Y\cup\{v_{i}\}}-g^{\prime}_{Y\cup\{v_{i}\}}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{Y\cup\{v_{i}\}}}{\mathcal{V}_{G}},
\end{split}
\tag{3.9}
$$

where $\mathcal{I}$ denotes vertex $\mathcal{I}$nserting and $Y\cup\{v_{i}\}$ denotes inserting $v_{i}$ into $Y$. We initialize $\mathcal{T}$ with a root node $\lambda$ and $n$ leaves, each associated with one vertex in $G$, and then sequentially apply the merging and moving operators until convergence. The optimization procedure is summarized in Algorithm 1.
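The three increments in Eqs. (3.7) to (3.9) can be evaluated directly from volumes and cuts. The sketch below is a minimal illustration, not the authors' implementation. It assumes dense, symmetric weight matrices `W` (for $G$) and `Wp` (for $G^{\prime}$) stored as nested lists, a graph without isolated vertices, and the convention that an empty module contributes zero. All function names are hypothetical.

```python
import math

def volume(W, S):
    """Volume of vertex set S in G: sum of the weighted degrees of its vertices."""
    return sum(sum(W[i]) for i in S)

def cut(W, S):
    """Cut of S: total weight of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(W[i][j] for i in S for j in range(len(W)) if j not in S)

def merge_delta(W, Wp, X, Y):
    """Decrease of the objective when modules X and Y are merged, Eq. (3.7)."""
    VG = volume(W, range(len(W)))
    XY = set(X) | set(Y)
    VX, VY, VXY = volume(W, X), volume(W, Y), volume(W, XY)
    gX, gY, gXY = cut(W, X), cut(W, Y), cut(W, XY)        # cuts in G
    hX, hY, hXY = cut(Wp, X), cut(Wp, Y), cut(Wp, XY)     # cuts in G'
    return ((VX - gX - hX) * math.log2(VX)
            + (VY - gY - hY) * math.log2(VY)
            - (VXY - gXY - hXY) * math.log2(VXY)
            + (gX + gY - gXY + hX + hY - hXY) * math.log2(VG)) / VG

def _module_term(W, Wp, S, VG):
    """(V_S - g_S - g'_S) / V_G * log2(V_S / V_G); assumed to be 0 for an empty module."""
    if not S:
        return 0.0
    VS = volume(W, S)
    return (VS - cut(W, S) - cut(Wp, S)) / VG * math.log2(VS / VG)

def remove_delta(W, Wp, X, v):
    """Decrease of the objective when vertex v is removed from X, Eq. (3.8)."""
    VG = volume(W, range(len(W)))
    return _module_term(W, Wp, set(X), VG) - _module_term(W, Wp, set(X) - {v}, VG)

def insert_delta(W, Wp, Y, v):
    """Increase of the objective when vertex v is inserted into Y, Eq. (3.9)."""
    VG = volume(W, range(len(W)))
    return _module_term(W, Wp, set(Y) | {v}, VG) - _module_term(W, Wp, set(Y), VG)

# Toy graph (an assumption for illustration): two disjoint unit-weight edges
# 0-1 and 2-3, with an empty constraint graph.
W = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
Wp = [[0] * 4 for _ in range(4)]
print(merge_delta(W, Wp, [0], [1]))   # gain of merging the endpoints of an edge
print(merge_delta(W, Wp, [0], [2]))   # gain of merging two unconnected vertices
```

In the moving stage, the net gain of moving $v_{i}$ from $X$ to $Y$ is `remove_delta(W, Wp, X, v) - insert_delta(W, Wp, Y, v)`, which is the quantity maximized over $Y$.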

Algorithm 1 2-d SSE minimization
Input:  $G=(V,E,\mathbf{W})$, $G^{\prime}=(V,E^{\prime},\mathbf{W}^{\prime})$
Output:  Encoding tree $\mathcal{T}$ and partitioning $\mathcal{P}$
1:  Initialize $\mathcal{T}$ containing all vertices as tree leaves
2:  // Merging stage
3:  repeat
4:     Merge the module pair $(X,Y)$ into $X\cup Y$, chosen as $\arg\max_{X,Y}\{\Delta\mathcal{L}^{\mathcal{M}}_{X,Y}\}$ via Eq. (3.7)
5:     Update $\Delta\mathcal{L}^{\mathcal{M}}$ for module pairs connected to $X$ or $Y$
6:  until $\Delta\mathcal{L}^{\mathcal{M}}<0$ for all module pairs
7:  // Moving stage
8:  repeat
9:     for each vertex $v_{i}\in V$ do
10:        Remove vertex $v_{i}$ from its original module $X$
11:        Insert vertex $v_{i}$ into the module $Y$ chosen as $\arg\max_{Y}\{\Delta\mathcal{L}^{\mathcal{R}}_{X,v_{i}}-\Delta\mathcal{L}^{\mathcal{I}}_{Y,v_{i}}\}$ via Eqs. (3.8) and (3.9)
12:     end for
13:  until $\mathcal{L}^{\mathcal{P}}(G,G^{\prime})$ converges

In both the merging stage and the moving stage, $\mathcal{L}^{\mathcal{P}}$ decreases after every iteration, and it converges when no improvement can be made. The time complexity of the merging stage is $O(n\log^{2}n)$ [15]. In the moving stage, each iteration requires calculating $\Delta\mathcal{L}^{\mathcal{R}}_{X,v_{i}}$ and $\Delta\mathcal{L}^{\mathcal{I}}_{Y,v_{i}}$ for every vertex $v_{i}$ and every possible module $Y$, which takes $O(nl)$ time. Taken together, the time complexity of Algorithm 1 is $O(n\log^{2}n+nlt)$, where $n$, $l$, and $t$ denote the numbers of vertices, modules, and iterations, respectively.
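A naive version of the merging stage of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of [15]: it recomputes every pairwise gain in each round instead of maintaining the incremental updates behind the $O(n\log^{2}n)$ bound, it stops when no merge has a gain above a small tolerance `eps`, and all names are hypothetical. It uses the observation that Eq. (3.7) equals $c(X)+c(Y)-c(X\cup Y)$ for the per-module cost $c(S)=\frac{1}{\mathcal{V}_{G}}\left[(\mathcal{V}_{S}-g_{S}-g^{\prime}_{S})\log_{2}\mathcal{V}_{S}+(g_{S}+g^{\prime}_{S})\log_{2}\mathcal{V}_{G}\right]$.

```python
import math
from itertools import combinations

def volume(W, S):
    """Sum of the weighted degrees of the vertices in S."""
    return sum(sum(W[i]) for i in S)

def cut(W, S):
    """Total weight of edges with exactly one endpoint in S."""
    return sum(W[i][j] for i in S for j in range(len(W)) if j not in S)

def module_cost(W, Wp, S, VG):
    """Per-module cost c(S); Eq. (3.7) equals c(X) + c(Y) - c(X u Y)."""
    VS = volume(W, S)
    g = cut(W, S) + cut(Wp, S)            # cut in G plus cut in G'
    return ((VS - g) * math.log2(VS) + g * math.log2(VG)) / VG

def greedy_merge(W, Wp, eps=1e-12):
    """Merge the best module pair repeatedly until no gain exceeds eps."""
    n = len(W)
    VG = volume(W, range(n))
    modules = [frozenset([i]) for i in range(n)]   # start from singletons
    while len(modules) > 1:
        best, pair = eps, None
        for X, Y in combinations(modules, 2):
            gain = (module_cost(W, Wp, X, VG) + module_cost(W, Wp, Y, VG)
                    - module_cost(W, Wp, X | Y, VG))
            if gain > best:
                best, pair = gain, (X, Y)
        if pair is None:                  # no merge improves the objective
            break
        X, Y = pair
        modules = [M for M in modules if M not in pair] + [X | Y]
    return modules

# Toy path graph 0-1-2-3 with a weak middle edge and no constraints (assumed data).
W = [[0, 1, 0, 0], [1, 0, 0.1, 0], [0, 0.1, 0, 1], [0, 0, 1, 0]]
Wp = [[0] * 4 for _ in range(4)]
print(sorted(sorted(m) for m in greedy_merge(W, Wp)))
```

On this toy graph, the two strongly connected pairs are merged and the weak middle edge is left as the cut between the two resulting modules.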

3.3 High-D SSE.

Hereafter, we generalize 2-d SSE into high-d SSE to perform semi-supervised hierarchical clustering. For a graph $G$ associated with data set $\mathcal{X}$ and constraints stored in a relation graph $G^{\prime}$, we aim to find an encoding tree of height $K>2$ that forms a semi-supervised hierarchical clustering of the vertices in $G$. Following the definition of 2-d SSE in Section 3.2, we define the optimization objective of high-d SSE as follows:

$$
\mathcal{L}^{\mathcal{T}}(G,G^{\prime})=\mathcal{H}^{\mathcal{T}}(G)+\phi\,\mathcal{E}^{\mathcal{T}}(G,G^{\prime}),
\tag{3.10}
$$

where $\mathcal{E}^{\mathcal{T}}(G,G^{\prime})$ is a penalty term for constraint violations, defined as:

$$
\mathcal{E}^{\mathcal{T}}(G,G^{\prime})=\sum_{\alpha\in\mathcal{T},\,1<|T(\alpha)|<|V|}-\frac{g^{\prime}_{\alpha}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{\alpha}}{\mathcal{V}_{\alpha^{-}}},
\tag{3.11}
$$

where $g^{\prime}_{\alpha}$ is the cut of $\alpha$ in $G^{\prime}$, $|T(\alpha)|$ is the number of vertices in the subset $T(\alpha)$ associated with $\alpha$, and the other notations have the same meaning as in Eq. (2.1). For each node in $\mathcal{T}$ other than the leaves, the penalty term penalizes the violation of must-link constraints and rewards the satisfaction of cannot-link constraints.
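To make the sign behavior of Eq. (3.11) concrete, the following sketch evaluates the penalty on a small hand-built tree. It is an illustration only: the tree encoding (a dictionary from node id to parent id and vertex set), the toy graph, and the function name are assumptions, and must-link and cannot-link constraints are assumed to be stored in `Wp` as positive and negative weights, respectively.

```python
import math

def penalty(W, Wp, tree):
    """Penalty term of Eq. (3.11).

    `tree` maps each node id to (parent id, set of vertices under the node);
    the root has parent None.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    VG = sum(deg)
    total = 0.0
    for node, (parent, verts) in tree.items():
        if not 1 < len(verts) < n:        # skip leaves and the root
            continue
        parent_verts = tree[parent][1]
        # cut of the node in G': constraint weight crossing its boundary
        g_prime = sum(Wp[i][j] for i in verts for j in range(n) if j not in verts)
        V_node = sum(deg[i] for i in verts)
        V_parent = sum(deg[i] for i in parent_verts)
        total += -g_prime / VG * math.log2(V_node / V_parent)
    return total

# Unit-weight path 0-1-2-3 split into modules {0,1} and {2,3} (assumed toy data).
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
tree = {
    "root": (None, {0, 1, 2, 3}),
    "A": ("root", {0, 1}),
    "B": ("root", {2, 3}),
    0: ("A", {0}), 1: ("A", {1}), 2: ("B", {2}), 3: ("B", {3}),
}
must = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]      # must-link (1, 2), violated
cannot = [[0, 0, 0, 0], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 0]]  # cannot-link (1, 2), satisfied
print(penalty(W, must, tree), penalty(W, cannot, tree))
```

The violated must-link yields a positive penalty and the satisfied cannot-link a negative one, matching the description above.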

Minimizing High-D SSE. We minimize high-d SSE via two operators, stretching and compressing, on the encoding tree $\mathcal{T}$ [19]. For a pair of sister nodes $(\alpha,\beta)\in\mathcal{T}$ whose parent is $\gamma$, node stretching is defined as inserting a new node $\delta$ between $\gamma$ and $(\alpha,\beta)$, i.e., (1) set $\alpha^{-}=\delta$, (2) set $\beta^{-}=\delta$, (3) set $\delta^{-}=\gamma$. The resulting decrease in $\mathcal{L}^{\mathcal{T}}(G,G^{\prime})$ is given by:

$$
\Delta\mathcal{L}^{\mathcal{S}}_{\alpha,\beta}=\frac{g_{\alpha}+g_{\beta}-g_{\delta}+g^{\prime}_{\alpha}+g^{\prime}_{\beta}-g^{\prime}_{\delta}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{\gamma}}{\mathcal{V}_{\delta}},
\tag{3.12}
$$

where $\mathcal{S}$ denotes node $\mathcal{S}$tretching. Applying stretching iteratively on the initial encoding tree results in a binary encoding tree $\mathcal{T}_{b}$. For a node $\alpha\in\mathcal{T}$ with children $\{\beta_{1},\dots,\beta_{m}\}$ and parent $\gamma$, node compressing is defined as: (1) remove node $\alpha$, (2) for each child node $\beta_{i}$ of $\alpha$, set $\beta_{i}^{-}=\gamma$. The resulting decrease in $\mathcal{L}^{\mathcal{T}}(G,G^{\prime})$ is given by:

$$
\Delta\mathcal{L}^{\mathcal{C}}_{\alpha}=\frac{\sum\limits_{i}g_{\beta_{i}}+\sum\limits_{|T(\beta_{i})|>1}g^{\prime}_{\beta_{i}}-g_{\alpha}-g^{\prime}_{\alpha}}{\mathcal{V}_{G}}\log_{2}\frac{\mathcal{V}_{\alpha}}{\mathcal{V}_{\gamma}},
\tag{3.13}
$$

where $\mathcal{C}$ denotes node $\mathcal{C}$ompressing. Applying compressing on the binary encoding tree results in a multinary encoding tree. By compressing until the height of the encoding tree is no larger than the required height $K$, we obtain the $K$-d encoding tree. We summarize this optimization procedure in Algorithm 2. For a graph $G$ with $n$ vertices and $m$ edges, the time complexity of Algorithm 2 is $O(h_{max}(m\log n+n))$, where $h_{max}$ is the height of $\mathcal{T}_{b}$.
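Eqs. (3.12) and (3.13) translate almost literally into code. The sketch below is a hedged illustration with hypothetical function names. It assumes nested-list weight matrices `W` and `Wp` and identifies each tree node with the set of vertices below it.

```python
import math

def volume(W, S):
    """Sum of the weighted degrees of the vertices in S."""
    return sum(sum(W[i]) for i in S)

def cut(W, S):
    """Total weight of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(W[i][j] for i in S for j in range(len(W)) if j not in S)

def stretch_delta(W, Wp, alpha, beta, gamma):
    """Eq. (3.12): alpha and beta are sister vertex sets, gamma is their parent's set."""
    VG = volume(W, range(len(W)))
    delta = set(alpha) | set(beta)        # vertex set of the inserted node
    num = (cut(W, alpha) + cut(W, beta) - cut(W, delta)
           + cut(Wp, alpha) + cut(Wp, beta) - cut(Wp, delta))
    return num / VG * math.log2(volume(W, gamma) / volume(W, delta))

def compress_delta(W, Wp, children, gamma):
    """Eq. (3.13): `children` lists the vertex sets of the removed node's children."""
    VG = volume(W, range(len(W)))
    alpha = set().union(*children)        # vertex set of the removed node
    num = (sum(cut(W, b) for b in children)
           + sum(cut(Wp, b) for b in children if len(b) > 1)  # non-leaf children only
           - cut(W, alpha) - cut(Wp, alpha))
    return num / VG * math.log2(volume(W, alpha) / volume(W, gamma))

# Unit-weight path 0-1-2-3 with an empty constraint graph (assumed toy data).
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
Wp = [[0] * 4 for _ in range(4)]
print(stretch_delta(W, Wp, [0], [1], range(4)))
print(compress_delta(W, Wp, [[0], [1]], range(4)))
```

On this unconstrained example, compressing the node created by a stretch exactly undoes the gain of that stretch, so the two printed values have equal magnitude and opposite sign.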

Algorithm 2 High-d SSE minimization
Input:  $G=(V,E,\mathbf{W})$, $G^{\prime}=(V,E^{\prime},\mathbf{W}^{\prime})$, height $K$
Output:  Binary tree $\mathcal{T}_{b}$ and height-$K$ tree $\mathcal{T}_{K}$
1:  Initialize $\mathcal{T}$ with a root node $\lambda$ and all vertices as tree leaves
2:  // Stretching stage
3:  repeat
4:     Stretch the node pair $\{\alpha,\beta\}$ chosen as $\arg\max_{\alpha,\beta}\{\Delta\mathcal{L}^{\mathcal{S}}_{\alpha,\beta}\}$ via Eq. (3.12)
5:     Update $\Delta\mathcal{L}^{\mathcal{S}}$ for node pairs connected to $\alpha$ or $\beta$
6:  until the root $\lambda$ has exactly two children, resulting in the binary tree $\mathcal{T}_{b}$
7:  // Compressing stage
8:  repeat
9:     Remove the tree node $\alpha\in\mathcal{T}$ chosen as $\arg\max_{\alpha}\{\Delta\mathcal{L}^{\mathcal{C}}_{\alpha}\}$ via Eq. (3.13)
10:  until the height of the encoding tree $\mathcal{T}$ is not larger than $K$, resulting in $\mathcal{T}_{K}$
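The stretching stage of Algorithm 2 (steps 3 to 6) can be sketched as a simple agglomeration over the children of the root. This is a naive illustration, not the optimized procedure of [19]: it rescans all pairs in every round, assumes that candidate pairs are always children of the root $\lambda$ (so that $\mathcal{V}_{\gamma}=\mathcal{V}_{G}$), represents subtrees as nested tuples, and omits the compressing stage. All names are hypothetical.

```python
import math
from itertools import combinations

def volume(W, S):
    """Sum of the weighted degrees of the vertices in S."""
    return sum(sum(W[i]) for i in S)

def cut(W, S):
    """Total weight of edges with exactly one endpoint in S."""
    return sum(W[i][j] for i in S for j in range(len(W)) if j not in S)

def stretch_to_binary(W, Wp):
    """Repeatedly stretch the best pair of root children until two remain."""
    n = len(W)
    VG = volume(W, range(n))

    def g(S):                              # cut in G plus cut in G'
        return cut(W, S) + cut(Wp, S)

    # Each root child is stored as (vertex set, subtree as nested tuples).
    children = [(frozenset([i]), i) for i in range(n)]
    while len(children) > 2:
        best, pair = -math.inf, None
        for a, b in combinations(children, 2):
            d = a[0] | b[0]                # vertex set of the new node delta
            gain = (g(a[0]) + g(b[0]) - g(d)) / VG * math.log2(VG / volume(W, d))  # Eq. (3.12)
            if gain > best:
                best, pair = gain, (a, b)
        a, b = pair
        children = [c for c in children if c is not a and c is not b]
        children.append((a[0] | b[0], (a[1], b[1])))
    return tuple(c[1] for c in children)

# Toy path graph 0-1-2-3 with a weak middle edge and no constraints (assumed data).
W = [[0, 1, 0, 0], [1, 0, 0.1, 0], [0, 0.1, 0, 1], [0, 0, 1, 0]]
Wp = [[0] * 4 for _ in range(4)]
print(stretch_to_binary(W, Wp))
```

A compressing stage would then remove internal nodes of this binary tree, using Eq. (3.13) to pick each node, until the height is not larger than $K$.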

4 Experiments

Our proposed SSE method is capable of tackling both semi-supervised partitioning clustering and hierarchical clustering. To evaluate both tasks, we design two groups of experiments, in which we compare SSE against established baselines for semi-supervised partitioning clustering (Section 4.1) and semi-supervised hierarchical clustering (Section 4.2).

4.1 Semi-Supervised Partitioning Clustering.

Table 1: Performance (%) of comparison methods for partitioning clustering on five clustering datasets. Bold: the best performance on each group of methods.

| Constraint | Method | Yale ARI↑ | Yale NMI↑ | ORL ARI↑ | ORL NMI↑ | COIL20 ARI↑ | COIL20 NMI↑ | Isolet ARI↑ | Isolet NMI↑ | OpticalDigits ARI↑ | OpticalDigits NMI↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| None | SE | 28.12 | 54.78 | 59.15 | 85.31 | 68.22 | 86.34 | 56.01 | 83.14 | 69.89 | 79.69 |
| Pairwise | PCPSNMF | 24.80 | 54.11 | 52.02 | 81.86 | 51.49 | 80.75 | 38.39 | 69.15 | 48.82 | 68.71 |
| Pairwise | OneStepPCP | 25.50 | 52.22 | 40.58 | 78.14 | 52.81 | 79.70 | 49.93 | 76.01 | 77.90 | 86.02 |
| Pairwise | CMS | 07.06 | 35.54 | 29.37 | 73.18 | 59.81 | 78.32 | 48.77 | 77.38 | 88.75 | 91.17 |
| Pairwise | SC-MPI | 32.76 | 59.82 | 49.28 | 82.29 | 59.89 | 82.83 | 47.16 | 72.75 | 52.39 | 71.35 |
| Pairwise | SSE (Ours) | 37.12 | 61.37 | 65.42 | 87.51 | 75.36 | 87.50 | 61.37 | 82.77 | 77.57 | 84.34 |
| Label | Seeded-KMeans | 25.21 | 52.06 | 46.35 | 78.56 | 67.59 | 81.40 | 66.48 | 81.13 | 73.52 | 77.89 |
| Label | S4NMF | 23.85 | 49.60 | 47.23 | 77.08 | 62.48 | 79.34 | 57.05 | 77.32 | 84.72 | 88.32 |
| Label | LpCNMF | 13.35 | 39.67 | 32.55 | 70.01 | 74.72 | 88.78 | 59.24 | 81.89 | 90.77 | 93.80 |
| Label | SC-MPI | 20.91 | 50.82 | 26.28 | 70.73 | 89.18 | 94.21 | 64.43 | 80.38 | 93.04 | 93.41 |
| Label | SSE (Ours) | 33.48 | 58.62 | 61.26 | 86.01 | 75.10 | 87.63 | 58.62 | 83.13 | 76.60 | 84.12 |

Table 2: Performance (%) of comparison methods for partitioning clustering on four RNA-seq datasets. Bold: the best performance on each group of methods.

| Constraint | Method | 10X PBMC ARI↑ | 10X PBMC NMI↑ | Mouse bladder ARI↑ | Mouse bladder NMI↑ | Worm neuron ARI↑ | Worm neuron NMI↑ | Human kidney ARI↑ | Human kidney NMI↑ |
|---|---|---|---|---|---|---|---|---|---|
| None | SE | 63.89 | 74.80 | 67.41 | 77.13 | 20.90 | 41.70 | 54.20 | 72.86 |
| Pairwise | PCPSNMF | 16.41 | 29.68 | 13.55 | 40.02 | 09.45 | 21.48 | 13.46 | 28.66 |
| Pairwise | OneStepPCP | 43.08 | 58.29 | 44.51 | 64.86 | 19.39 | 44.42 | 40.21 | 55.26 |
| Pairwise | CMS | 08.58 | 10.40 | 08.28 | 10.56 | 00.27 | 01.40 | 07.18 | 12.26 |
| Pairwise | SC-MPI | 20.24 | 30.57 | 18.70 | 40.88 | 08.98 | 17.03 | 18.12 | 31.78 |
| Pairwise | SSE (Ours) | 74.87 | 76.84 | 62.33 | 74.70 | 22.17 | 45.05 | 62.59 | 75.81 |
| Label | Seeded-KMeans | 67.56 | 71.07 | 38.40 | 62.63 | 07.07 | 34.24 | 17.19 | 39.74 |
| Label | S4NMF | 18.28 | 28.20 | 26.27 | 43.79 | 08.90 | 15.54 | 24.33 | 39.07 |
| Label | LpCNMF | 44.40 | 62.94 | 44.64 | 73.02 | 34.67 | 56.37 | 45.90 | 64.88 |
| Label | SC-MPI | 48.28 | 63.34 | 38.65 | 55.23 | 37.82 | 46.58 | 56.93 | 60.36 |
| Label | SSE (Ours) | 74.15 | 77.05 | 63.11 | 75.38 | 28.43 | 46.94 | 65.45 | 78.23 |

Table 3: Performance (%) of comparison methods for semi-supervised hierarchical clustering. Bold: the best performance on each group of methods.

| Method | Wine DP↑ | Wine ARI↑ | Wine NMI↑ | Heart DP↑ | Heart ARI↑ | Heart NMI↑ | Br. Cancer DP↑ | Br. Cancer ARI↑ | Br. Cancer NMI↑ | Australian DP↑ | Australian ARI↑ | Australian NMI↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE | 84.87 | 73.85 | 74.19 | 61.97 | 07.70 | 21.30 | 95.75 | 88.00 | 80.93 | 57.02 | 01.95 | 08.03 |
| SpecWRSC | 84.87 | 76.94 | 77.10 | 70.13 | 33.13 | 29.23 | 95.68 | 88.55 | 81.57 | 54.92 | -00.78 | 01.21 |
| COBRA | 86.50 | 81.26 | 80.07 | 64.12 | 26.07 | 20.92 | 92.23 | 82.38 | 72.41 | 66.55 | 32.03 | 24.61 |
| SemiMulticons | 90.52 | 82.99 | 82.69 | 69.29 | 28.44 | 26.91 | 92.68 | 82.77 | 73.79 | 72.13 | 39.47 | 33.83 |
| SSE (Ours) | 92.88 | 85.27 | 83.61 | 71.90 | 28.08 | 26.36 | 96.53 | 82.88 | 76.08 | 74.52 | 34.19 | 28.17 |


In this part, we evaluate the performance of SSE on semi-supervised partitioning clustering. We conduct experiments on five clustering datasets, including face image data (Yale and ORL), object image data (COIL20), spoken letter recognition data (Isolet), and handwritten digit data (OpticalDigits), following Bai et al. [1]; their sizes range from 165 to 5620. We also conduct experiments on four single-cell RNA-seq datasets, namely 10X PBMC, Mouse bladder, Worm neuron, and Human kidney, taken from Tian et al. [22]. We use the data preprocessed by the original authors, which contain 2100 randomly sampled cells and the top 2000 highly dispersed genes in each dataset. We adopt two metrics, Adjusted Rand Index (ARI) [10] and Normalized Mutual Information (NMI) [21], to evaluate partitioning clustering performance. All experiments are repeated 10 times.
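For readers unfamiliar with ARI, the following is a small standard-library sketch of the usual pair-counting formula. It is not the evaluation code used in the experiments: the function name is hypothetical, at least two samples are assumed, and the degenerate case in which both labelings are trivial is assumed to score 1. NMI is omitted because its normalization varies between references.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())            # pairs together in both
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())    # pairs together in truth
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())    # pairs together in prediction
    expected = sum_a * sum_b / comb(n, 2)   # expected index under random labeling
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case (assumed to score 1)
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# ARI ignores how clusters are named: swapping the two label ids changes nothing.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The values in Tables 1 to 3 appear to be reported as percentages, so a score returned by this function would be multiplied by 100 for comparison.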

Baselines. We compare SSE with a variety of baseline methods, including an unsupervised clustering method based on structural entropy minimization, three semi-supervised clustering methods with pairwise constraints, three semi-supervised clustering methods with label constraints, and one semi-supervised clustering method with both pairwise constraints and label constraints. The unsupervised clustering based on structural entropy minimization is optimized by the merging operator (SE [15]). For semi-supervised clustering methods with pairwise constraints, we consider pairwise constraint propagation-induced symmetric NMF (PCPSNMF [25]), jointly optimized pairwise constraint propagation and spectral clustering (OneStepPCP [11]), and constrained mean shift clustering (CMS [20]). For semi-supervised clustering methods with label constraints, we consider seeded semi-supervised KMeans (Seeded-KMeans [2]), self-supervised semi-supervised NMF (S4NMF [4]), and label propagation based constrained NMF (LpCNMF [17]). SC-MPI [1] is a semi-supervised spectral clustering method capable of dealing with different types of constraints. Since SSE and SC-MPI are capable of dealing with both pairwise constraints and label constraints, they are compared in both groups.

Implementation Details. We construct the graph $G$ from the given dataset $\mathcal{X}$ by calculating the similarity between data points, and then sparsify it into a $p$-nearest-neighbor graph by retaining the $p$ most significant edges from each node. For a given $\mathcal{X}$ with $n$ data points divided into $k$ clusters according to the ground truth, we empirically set $p$ to $\lfloor 20k/\log^{2}_{2}n\rfloor+1$, since the number of clusters obtained by minimizing $\mathcal{H}^{\mathcal{P}}$ is approximately $\Theta(p\log^{2}_{2}n)$ [15]. For the five clustering datasets, the similarity is defined by a Gaussian kernel with kernel width $\sigma=10$. For the four single-cell RNA-seq datasets, the similarity is defined as cosine similarity since the features of these datasets are sparse.
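The graph construction described above can be sketched as follows. Only the formula for $p$ is taken from the text. The exact kernel form $\exp(-\lVert x_{i}-x_{j}\rVert^{2}/(2\sigma^{2}))$, the symmetrization by union of neighbor lists, and all function names are assumptions made for illustration.

```python
import math

def choose_p(n, k):
    """p = floor(20k / (log2 n)^2) + 1, as stated in the text."""
    return math.floor(20 * k / math.log2(n) ** 2) + 1

def gaussian_similarity(X, sigma=10.0):
    """Dense similarity matrix; the kernel form exp(-d^2 / (2 sigma^2)) is assumed."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return W

def knn_sparsify(W, p):
    """Keep the p largest-weight edges of every node; symmetrize by union (assumed)."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: W[i][j], reverse=True)[:p]
        for j in nbrs:
            S[i][j] = S[j][i] = W[i][j]
    return S

# Hypothetical sizes: n = 1024 points in k = 10 clusters.
print(choose_p(1024, 10))

# Four 1-D points forming two tight pairs; with p = 1 only the pair edges survive.
X = [[0.0], [1.0], [10.0], [11.0]]
G = knn_sparsify(gaussian_similarity(X), p=1)
```

Symmetrizing by union means a node can end up with more than $p$ neighbors; taking the intersection instead would be an equally plausible reading of the text.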

We generate constraints using the ground truth class labels from the datasets. For experiments with pairwise constraints, we set the number of must-link constraints equal to the number of cannot-link constraints, both being $0.2n$. For experiments with label constraints, we set the number of positive constraints equal to the number of negative constraints, both being $0.1n$. The parameters $\gamma_M$ and $\gamma_C$ control the role of constraints; we define them following Bai et al. [1]. For a pair of data points $(x_i, x_j)$ with similarity $\mathbf{W}_{ij}$, we define $\gamma_M = \max(\mathbf{W}) - \mathbf{W}_{ij}$, where $\max(\mathbf{W})$ is the maximum similarity between all data points. The processes of constraint conversion, constraint transitivity, and entailment usually lead to more negative values than positive values in $G'$. To balance them, we define $\gamma_C = \rho(\min(\mathbf{W}) - \mathbf{W}_{ij})$, where $\rho$ is the ratio between the number of positive values and the number of negative values in $G'$, and $\min(\mathbf{W})$ is the minimum similarity between all data points. The parameter $\phi$ balances the importance of input data and constraints; it is empirically set to $\phi = 2$.
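As a small worked illustration of the two weights, the sketch below evaluates $\gamma_M$ and $\gamma_C$ for one pair from a dense similarity matrix. The function names are hypothetical, taking $\max(\mathbf{W})$ and $\min(\mathbf{W})$ over off-diagonal entries only is an assumption, and `balance_ratio` simply counts signs in a list of entries supplied by the caller, since the construction of $G'$ itself is defined elsewhere in the paper.

```python
def balance_ratio(entries):
    """rho: number of positive entries divided by number of negative entries.

    `entries` stands in for the values of the constraint graph G'
    (assumed to be passed in as a flat list).
    """
    positives = sum(1 for v in entries if v > 0)
    negatives = sum(1 for v in entries if v < 0)
    if negatives == 0:
        raise ValueError("rho is undefined without negative entries")
    return positives / negatives

def constraint_weights(W, i, j, rho):
    """Return (gamma_M, gamma_C) for the pair (i, j).

    gamma_M = max(W) - W[i][j]          (non-negative)
    gamma_C = rho * (min(W) - W[i][j])  (non-positive when rho >= 0)
    Max and min are taken over off-diagonal entries (an assumption).
    """
    n = len(W)
    off_diag = [W[a][b] for a in range(n) for b in range(n) if a != b]
    gamma_m = max(off_diag) - W[i][j]
    gamma_c = rho * (min(off_diag) - W[i][j])
    return gamma_m, gamma_c
```

Under these definitions a must-link pair that is already highly similar receives a small $\gamma_M$, while a cannot-link pair that is already dissimilar receives a $\gamma_C$ close to zero, so the constraints contribute most where they disagree with the data.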

Experimental Results. The experimental results on five clustering datasets are presented in Table 1. Three groups of methods, i.e., unsupervised clustering, semi-supervised clustering with pairwise constraints, and semi-supervised clustering with label constraints, are compared separately. SSE with pairwise constraints outperforms its unsupervised baseline SE on all datasets and outperforms baseline methods in the pairwise constraint group on four out of five datasets. SSE with label constraints outperforms SE on all datasets and outperforms baseline methods in the label constraint group on three out of five datasets.

The experimental results on four single-cell RNA-seq datasets are presented in Table 2. SSE with pairwise constraints outperforms SE on three out of four datasets and outperforms baseline methods in the pairwise constraint group on all datasets. SSE with label constraints outperforms SE on three out of four datasets and outperforms baseline methods in the label constraint group on three out of four datasets. In all, SSE effectively incorporates prior information in the forms of pairwise constraints and label constraints, and achieves high clustering accuracy on both clustering datasets and single-cell RNA-seq datasets.

4.2 Semi-Supervised Hierarchical Clustering.

In this part, we evaluate the performance of SSE on semi-supervised hierarchical clustering. Following Chierchia and Perret [6], we conduct experiments on four datasets downloaded from the LIBSVM webpage (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), whose sizes range from 175 to 690. We adopt three metrics for hierarchical clustering performance evaluation: Dendrogram Purity (DP) [9, 27], ARI [10], and NMI [21]. DP is a holistic measure of a cluster tree, defined as the weighted average purity of each node of the tree with respect to a ground truth labeling of the tree leaves. We take the cluster tree of SSE from the binary encoding tree $\mathcal{T}_b$. ARI and NMI require partitioning clustering results from the cluster trees. For SSE, we apply the compressing operator until the height of the encoding tree is two to obtain the partitioning clustering results. For other methods, we choose the largest tree nodes from the cluster tree as the partitioning clustering results. All experiments are repeated 10 times.
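To make the DP metric concrete, here is a minimal sketch using the pairwise formulation that is common in the hierarchical clustering literature: for every pair of leaves sharing a ground truth class, take the smallest subtree containing both and measure the fraction of its leaves that belong to that class, then average over all such pairs. Treating this as equivalent to the weighted per-node average described above is an assumption, and the nested-tuple tree encoding is purely illustrative.

```python
from itertools import combinations

def dendrogram_purity(tree, labels):
    """Dendrogram purity of a cluster tree.

    tree:   nested tuples whose leaves are integer point indices,
            e.g. ((0, 1), (2, 3)).
    labels: labels[i] is the ground truth class of point i.
    """
    total = 0.0
    pairs = 0

    def visit(node):
        nonlocal total, pairs
        if isinstance(node, int):
            return [node]
        child_leaves = [visit(child) for child in node]
        members = [x for group in child_leaves for x in group]
        # Two leaves have this node as least common ancestor exactly
        # when they come from different children of the node.
        for group_a, group_b in combinations(child_leaves, 2):
            for a in group_a:
                for b in group_b:
                    if labels[a] == labels[b]:
                        same = sum(1 for x in members if labels[x] == labels[a])
                        total += same / len(members)
                        pairs += 1
        return members

    visit(tree)
    return total / pairs if pairs else 0.0
```

A tree that separates the classes at its top split, such as `((0, 1), (2, 3))` with labels `[0, 0, 1, 1]`, scores 1.0, whereas `((0, 2), (1, 3))` with the same labels scores 0.5 because same-class points only meet at the root.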

Figure 2: Performance of SSE for semi-supervised partitioning clustering with different constraint amounts.
Figure 3: Performance of SSE for semi-supervised hierarchical clustering with different constraint amounts.

Baselines. We compare SSE with two unsupervised hierarchical clustering methods and two semi-supervised hierarchical clustering methods. For unsupervised hierarchical clustering methods, we consider structural entropy minimized by stretching operator and compressing operator (SE [19]) and sublinear time graph-based hierarchical clustering (SpecWRSC [13]). For semi-supervised hierarchical clustering methods, we consider merging-based active clustering (COBRA [23]) and closed pattern mining based semi-supervised consensus clustering (SemiMulticons [28]).

Implementation Details. We construct graph $G$ from the given dataset $\mathcal{X}$ by calculating cosine similarity between data points and then sparsify it into a 5-nearest-neighbor graph by retaining the 5 most significant edges from each node. We randomly generate $0.2n$ must-link constraints and $0.2n$ cannot-link constraints for all methods except COBRA, for which we generate $0.2n$ positive constraints due to its requirements. As in the partitioning experiments, we set $\gamma_M = \max(\mathbf{W}) - \mathbf{W}_{ij}$ and $\gamma_C = \rho(\min(\mathbf{W}) - \mathbf{W}_{ij})$. The parameter $\phi$ is set to 2.
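One simple way to draw such constraints from ground truth labels is rejection sampling over random pairs, sketched below. The function name, the fixed seed, and the retry cap are illustrative assumptions; the text does not specify the sampling procedure beyond it being random.

```python
import random

def sample_pairwise_constraints(labels, n_must, n_cannot, seed=0):
    """Sample distinct must-link and cannot-link pairs from class labels.

    A random pair with equal labels becomes a must-link constraint,
    a pair with different labels becomes a cannot-link constraint.
    Returns two sorted lists of (i, j) tuples with i < j.
    """
    rng = random.Random(seed)
    n = len(labels)
    must, cannot = set(), set()
    # Cap the number of draws so the loop ends even if a quota is unreachable.
    max_tries = 100 * (n_must + n_cannot) + 1000
    tries = 0
    while (len(must) < n_must or len(cannot) < n_cannot) and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(n), 2)
        pair = (min(i, j), max(i, j))
        if labels[i] == labels[j]:
            if len(must) < n_must:
                must.add(pair)
        elif len(cannot) < n_cannot:
            cannot.add(pair)
    return sorted(must), sorted(cannot)
```

With $n$ labelled points, the setting in the text would correspond to calling `sample_pairwise_constraints(labels, int(0.2 * n), int(0.2 * n))`.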

Experimental Results. The experimental results of semi-supervised hierarchical clustering are presented in Table 3. SSE achieves the highest DP values and outperforms SE in terms of DP on all datasets, indicating that the cluster trees of SSE have the highest holistic quality. The ARI and NMI values of SSE are comparable with baselines and higher than SE on three out of four datasets. In all, SSE achieves high clustering accuracy on semi-supervised hierarchical clustering.

4.3 Sensitivity Analysis.

The number of constraints has a great impact on the performance of semi-supervised clustering, and a larger number of constraints is usually expected to lead to better performance. We evaluate the performance of SSE for partitioning clustering with different amounts of pairwise constraints, as shown in Figure 2. The ARI and NMI values are generally larger with more constraints, except for OpticalDigits. For this dataset, SSE produces too many clusters when the number of constraints exceeds $0.6n$, leading to poor performance. The cause of this problem is that the merging stage in Algorithm 1 stops earlier than expected, which calls for a better optimization algorithm. We also evaluate the performance of SSE for hierarchical clustering with different amounts of pairwise constraints, as shown in Figure 3. The growth of DP values is barely visible, since the DP values are very high for all constraint amounts. The ARI and NMI values increase substantially as more constraints are provided. In all, SSE performs better with more constraints under most circumstances.

5 Related Work

Semi-supervised clustering methods incorporate prior information into the process of clustering to enhance clustering quality and better align with user preferences, and have attracted great interest in recent years. Prior information can take different forms of constraints, among which pairwise constraints and label constraints are the most commonly used. Pairwise constraints indicate whether a pair of data points should be in the same cluster or not [18]. Many methods that incorporate pairwise constraints have been proposed, such as semi-supervised spectral clustering [1], semi-supervised NMF clustering [25], and semi-supervised density peak clustering [20]. Label constraints reveal class labels of some data points, specifying whether they belong to certain classes or not. These constraints can be used through label propagation [17] or by penalizing violated data points [4].

6 Conclusion

In this paper, we propose SSE, a novel and more general semi-supervised clustering method that can integrate different types of constraints. We give a uniform formulation of pairwise constraints and label constraints and make them both compatible with SSE. Moreover, SSE can perform both semi-supervised partitioning clustering and hierarchical clustering, thanks to the structural entropy measure on which it is based. We conduct extensive experiments on nine clustering datasets and compare SSE with eleven baselines, demonstrating the superiority of SSE in clustering accuracy. We also apply SSE to four single-cell RNA-seq datasets for cell clustering, demonstrating its functionality in biological data analysis. Future work on SSE may focus on better optimization algorithms.

Acknowledgments

The corresponding author is Hao Peng. This work is supported by National Key R&D Program of China through grant 2021YFB1714800, NSFC through grants 61932002 and 62322202, Beijing Natural Science Foundation through grant 4222030, CCF-DiDi GAIA Collaborative Research Funds for Young Scholars, and the Fundamental Research Funds for the Central Universities.

References

  • [1] L. Bai, J. Liang, and F. Cao, Semi-supervised clustering with constraints of different types from multiple information sources, IEEE TPAMI, 43 (2020), pp. 3247–3258.
  • [2] S. Basu, A. Banerjee, and R. J. Mooney, Semi-supervised clustering by seeding, in ICML, 2002, pp. 19–26.
  • [3] Y. Cao, H. Peng, Z. Yu, and P. S. Yu, Hierarchical and incremental structural entropy minimization for unsupervised social event detection, in AAAI, 2024, pp. 1–10.
  • [4] J. Chavoshinejad, S. A. Seyedi, F. A. Tab, and N. Salahian, Self-supervised semi-supervised nonnegative matrix factorization for data clustering, Pattern Recognition, 137 (2023), p. 109282.
  • [5] L. Chen and S. Li, Incorporating cell hierarchy to decipher the functional diversity of single cells, Nucleic Acids Research, 51 (2022), pp. e9–e9.
  • [6] G. Chierchia and B. Perret, Ultrametric fitting by gradient descent, in NeurIPS, vol. 32, 2019.
  • [7] I. Davidson and S. Ravi, Agglomerative hierarchical clustering with constraints: Theoretical and empirical results, in ECML PKDD, Springer, 2005, pp. 59–70.
  • [8] G. Gan, C. Ma, and J. Wu, Data clustering: theory, algorithms, and applications, SIAM, 2020.
  • [9] K. A. Heller and Z. Ghahramani, Bayesian hierarchical clustering, in ICML, 2005, pp. 297–304.
  • [10] L. Hubert and P. Arabie, Comparing partitions, Journal of classification, 2 (1985), pp. 193–218.
  • [11] Y. Jia, W. Wu, R. Wang, J. Hou, and S. Kwong, Joint optimization for pairwise constraint propagation, IEEE TNNLS, 32 (2020), pp. 3168–3180.
  • [12] Z. Jiang, Y. Zhan, Q. Mao, and Y. Du, Semi-supervised clustering under a “compact-cluster” assumption, IEEE TKDE, 35 (2022), pp. 5244–5256.
  • [13] M. Kapralov, A. Kumar, S. Lattanzi, and A. Mousavifar, Learning hierarchical cluster structure of graphs in sublinear time, in SODA, SIAM, 2023, pp. 925–939.
  • [14] L. Lan, T. Liu, X. Zhang, C. Xu, and Z. Luo, Label propagated nonnegative matrix factorization for clustering, IEEE TKDE, 34 (2020), pp. 340–351.
  • [15] A. Li and Y. Pan, Structural information and dynamical complexity of networks, IEEE Transactions on Information Theory, 62 (2016), pp. 3290–3339.
  • [16] J. Lipor and L. Balzano, Leveraging union of subspace structure to improve constrained clustering, in ICML, PMLR, 2017, pp. 2130–2139.
  • [17] J. Liu, Y. Wang, J. Ma, D. Han, and Y. Huang, Constrained nonnegative matrix factorization based on label propagation for data representation, IEEE TAI, (2023).
  • [18] F. Nie, H. Zhang, R. Wang, and X. Li, Semi-supervised clustering via pairwise constrained optimal graph, in IJCAI, 2021, pp. 3160–3166.
  • [19] Y. Pan, F. Zheng, and B. Fan, An information-theoretic perspective of hierarchical clustering, arXiv preprint arXiv:2108.06036, (2021).
  • [20] M. Schier, C. Reinders, and B. Rosenhahn, Constrained mean shift clustering, in SDM, SIAM, 2022, pp. 235–243.
  • [21] A. Strehl and J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, JMLR, 3 (2002), pp. 583–617.
  • [22] T. Tian, J. Zhang, X. Lin, Z. Wei, and H. Hakonarson, Model-based deep embedding for constrained clustering analysis of single cell rna-seq data, Nature communications, 12 (2021), p. 1873.
  • [23] T. Van Craenendonck, S. Dumančic, and H. Blockeel, Cobra: a fast and simple method for active clustering with pairwise constraints, in IJCAI, 2017, pp. 2871–2877.
  • [24] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al., Constrained k-means clustering with background knowledge, in ICML, vol. 1, 2001, pp. 577–584.
  • [25] W. Wu, Y. Jia, S. Kwong, and J. Hou, Pairwise constraint propagation-induced symmetric nonnegative matrix factorization, IEEE TNNLS, 29 (2018), pp. 6348–6361.
  • [26] W. Xiao, Y. Yang, H. Wang, T. Li, and H. Xing, Semi-supervised hierarchical clustering ensemble and its application, Neurocomputing, 173 (2016), pp. 1362–1376.
  • [27] N. Yadav, A. Kobren, N. Monath, and A. McCallum, Supervised hierarchical clustering with exponential linkage, in ICML, PMLR, 2019, pp. 6973–6983.
  • [28] T. Yang, N. Pasquier, and F. Precioso, Semi-supervised consensus clustering based on closed patterns, Knowledge-Based Systems, 235 (2022), p. 107599.
  • [29] X. Zeng, H. Peng, and A. Li, Effective and stable role-based multi-agent collaboration by structural information principles, in AAAI, 2023.
  • [30] X. Zeng, H. Peng, and A. Li, Adversarial socialbots modeling based on structural information principles, in AAAI, 2024, pp. 1–10.
  • [31] L. Zheng and T. Li, Semi-supervised hierarchical clustering, in IEEE ICDM, 2011, pp. 982–991.
  • [32] D. Zou, S. Wang, X. Li, H. Peng, Y. Wang, C. Liu, K. Sheng, and B. Zhang, Multispans: A multi-range spatial-temporal transformer network for traffic forecast via structural entropy optimization, in ACM WSDM, 2024, pp. 1–10.