CDC: A Simple Framework for Complex Data Clustering

Z. Kang, X. Xie, B. Li, and E. Pan are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.

Zhao Kang, Xuanting Xie, Bingheng Li and Erlin Pan

Abstract

In today’s data-driven digital era, both the amount and the complexity of collected data (e.g., multi-view, non-Euclidean, and multi-relational data) are growing exponentially or even faster. Clustering, which extracts valid knowledge from data without supervision, is extremely useful in practice. However, existing methods are independently developed to handle one particular challenge at the expense of the others. In this work, we propose a simple but effective framework for complex data clustering (CDC) that can efficiently process different types of data with linear complexity. We first utilize graph filtering to fuse geometric structure and attribute information. We then reduce the complexity with high-quality anchors that are adaptively learned via a novel similarity-preserving regularizer. We illustrate the cluster-ability of our proposed method theoretically and experimentally. In particular, we deploy CDC to graph data of size 111M.

Index Terms:

Anchor graph, clustering, large-scale data, topology structure, multiview learning

I Introduction

Clustering is a fundamental technique for unsupervised learning that groups data points into different clusters without labels. It is driven by diverse applications in scientific research and industrial development, which induce complex data types [1], such as multi-view, non-Euclidean, and multi-relational. Specifically, in many real-world applications, data are often gathered from multiple sources or different extractors and therefore exhibit different features, dubbed as multi-view data [2]. Despite the fact that each view may be noisy and incomplete, important factors, such as geometry and semantics, tend to be shared across all views. Different views also provide complementary information, so it is paramount for multi-view clustering (MVC) methods to integrate diverse features. For example, [3] learns a consensus graph with a rank constraint on the corresponding Laplacian matrix from multiple views for clustering. [4] employs intra-view collaborative learning to harvest complementary and consistent information among different views.

Along with the development of sophisticated data collection and storage techniques, the size of data increases explosively. To handle large-scale data efficiently, several MVC methods with linear complexity have been proposed. [5] learns an anchor graph for each view and concatenates them for multi-view subspace clustering. [6], [7], and [8] construct bipartite graphs or learn representations based on anchors. [9] effectively integrates the collaborative information from multiple views by jointly learning discrete representations and binary cluster structures. Despite this progress, these methods often perform unstably across datasets because of the randomness in anchor selection.

Recently, non-Euclidean graph data have become pervasive since they contain not only node attributes but also topological structure information, which characterizes relations among data points [10]. Social network users, for example, have their own profiles and social relationships reflected by the topological graph. Traditional clustering methods exploit either attributes or graph structure and therefore cannot achieve the best performance [11]. The Graph Neural Network (GNN) is a powerful tool to simultaneously explore node attributes and structure [12]. Based on it, several graph clustering methods have been designed [13, 14, 15]. In some applications, the graph could exhibit multi-view attributes or multiple relations. To cluster multi-view graphs, [16] learns a representation for each view and forces them to be close. To handle multi-relational graphs, [17] finds the most informative graph to recover multiple graphs.

Despite the remarkable success of GNN-based methods in graph clustering, one crucial question remains, i.e., scalability, which prevents their deployment to web-scale graph data. For example, ogbn-papers100M contains more than 100M nodes, which cannot be processed by most graph clustering methods. Although [18, 19] have made advances in scalable graph clustering by applying a light-weight encoder and contrastive learning, their performance highly depends on graph augmentation. Therefore, scalability for graph clustering is still under-explored, and more dedicated efforts are pressingly needed.

We can see that specialized methods have been developed to address each of the above problems in isolation, but a unified model for complex data clustering that generalizes well while remaining scalable is still lacking. To fill this gap, we propose a simple yet effective framework for Complex Data Clustering (CDC). We first use graph filtering to fuse raw features and topology information, which produces clustering-favorable representations and provides a flexible way to handle different types of data. Rather than constructing a complete graph from all data points, CDC learns anchor graphs, resulting in linear computation complexity. In particular, we generate anchors adaptively with a similarity-preserving regularizer to alleviate the randomness of anchor selection. To summarize, we make the following contributions:

• We propose a simple clustering framework for complex data, e.g., single-view and multi-view, graph and non-graph, small-scale and large-scale data. Our method has linear time and space complexity.

• (Section III) We are the first to propose a similarity-preserving regularizer to automatically learn high-quality anchors from data.

• (Section V) CDC achieves impressive performance on 14 complex datasets. Most notably, it scales to a graph with more than 111M nodes.

II Related Work

II-A Multi-view Clustering

MVC methods generally focus on enhancing performance by utilizing the global consensus and complementary information among multiple views. [20] mines view-shared information via adding a sample-level contrastive module to align angles between representations. [21] uses Hilbert Schmidt Independence Criterion (HSIC) to explore underlying cluster structure shared by multiple views. [22] generates an automatic partitioning with data of multiple views via a multi-objective clustering framework. [23] achieves cross-view consensus by projecting data points into a space where geometric and cluster assignments are consistent.
Different from shallow methods, deep MVC methods learn good representations via designed neural networks. [24] applies attention encoder and multi-view mutual information maximization to capture the complementary information, consistency information, and internal relations of each view. Recently, some methods have combined the contrastive learning mechanisms to obtain clustering-favorable representations. For example, [25] performs instance-level and category-level contrastive learning to improve cross-view consistency. However, these methods are not scalable to large-scale data. To reduce the complexity, [5, 26] construct bipartite graphs between cluster centroids of $K$ -means and raw data points, where the anchors are chosen randomly and fixed for subsequent learning. [27] leverages features, anchors, and neighbors jointly to construct bipartite graphs. [28] captures the view-specific and consistent information by constructing a consensus graph from view-independent anchors. Although they all have linear complexity, their performance could be sub-optimal since the pre-defined anchors are not updated according to the downstream task. Differently, we generate high-quality anchors adaptively, which is efficient and stable on complex data.

II-B Graph Clustering

Graph clustering methods aim to group nodes based on node attributes and topological structure information. Some representation learning methods, such as Node2vec [29] and GAE [30], can be used to learn embeddings for traditional clustering techniques. However, the obtained embeddings might not be appropriate for clustering because they are not learned specifically for cluster-ability. MVGRL [31], BGRL [32], and GRACE [33] obtain classification-favorable representations via contrastive graph learning, but are not applicable to large-scale graphs due to the computation cost of data augmentation. Although MCGC [11] is augmentation-free by regarding k-nearest neighbors as positive pairs, the algorithm has quadratic complexity. Other deep graph clustering methods, like SDCN [34], DFCN [35], and DCRN [36], achieve promising performance by training MLPs and GNNs jointly on small- and medium-scale graphs. MvAGC [37] has low complexity, but is not efficient owing to its anchor sampling strategy. Hence, these graph clustering methods cannot effectively and efficiently handle large-scale graphs. Though S$^{3}$GC [18] obtains promising results on extra-large-scale graphs via a random walk-based sampler and a light-weight encoder, its training is computationally expensive. Our method can handle graph clustering in linear time with promising performance.

III Methodology

Notation

Define the generic data as $\mathcal{G}=\{\mathcal{V},E_{1},...,E_{V_{1}},X^{1},...,X^{V_{2}}\}$ , where $\mathcal{V}$ represents the set of $N$ nodes, $e_{ij}\in E_{v}$ denotes the relationship between node $i$ and node $j$ in the $v$ -th view. $V_{1}\geq 0$ and $V_{2}>0$ are the number of relational graphs and attributes, and the data is non-graph when initial $V_{1}=0$ . $X^{v}=\{x_{1}^{v},...,x_{N}^{v}\}^{\top}\in\mathbb{R}^{N\times d_{v}}$ is the feature matrix, $d_{v}$ is the dimension of features. Adjacency matrices $\{\widetilde{A}^{v}\}^{V_{1}}_{v=1}\in\mathbb{R}^{N\times N}$ characterize the initial graph structure. For non-graph data, we construct adjacency matrices in each view via the 5-nearest neighbor method. There are $V$ views after graph filtering for each dataset, where $V=V_{1}\times V_{2}$ for graph data and $V=V_{1}=V_{2}$ for non-graph data. $\{D^{v}\}^{V_{1}}_{v=1}$ represent the degree matrices in various views. The normalized adjacency matrix is $A^{v}=(D^{v})^{-\frac{1}{2}}(\widetilde{A}^{v}+I)(D^{v})^{-\frac{1}{2}}$ and the corresponding graph Laplacian is $L^{v}=I-A^{v}$ .
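The graph construction described in the notation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the nearest-neighbor search is brute force (suitable for small $N$ only), the neighbor graph is symmetrized, and $D^{v}$ is assumed to be the degree matrix of $\widetilde{A}^{v}+I$.

```python
import numpy as np
import scipy.sparse as sp


def knn_adjacency(X, n_neighbors=5):
    """Symmetric 0/1 k-nearest-neighbor graph (hypothetical helper, brute force)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)                        # a node is not its own neighbor
    # indices of the n_neighbors smallest distances in every row
    nbrs = np.argpartition(dist, n_neighbors, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    A = sp.csr_matrix((np.ones(rows.size), (rows, nbrs.ravel())), shape=(n, n))
    return A.maximum(A.T)                                 # symmetrization is an assumption


def normalize_adjacency(A_tilde):
    """A = D^{-1/2} (A_tilde + I) D^{-1/2}, with D taken as the degree of A_tilde + I."""
    n = A_tilde.shape[0]
    A_hat = (A_tilde + sp.identity(n, format="csr")).tocsr()
    deg = np.asarray(A_hat.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return (d_inv_sqrt @ A_hat @ d_inv_sqrt).tocsr()
```

For large datasets an approximate nearest-neighbor index would replace the dense distance matrix; the normalization step stays the same.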

III-A Graph Filtering

Filtered features are more clustering-favorable [38], and we apply graph filtering to remove undesirable high-frequency noise while preserving the graph’s geometric features. Similar to [11], smoothed $H$ is obtained by solving the following optimization problem:

$$\min_{H}\|H-X\|_{F}^{2}+\frac{1}{2}\operatorname{Tr}\left(H^{\top}LH\right). \tag{1}$$

We keep the first-order Taylor series of $H$ from Eq. (1) and apply $k$ -order filtering, which yields:

$$H=\left(I-\frac{1}{2}L\right)^{k}X, \tag{2}$$

where $k$ is a non-negative integer and it controls the depth of feature aggregation and smoothness of representation. In addition to learning smooth features, graph filtering is also used to unify different types of data into our framework.
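A minimal sketch of the $k$-order filter in Eq. (2), assuming a sparse normalized adjacency matrix $A$ is already available. The helper name `graph_filter` and the toy path graph are illustrative only. Since $L=I-A$, the filter $I-\frac{1}{2}L$ equals $\frac{A+I}{2}$, and applying it $k$ times needs only sparse-dense products.

```python
import numpy as np
import scipy.sparse as sp


def graph_filter(A, X, k=2):
    """Return H = (I - L/2)^k X with L = I - A, i.e. ((A + I) / 2)^k X."""
    n = A.shape[0]
    F = 0.5 * (A + sp.identity(n, format="csr"))   # one low-pass filtering step
    H = np.asarray(X, dtype=float)
    for _ in range(k):
        H = F @ H                                  # cost proportional to nnz(A) * d
    return H


# Toy example: a 4-node path graph with self-loops and symmetric normalization.
A_tilde = sp.csr_matrix(np.array([[0, 1, 0, 0],
                                  [1, 0, 1, 0],
                                  [0, 1, 0, 1],
                                  [0, 0, 1, 0]], dtype=float))
A_hat = A_tilde + sp.identity(4, format="csr")
deg = np.asarray(A_hat.sum(axis=1)).ravel()
A = sp.diags(deg ** -0.5) @ A_hat @ sp.diags(deg ** -0.5)
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
H = graph_filter(A, X, k=2)
```

Iterating the product instead of forming the matrix power keeps the filter sparse, which is what makes the step linear in the number of edges.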

III-B Anchor Graph Learning

We use the idea of data self-expression to capture the relations among data points, i.e., each sample can be represented as a linear combination of other data points. The combination coefficient matrix can be regarded as a reconstructed graph [11]. To reduce the computation complexity, $m$ representative samples $B\in\mathbb{R}^{m\times d}$, called anchors, are selected to construct the anchor graph $Z\in\mathbb{R}^{m\times N}$ [39]. However, the performance of this approach is unstable since it introduces anchors in a probabilistic way. Moreover, once the anchors are chosen, they are never updated, which could lead to sub-optimal performance. To remove the uncertainty in anchor selection, we propose to learn anchors from data, i.e., anchors $B$ are generated adaptively. To guarantee the quality of anchors, we enforce that the similarity between $B$ and $H$ is preserved, i.e., $BH^{\top}=Z$. Then we formalize the graph learning problem as:

$$\min_{Z,B}\left\|H^{\top}-B^{\top}Z\right\|_{F}^{2}+\beta\left\|Z\right\|_{F}^{2}\text{, s.t. }BH^{\top}=Z, \tag{3}$$

where $\beta$ is a balance parameter. To make it easy to solve, we relax the above problem to:

$$\min_{Z,B}\left\|H^{\top}-B^{\top}Z\right\|_{F}^{2}+\beta\left\|Z\right\|_{F}^{2}+\alpha\left\|BH^{\top}-Z\right\|_{F}^{2}. \tag{4}$$

It has two advantages over other anchor-based methods: efficient and adaptive generation of high-quality anchors. First, existing methods often repeat many times to reduce the uncertainty in results, which is time-consuming and is not suitable for large-scale data. Second, existing methods perform anchor selection and graph learning in two separate steps. By contrast, we follow a joint learning approach, where anchors and anchor graphs will be mutually boosted by each other.
For a multi-view scenario, each view could contribute differently. Therefore, we introduce learnable weights $\{\lambda_{v}\}_{v=1}^{V}$ and achieve a consensus anchor graph $Z$ by solving the following model:

$$\begin{aligned}
\min_{Z,\{B^{v}\}_{v=1}^{V},\{\lambda_{v}\}_{v=1}^{V}}\ &\sum_{v=1}^{V}\lambda_{v}^{2}\left(\left\|H^{v}{}^{\top}-B^{v}{}^{\top}Z\right\|_{F}^{2}+\alpha\left\|B^{v}H^{v}{}^{\top}-Z\right\|_{F}^{2}\right)+\beta\left\|Z\right\|_{F}^{2},\\
\text{s.t. }\ &\sum_{v=1}^{V}\lambda_{v}=1,\ \lambda_{v}>0.
\end{aligned} \tag{5}$$

Note that we learn anchors for each view to capture distinctive information. After constructing the anchor graph $Z$, traditional anchor-based methods use $Z^{\top}\Delta Z$ as input to obtain the spectral embedding for clustering, where $\Delta$ is a diagonal matrix with $\Delta_{ii}=\sum_{j=1}^{N}Z_{ij}$. According to [40], the right singular vectors of $Z$ are the same as the eigenvectors of $Z^{\top}\Delta Z$. Consequently, we perform singular value decomposition (SVD) on $Z$ and then run $K$-means on the right singular vectors to produce the final result, which needs $\mathcal{O}(m^{2}N)$ instead of $\mathcal{O}(N^{3})$.
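The final clustering step can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the number of retained singular vectors is set to the cluster number $c$, and the small Lloyd-style $K$-means with farthest-first initialization stands in for whichever $K$-means implementation is actually used.

```python
import numpy as np


def spectral_embedding(Z, c):
    """Top-c right singular vectors of the m x N anchor graph Z, as an N x c matrix."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)   # Vt is m x N; cost O(m^2 N)
    return Vt[:c].T


def kmeans(E, c, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-first initialization."""
    centers = [E[0]]
    for _ in range(1, c):
        d2 = np.min([((E - ctr) ** 2).sum(axis=1) for ctr in centers], axis=0)
        centers.append(E[d2.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(E.shape[0], dtype=int)
    for _ in range(n_iter):
        d2 = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(c):
            if np.any(labels == j):                    # keep the old center if a cluster empties
                centers[j] = E[labels == j].mean(axis=0)
    return labels


def cluster_from_anchor_graph(Z, c):
    """SVD on Z followed by K-means on the right singular vectors."""
    return kmeans(spectral_embedding(Z, c), c)
```

Because the SVD is taken on the $m\times N$ matrix rather than on an $N\times N$ affinity, the cost grows linearly with $N$ for a fixed number of anchors.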

III-C Optimization

To solve Eq. (5), we use an alternating optimization strategy.

III-C1 Initialization of $B^{v}$

We could optionally initialize $B^{v}$ with the cluster centers by dividing $H^{v}$ into $m$ partitions with a $K$ -means algorithm.

III-C2 Update $Z$

By fixing $\{B^{v}\}_{v=1}^{V}$ and $\{\lambda_{v}\}_{v=1}^{V}$ and setting the derivative of the objective function with respect to $Z$ to zero, we have:

$$Z=(1+\alpha)\left[\beta I_{m}+\sum_{v=1}^{V}\lambda_{v}^{2}\left(B^{v}B^{v}{}^{\top}+\alpha I_{m}\right)\right]^{-1}\left(\sum_{v=1}^{V}\lambda_{v}^{2}B^{v}H^{v}{}^{\top}\right). \tag{6}$$

For single-view scenario, the solution is $Z=(1+\alpha)(BB^{\top}+(\alpha+\beta)I_{m})^{-1}(BH^{\top})$ .

III-C3 Update $\{B^{v}\}_{v=1}^{V}$

By fixing $Z$ and $\{\lambda_{v}\}_{v=1}^{V}$ , Eq. (5) can be rewritten as:

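$$RB^{v}+\alpha B^{v}T^{v}=C^{v}, \tag{7}$$

where $R=ZZ^{\top}\in\mathbb{R}^{m\times m}$, $T^{v}=H^{v}{}^{\top}H^{v}\in\mathbb{R}^{d_{v}\times d_{v}}$, and $C^{v}=(1+\alpha)ZH^{v}\in\mathbb{R}^{m\times d_{v}}$. Eq. (7) follows from setting the derivative of Eq. (5) with respect to $B^{v}$ to zero, which gives $ZZ^{\top}B^{v}-ZH^{v}+\alpha(B^{v}H^{v}{}^{\top}H^{v}-ZH^{v})=0$. We then obtain $B^{v}$ by solving this Sylvester equation.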

III-C4 Update $\{\lambda_{v}\}_{v=1}^{V}$

Fixing $Z$ and $\{B^{v}\}_{v=1}^{V}$ , we let $M_{v}=\left\|H^{v}{}^{\top}-B^{v}{}^{\top}Z\right\|_{F}^{2}+\alpha\|B^{v}H^{v}% {}^{\top}-Z\|^{2}_{F}$ . Then the problem is simplified as:

$$\min_{\{\lambda_{v}\}_{v=1}^{V}}\sum_{v=1}^{V}\lambda_{v}^{2}M_{v},\quad\text{s.t. }\sum_{v=1}^{V}\lambda_{v}=1,\ \lambda_{v}>0. \tag{8}$$

This is a standard quadratic programming problem, which yields: $\lambda_{v}=\frac{\frac{1}{M_{v}}}{\sum_{p=1}^{V}\frac{1}{M_{p}}}.$
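The closed form can be verified with a Lagrange multiplier for the sum constraint. The multiplier $\eta$ and the function $\mathcal{J}$ are symbols introduced here only for this derivation, and every $M_{v}$ is assumed to be strictly positive:

```latex
\begin{aligned}
\mathcal{J}(\lambda,\eta) &= \sum_{v=1}^{V}\lambda_{v}^{2}M_{v}-\eta\Big(\sum_{v=1}^{V}\lambda_{v}-1\Big),\\
\frac{\partial\mathcal{J}}{\partial\lambda_{v}} &= 2\lambda_{v}M_{v}-\eta=0
\;\Longrightarrow\;\lambda_{v}=\frac{\eta}{2M_{v}},\\
\sum_{v=1}^{V}\lambda_{v}=1 &\;\Longrightarrow\;\frac{\eta}{2}=\Big(\sum_{p=1}^{V}\frac{1}{M_{p}}\Big)^{-1}
\;\Longrightarrow\;\lambda_{v}=\frac{1/M_{v}}{\sum_{p=1}^{V}1/M_{p}}.
\end{aligned}
```

Each resulting $\lambda_{v}$ is positive, so the inequality constraint is inactive, and views with a smaller residual $M_{v}$ receive a larger weight.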

Comment The optimization procedure will monotonically decrease the objective function value in Eq. (5) in each iteration. Since the objective function has a lower bound, such as zero, the above iteration converges.
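The three updates can be put together in a short loop. The sketch below is an illustration under assumptions rather than the authors' code: the names `cdc_fit` and `view_loss` are hypothetical, anchors are initialized from randomly chosen rows instead of $K$-means centers, a fixed number of iterations replaces a convergence test, and the coefficient on $B^{v}T^{v}$ in the Sylvester equation is taken as $\alpha$, which is what setting the gradient of Eq. (5) with respect to $B^{v}$ to zero gives.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def view_loss(H, B, Z, alpha):
    """M_v = ||H^T - B^T Z||_F^2 + alpha * ||B H^T - Z||_F^2 for one view."""
    return np.sum((H.T - B.T @ Z) ** 2) + alpha * np.sum((B @ H.T - Z) ** 2)


def cdc_fit(H_list, m, alpha=1.0, beta=1.0, n_iter=20, seed=0):
    """Alternating minimization of Eq. (5); returns Z, anchors, weights, objective trace."""
    rng = np.random.default_rng(seed)
    V, N = len(H_list), H_list[0].shape[0]
    idx = rng.choice(N, size=m, replace=False)      # hypothetical initialization
    B = [H[idx].copy() for H in H_list]
    lam = np.full(V, 1.0 / V)
    history = []
    for _ in range(n_iter):
        # Z-step: closed form of Eq. (6)
        lhs = beta * np.eye(m)
        rhs = np.zeros((m, N))
        for v in range(V):
            lhs += lam[v] ** 2 * (B[v] @ B[v].T + alpha * np.eye(m))
            rhs += lam[v] ** 2 * (B[v] @ H_list[v].T)
        Z = (1.0 + alpha) * np.linalg.solve(lhs, rhs)
        # B-step: solve R B + B (alpha T) = C for every view
        R = Z @ Z.T
        for v in range(V):
            T = H_list[v].T @ H_list[v]
            C = (1.0 + alpha) * (Z @ H_list[v])
            B[v] = solve_sylvester(R, alpha * T, C)
        # lambda-step: weights inversely proportional to the view residuals
        M = np.array([view_loss(H_list[v], B[v], Z, alpha) for v in range(V)])
        lam = (1.0 / M) / np.sum(1.0 / M)
        history.append(float(np.sum(lam ** 2 * M) + beta * np.sum(Z ** 2)))
    return Z, B, lam, history
```

Every step exactly minimizes a convex subproblem, so the recorded objective values should be non-increasing, which gives a quick sanity check on any implementation.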

III-D Complexity Analysis

The adjacency graph is often sparse in real-world scenarios. Consequently, we implement graph filtering with sparse matrix techniques, which takes time linear in $N$, whereas a dense multiplication takes $\mathcal{O}(f_{1}N^{2})$ in general, where $f_{1}=\sum_{v=1}^{V}d_{v}$. Assume there are $t$ iterations in total; then the optimization of $Z$ takes $\mathcal{O}(t\operatorname{max}(m^{3},mf_{1}N))$. Specifically, all multiplications and additions take $\mathcal{O}(tmf_{1}N)$ and the inverse operation needs $\mathcal{O}(tm^{3})$. The optimization of $B^{v}$ and $\{\lambda_{v}\}_{v=1}^{V}$ takes $\mathcal{O}(tf_{2})$ and $\mathcal{O}(tmf_{1}N)$, respectively, where $f_{2}=\sum_{v=1}^{V}d_{v}^{3}$. It is worth pointing out that the complexity of anchor generation is constant with respect to $N$, so it is not limited by the size of the data. We perform SVD on $Z$ and run $K$-means to obtain the clustering result, which take $\mathcal{O}(m^{2}N)$ and $\mathcal{O}(\bar{t}cmN)$, respectively, where $\bar{t}$ is the iteration number of $K$-means and $c$ is the number of clusters. In practice, $d\ll N$, $m\ll N$, and $t\ll N$, while $c$ and $\bar{t}$ are constants, so the proposed method has linear time complexity. Moreover, the largest space cost is $m\times N$ or $N\times d$, which means our approach has linear space complexity.

We compare our complexity with baselines in Table I. The iteration number $t$ is omitted. $\widehat{P}$ represents the average degree of the graph in S$^{3}$GC. $l$ and $K$ are the number of view groups and nearest neighbors for each view group, where a view group is defined as a group of multiple randomly selected views. $B$ is the batch size, and the remaining symbols are the same as those in the main body of CDC. It can be seen that our method compares favorably and that its only unfavorable term is the cubic dependence on the feature dimension. Moreover, for high-dimensional data, dimension reduction techniques can be applied.

TABLE I: The brief complexity analysis of recent SOTA methods.

	Methods	Time	Space
Single-view	MVGRL	$\mathcal{O}(dN^{2}+d^{2}N)$	$\mathcal{O}(N^{2}+dN)$
Single-view	S ${}^{3}$ GC	$\mathcal{O}(\widehat{P}dN)$	$\mathcal{O}(d(B\widehat{P}+N))$
Multi-view	MCGC	$\mathcal{O}(N^{2}+dN)$	$\mathcal{O}(N^{2})$
Multi-view	MvAGC	$\mathcal{O}(mdN)$	$\mathcal{O}((m+d)N)$
Non-graph	EOMSC-CA	$\mathcal{O}((m^{2}+d)N+m^{3})$	$\mathcal{O}((m+d)N)$
Non-graph	FastMICE	$\mathcal{O}(lm^{\frac{1}{2}}V^{\frac{1}{2}}N)$	$\mathcal{O}((c+K+l+V)N)$
Proposed	CDC	$\mathcal{O}((m^{2}+d)N+d^{3})$	$\mathcal{O}((m+d)N)$

IV Theoretical Analysis

We establish theoretical support for our method: 1) filtered features encode node attribute and topology structure; 2) the learned anchor graph is clustering-favorable.

Definition IV.1 (Grouping effect [41]).

Suppose two nodes $i$ and $j$ are similar in terms of local topology and node features, i.e., $\mathcal{V}_{i}\rightarrow\mathcal{V}_{j}\Longleftrightarrow\left(\left\|A_{i}-A_{j}\right\|^{2}\rightarrow 0\right)\wedge\left(\left\|x_{i}-x_{j}\right\|^{2}\rightarrow 0\right)$. The matrix $G$ is said to have a grouping effect if $\mathcal{V}_{i}\rightarrow\mathcal{V}_{j}\Longrightarrow|G_{ip}-G_{jp}|\rightarrow 0,\forall 1\leq p\leq N.$

Theorem IV.2.

Let the distance between filtered nodes $i$ and $j$ be $\|h_{i}-h_{j}\|^{2}$. Then $\|h_{i}-h_{j}\|^{2}\leq\frac{1}{2^{2k-1}}\left[\left\|(A_{i}-A_{j})\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}X\right\|^{2}+\|x_{i}-x_{j}\|^{2}\right]$, where $A_{i}$ denotes the $i$-th row of $A$, i.e., the filtered features $H$ preserve both topology and attribute similarity.

Proof.

Note that $L=I-A$, so $I-\frac{1}{2}L=\frac{A+I}{2}$ and $H=(I-\frac{1}{2}L)^{k}X=\frac{(A+I)^{k}}{2^{k}}X$. Expanding with the binomial theorem gives:

$$H=\frac{(A+I)^{k}}{2^{k}}X=\frac{A\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}+I}{2^{k}}X=\frac{A\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}X+X}{2^{k}}.$$

Then compute the distance between nodes $i$ and $j$:

$$\begin{aligned}
\|h_{i}-h_{j}\|^{2}
&=\left\|\frac{\left(A\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}X+X\right)_{i}-\left(A\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}X+X\right)_{j}}{2^{k}}\right\|^{2}\\
&=\frac{1}{2^{2k}}\left\|(A_{i}-A_{j})\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}X+(x_{i}-x_{j})\right\|^{2}\\
&\leq\frac{1}{2^{2k-1}}\left[\left\|(A_{i}-A_{j})\sum_{l=0}^{k-1}\binom{k}{l+1}A^{l}X\right\|^{2}+\|x_{i}-x_{j}\|^{2}\right],
\end{aligned} \tag{9}$$

where the last step uses $\|a+b\|^{2}\leq 2(\|a\|^{2}+\|b\|^{2})$.

So, if $\mathcal{V}_{i}\rightarrow\mathcal{V}_{j}$ , $\|h_{i}-h_{j}\|\rightarrow 0$ . However, when nodes are similar to each other in only one space, i.e., either $\|A_{i}-A_{j}\|^{2}\rightarrow 0$ or $\|x_{i}-x_{j}\|^{2}\rightarrow 0$ , $\|h_{i}-h_{j}\|^{2}$ has a non-zero upper bound unless $k$ is large enough. This indicates that the filtered representations of similar nodes in both attribute and topology space get closer, and different graph filtering order will adjust this bias.

Theorem IV.3.

Let $G=Z^{\top}$, then $|G_{ip}-G_{jp}|^{2}\leq\|h_{i}-h_{j}\|^{2}\|C_{2}\|^{2}_{F}$, where $C_{2}$ is a constant matrix. Hence, if $\mathcal{V}_{i}\rightarrow\mathcal{V}_{j}$, then $|G_{ip}-G_{jp}|^{2}\rightarrow 0$, i.e., the learned anchor graph $Z$ has a grouping effect.

Proof.

Define $G^{*}=Z^{*}{}^{\top}=(1+\alpha)(HB^{\top})(BB^{\top}+(\alpha+\beta)I_{m})^{-1}$ and $\mathcal{L}_{i}=\|h_{i}-g_{i}B\|^{2}+\alpha\|h_{i}B^{\top}-g_{i}\|^{2}+\beta\|g_{i}\|^{2}$, where $g_{i}$ is the $i$-th row of $G$ and $B_{p}$ is the $p$-th row of $B$. Setting $\frac{\partial\mathcal{L}_{i}}{\partial G_{ip}}|_{g_{i}=g_{i}^{*}}=0$ gives $-(h_{i}-g_{i}B)B_{p}^{\top}-\alpha(h_{i}B_{p}^{\top}-G_{ip})+\beta G_{ip}=0$, which yields $G_{ip}=\frac{(h_{i}-g_{i}B)B_{p}^{\top}+\alpha h_{i}B_{p}^{\top}}{\alpha+\beta}$. Let $C_{1}=(BB^{\top}+(\alpha+\beta)I_{m})^{-1}$, so that $g_{i}=(1+\alpha)h_{i}B^{\top}C_{1}$. Substituting,

$$\begin{aligned}
G_{ip}&=\frac{[h_{i}-(1+\alpha)h_{i}B^{\top}C_{1}B]B_{p}^{\top}+\alpha h_{i}B_{p}^{\top}}{\alpha+\beta}\\
&=\frac{h_{i}(1+\alpha)(I_{d}-B^{\top}C_{1}B)B_{p}^{\top}}{\alpha+\beta}.
\end{aligned}$$

Denoting $C_{2}=\frac{(1+\alpha)(I_{d}-B^{\top}C_{1}B)B_{p}^{\top}}{\alpha+\beta}$, which does not depend on $i$, we have $G_{ip}-G_{jp}=(h_{i}-h_{j})C_{2}$, and the Cauchy-Schwarz inequality gives $|G_{ip}-G_{jp}|^{2}\leq\|h_{i}-h_{j}\|^{2}\|C_{2}\|^{2}_{F}$. ∎

This indicates that local structures of similar nodes tend to be identical on the learned graph $Z$, so the corresponding nodes are clustered into the same group. In other words, the learned graph is clustering-friendly. To intuitively demonstrate the grouping effect of the anchor graph, we plot five diagrams of $Z$; the $Z$ on ACM has a stronger grouping effect than the one on Pubmed.
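The grouping effect can also be checked numerically on synthetic data. The sketch below is an assumption-laden illustration: it uses the single-view closed form $Z=(1+\alpha)(BB^{\top}+(\alpha+\beta)I_{m})^{-1}(BH^{\top})$ with fixed anchors picked at random (not learned anchors), random features standing in for filtered features, and the hypothetical helper name `anchor_graph`. Node 1 is made a near-duplicate of node 0, so their columns of $Z$ should almost coincide.

```python
import numpy as np


def anchor_graph(H, B, alpha, beta):
    """Single-view closed form Z = (1 + alpha) (B B^T + (alpha + beta) I)^{-1} B H^T."""
    m = B.shape[0]
    return (1.0 + alpha) * np.linalg.solve(B @ B.T + (alpha + beta) * np.eye(m), B @ H.T)


alpha, beta = 1.0, 1.0
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 8))                         # stand-in for filtered features
H[1] = H[0] + 1e-6 * rng.normal(size=8)              # node 1 is nearly identical to node 0
B = H[rng.choice(50, size=5, replace=False)].copy()  # hypothetical fixed anchors
Z = anchor_graph(H, B, alpha, beta)

gap_similar = np.abs(Z[:, 0] - Z[:, 1]).max()        # columns of two similar nodes
gap_other = np.abs(Z[:, 0] - Z[:, 2]).max()          # columns of two unrelated nodes
```

Since each column of $Z$ is a fixed linear map of the corresponding row of $H$, the gap between columns is bounded by a constant times $\|h_{i}-h_{j}\|$, which is the mechanism behind Theorem IV.3.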

V Experiments

V-A Datasets and Metrics

TABLE II: Statistical information of datasets.

Type	Category	Datasets	Samples	Edges / Dims	Clusters
Graph	Single-view	Citeseer	3327	4614 / 3703	6
Graph	Single-view	Pubmed	19717	44325 / 500	3
Graph	Multi-relational	ACM	3025	29281, 2210761 / 1830	3
Graph	Multi-relational	DBLP	4057	11113, 5000495, 6776335 / 334	4
Graph	Multi-attribute	AMAP	7487	119043 / 745, 7487	8
Graph	Multi-attribute	AMAC	13381	245778 / 767, 13381	10
Graph	Extra-/Large-scale	Products	2449029	61859140 / 100	47
Graph	Extra-/Large-scale	Papers100M	111059956	1615685872 / 128	172
Non-graph	Large-scale multi-view	YTF-31	101499	507495 / 64, 512, 64, 647, 838	31
Non-graph	Large-scale multi-view	YTF-400	398191	1990955 / 944, 576, 512, 640	400

To show the effectiveness and efficiency of CDC, we evaluate it on 10 benchmark datasets, including 6 multi-view and 4 single-view datasets. More specifically, ACM and DBLP [17] are multi-view graphs with multiple relations, AMAP and AMAC [37] are multi-attribute graphs, and YTF-31 [7] and YTF-400 [27] are multi-view non-graph data (YouTube-Faces); Citeseer, Pubmed [12], Products, and Papers100M [42] are single-view graphs, where the latter two are from the Open Graph Benchmark [43]. The statistical information of these datasets is shown in Table II. Most notably, YTF-400 represents the largest multi-view non-graph dataset, while Papers100M is the largest graph used in the clustering task. We adopt four popular clustering metrics: accuracy (ACC), Normalized Mutual Information (NMI), F1 score, and Adjusted Rand Index (ARI). Higher values indicate better performance.
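Because cluster labels are only defined up to a permutation, ACC is commonly computed after an optimal one-to-one matching between predicted clusters and ground-truth classes. The sketch below assumes that usual definition, which the text does not spell out, and uses the Hungarian solver from SciPy; the function name `clustering_accuracy` is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred):
    """Fraction of samples correctly labeled under the best cluster-to-class matching."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    true_ids, t = np.unique(y_true, return_inverse=True)
    pred_ids, p = np.unique(y_pred, return_inverse=True)
    # contingency table: counts[a, b] = samples in predicted cluster a with true class b
    counts = np.zeros((pred_ids.size, true_ids.size), dtype=np.int64)
    np.add.at(counts, (p, t), 1)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / y_true.size
```

For example, `clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])` treats the swapped label names as a perfect clustering.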

V-B Experimental Setup

We compare CDC with a number of single-view methods as well as multi-view methods.
Single-view graph Baselines include MinCutPool [44], METIS [45], Node2vec [29], DGI [46], DMoN [47], GRACE [33], BGRL [32], MVGRL [31], and S ${}^{3}$ GC [18].

METIS uses only structural information to partition graphs. Node2vec is a well-known graph embedding algorithm based on random walks. MinCutPool and DMoN integrate spectral clustering with graph neural networks. DGI learns node representations by maximizing mutual information between patch representations and corresponding high-level summaries of graphs. GRACE, BGRL, and MVGRL are three contrastive graph representation learning methods. S ${}^{3}$ GC is a recent scalable graph clustering method, which uses light-weight encoder and random walk-based sampler.

Multi-view graph There are 10 baselines on multi-view graphs clustering, including SDCN [34], DAEGC [14], O2MAC [17], HDMI [48], CMGEC [24], COMPLETER [49], MvAGC [37], MCGC [11], MVGRL [31], and MAGCN [16]. The first six methods are only applicable to data with multiple graphs or multiple attributes, whereas the last four are applicable to general multi-view graph data.
GCNs and graph attention auto-encoders are used in SDCN and DAEGC, respectively.

To get consistent embedding, CMGEC adds a graph fusion network to multiple graph auto-encoders. In O2MAC, the most informative view is selected to learn cluster representation. HDMI learns node embeddings by using high-order mutual information. MAGCN applies graph auto-encoder on both attributes and topological graphs to learn consensus representations. Through contrastive mechanisms, COMPLETER and MVGRL learn a common representation shared across multiple views and graphs. MCGC uses a contrastive regularizer to boost the quality of the learned graph. In MvAGC, high-order topological interactions are explored to improve clustering performance.

Non-graph We compare CDC with six scalable MVC methods on non-graph data, including BMVC [9], LMVSC [5], MSGL [26], FPMVS [40], EOMSC-CA [7], and FastMICE [27].

BMVC learns discrete representations and binary cluster structures jointly to integrate collaborative information. MSGL and LMVSC are two scalable subspace clustering methods. FPMVS and EOMSC-CA are two adaptive anchor-based algorithms. The differences between CDC and them are: 1) CDC uses a similarity-preserving regularizer, while anchor matrices are assumed to be unitary matrices in FPMVS and EOMSC-CA; 2) the complexity of anchor generation in CDC is not linked with data size. FastMICE constructs anchor graphs by using features, anchors, and neighbors jointly.

Parameter setting The balance parameters $\alpha$ and $\beta$ are searched over $[1e-3,1,1e1,1e3,1e4]$. The number of anchors $m$ is searched over $[c,10,30,50,70,100]$. All experiments are conducted on the same machine with an Intel(R) Core(TM) i9-12900K CPU, two GeForce RTX 3090 GPUs, and 128GB RAM.

V-C Results

TABLE III: Results on extra-large-scale graph.

Metrics	Papers100M
Metrics	$K$ -means	Node2vec	DGI	S ${}^{3}$ GC	CDC
ACC	0.144	0.175	0.151	0.173	0.174
NMI	0.368	0.380	0.416	0.453	0.427
ARI	0.074	0.112	0.096	0.110	0.114
F1	0.101	0.099	0.111	0.118	0.119

V-C1 Single-view Scenario

The results on the small-scale graph Citeseer, medium-scale graph Pubmed, large-scale graph Products, and extra-large-scale graph Papers100M are shown in Table III and Table IV. Note that most neural network-based methods cannot handle large and extra-large-scale graphs. On Citeseer and Pubmed, our method achieves the best results in nearly all metrics (MVGRL attains a higher NMI on Citeseer), and on Products and Papers100M it produces competitive results. In particular, on Pubmed, CDC surpasses the most recent S$^{3}$GC method by roughly 3$\%$ in all metrics (2.8 to 3.8 points). CDC also shows a slight advantage over S$^{3}$GC in ACC, ARI, and F1 on the largest Papers100M dataset. Furthermore, CDC incurs a lower time cost than S$^{3}$GC. Specifically, it takes $\sim$5 min and $\sim$4 h on Products and Papers100M, whereas S$^{3}$GC consumes $\sim$1 h and $\sim$24 h, which demonstrates CDC’s efficiency on large and extra-large-scale graphs. Compared with many GNN-based methods, such as MinCutPool, DGI, DMoN, GRACE, BGRL, and MVGRL, CDC demonstrates a clear edge.

TABLE IV: Results on single-view graphs. "-" denotes that the method ran out of memory (OM) or did not converge. The best results are denoted with red and the second-best with blue.

	Citeseer				Pubmed				Products
Method	ACC	NMI	ARI	F1	ACC	NMI	ARI	F1	ACC	NMI	ARI	F1
MinCutPool	0.537	0.295	0.262	0.516	0.521	0.214	0.175	0.445	0.257	0.430	0.180	0.130
METIS	0.413	0.170	0.150	0.400	0.693	0.297	0.323	0.682	0.294	0.468	0.220	0.145
Node2vec	0.421	0.240	0.116	0.401	0.641	0.288	0.258	0.634	0.357	0.489	0.247	0.170
DGI	0.686	0.435	0.445	0.643	0.657	0.322	0.292	0.654	0.320	0.467	0.192	0.174
DMoN	0.385	0.303	0.200	0.437	0.351	0.257	0.108	0.343	0.304	0.428	0.210	0.139
GRACE	0.631	0.399	0.377	0.603	0.637	0.308	0.276	0.628	-	-	-	-
BGRL	0.675	0.422	0.428	0.631	0.654	0.315	0.285	0.649	-	-	-	-
MVGRL	0.703	0.459	0.471	0.654	0.675	0.345	0.310	0.672	-	-	-	-
S ${}^{3}$ GC	0.688	0.441	0.448	0.643	0.713	0.333	0.345	0.703	0.402	0.536	0.25	0.23
CDC	0.709	0.444	0.471	0.661	0.741	0.371	0.383	0.737	0.366	0.390	0.121	0.187

V-C2 Multi-view Scenario

Clustering on Multi-view Graphs

The clustering results of CDC on multi-attribute and multi-relational graphs are reported in Table V and Table VI. CDC outperforms all other methods on the four benchmarks in all metrics. For example, compared to the second-best method MCGC, ACC, NMI, and ARI on ACM, AMAP, and AMAC are improved by about 5$\%$, 7$\%$, and 9$\%$ on average, respectively. Although MvAGC samples nodes as anchors, it takes more time than CDC since its sampling strategy suffers from low efficiency. Specifically, CDC is more than $2\times$ and $5\times$ faster on the multi-relational and multi-attribute graphs, respectively. Compared to other methods, the advantage is more significant. Therefore, CDC is a promising clustering method for graph data of various forms.

TABLE V: Results on the multi-relational graph.

Method	ACM				DBLP
Method	ACC	NMI	ARI	F1	ACC	NMI	ARI	F1
SDCN	0.863	0.578	0.639	0.862	0.650	0.298	0.310	0.638
DAEGC	0.891	0.643	0.705	0.891	0.873	0.674	0.701	0.862
O2MAC	0.904	0.692	0.739	0.905	0.907	0.729	0.778	0.901
HDMI	0.874	0.645	0.674	0.872	0.885	0.692	0.753	0.865
CMGEC	0.909	0.691	0.723	0.907	0.910	0.724	0.786	0.904
MvAGC	0.898	0.674	0.721	0.899	0.928	0.773	0.828	0.923
MCGC	0.915	0.713	0.763	0.916	0.930	0.775	0.830	0.925
CDC	0.936	0.769	0.817	0.936	0.933	0.781	0.836	0.929

TABLE VI: Results on the multi-attribute graph.

Datasets	AMAP				AMAC
Method	ACC	NMI	ARI	F1	ACC	NMI	ARI	F1
COMPLETER	0.368	0.261	0.076	0.307	0.242	0.156	0.054	0.160
MVGRL	0.505	0.433	0.238	0.460	0.245	0.101	0.055	0.171
MAGCN	0.517	0.390	0.240	0.474	-	-	-	-
MvAGC	0.678	0.524	0.397	0.640	0.580	0.396	0.322	0.412
MCGC	0.716	0.615	0.432	0.686	0.597	0.532	0.390	0.520
CDC	0.795	0.707	0.620	0.730	0.647	0.604	0.437	0.546

Clustering on Multi-view Non-graph data

TABLE VII: Results on large-scale multi-view non-graph data.

Method	YTF-31			YTF-400
Method	ACC	NMI	F1	ACC	NMI	F1
BMVC	0.090	0.059	0.058	-	-	-
LMVSC	0.140	0.118	0.083	0.489	0.767	0.589
MSGL	0.167	0.001	0.151	0.502	0.738	0.606
FPMVS	0.230	0.234	0.140	0.562	0.797	0.472
EOMSC-CA	0.265	0.003	0.164	0.570	0.779	0.408
FastMICE	0.275	0.236	0.295	0.564	0.798	0.509
CDC	0.285	0.260	0.298	0.571	0.745	0.591

To process non-graph data, we manually construct 5-nearest neighbor graphs for graph filtering. Table VII shows the clustering results on YTF-31 and YTF-400. We find that most existing methods can’t handle YTF-400, which is the largest non-graph multi-view data. CDC still achieves the best results in most cases. Though some others also use anchor ideas, their computation time cost is still high. Specifically, CDC takes $\sim$ 20s and $\sim$ 1min respectively, while EOMSC-CA needs $\sim$ 2mins, $\sim$ 6mins and FastMICE takes $\sim$ 30s, $\sim$ 3mins on these two datasets. This verifies that CDC is also a promising clustering method for non-graph data. The time cost of several recent SOTA methods is summarized in Fig. 2.

V-D Ablation Study

V-D1 Effect of Similarity-Preserving

Anchors are generated adaptively in the similarity space constrained by a similarity-preserving (marked as SP) regularizer. To show the effect of SP, we remove it from the model and test the performance of CDC w/o SP on Pubmed and ACM in Table VIII. Similarity preserving improves the clustering performance by about 3 $\%$ on average. Moreover, CDC takes less time than CDC w/o SP on both datasets. The reason is that the computational complexity of $B^{v}=(ZZ^{\top})^{-1}(ZH^{v})$ is $\mathcal{O}(md_{v}N)$ , which is higher than the $\mathcal{O}(d_{v}^{3})$ cost in CDC. Therefore, as a bonus, the SP regularizer helps to reduce the complexity of anchor generation. It also improves the quality of the anchors. As observed in Fig. 3, CDC achieves the best results with a few anchors, which further reduces the computation cost. In fact, too many anchors could deteriorate the performance, since noisy anchors that are not representative could be introduced.
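To make the cost of the closed-form update $B^{v}=(ZZ^{\top})^{-1}(ZH^{v})$ concrete, the following sketch computes it with NumPy. It is an illustration under stated assumptions, not the authors' code: the function name `anchors_closed_form` is hypothetical, the shapes $Z\in\mathbb{R}^{m\times N}$ and $H^{v}\in\mathbb{R}^{N\times d_{v}}$ are inferred from the formula, and the small ridge term is added only as a numerical safeguard.

```python
import numpy as np

def anchors_closed_form(Z, H, ridge=1e-8):
    """Solve (Z Z^T) B = Z H for B, i.e. B = (Z Z^T)^{-1} (Z H).

    Z: (m, N) anchor graph, H: (N, d_v) filtered features, B: (m, d_v) anchors.
    """
    m = Z.shape[0]
    ZH = Z @ H                            # O(m * d_v * N): the term quoted in the text
    gram = Z @ Z.T + ridge * np.eye(m)    # m x m Gram matrix; ridge is an assumption
    return np.linalg.solve(gram, ZH)      # linear solve instead of an explicit inverse
```

The product `Z @ H` touches all $N$ samples, which is why this update grows with the dataset size, whereas a cost of $\mathcal{O}(d_{v}^{3})$ depends only on the feature dimension.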

TABLE VIII: Results of CDC with/without GF and SP.

Datasets	Pubmed				ACM
Method	ACC	NMI	F1	Time (s)	ACC	NMI	F1	Time (s)
CDC	0.741	0.371	0.737	2.03	0.936	0.769	0.936	0.81
w/o SP	0.707	0.349	0.704	5.91	0.918	0.710	0.919	1.03
w/o SP	(-0.034)	(-0.022)	(-0.033)	(+3.88)	(-0.018)	(-0.059)	(-0.017)	(+0.22)
w/o GF	0.626	0.256	0.639	2.86	0.872	0.585	0.871	0.88
w/o GF	(-0.115)	(-0.115)	(-0.098)	(+0.83)	(-0.064)	(-0.184)	(-0.055)	(+0.07)

V-D2 Effect of Graph Filtering

Graph filtering (marked as GF) is applied to integrate node attributes and topology in our method. Besides showing theoretically that the anchor graph learned from filtered representations is clustering-favorable, we also verify this experimentally in Table VIII. The performance of CDC w/o GF drops by about 10 $\%$ on average, which validates the significance of graph filtering. We also observe an increase in run time, which could be caused by slower convergence due to the loss of cluster-ability.

V-D3 Robustness on Heterophily

TABLE IX: Results on heterophilic graphs.

Datasets	Texas		Cornell		Wisconsin		Squirrel
Methods	ACC	NMI	ACC	NMI	ACC	NMI	ACC	NMI
CDRS	0.599	0.154	-	-	0.562	0.137	-	-
CGC	0.615	0.215	0.446	0.141	0.559	0.230	0.272	0.030
CDC	0.672	0.293	0.514	0.142	0.637	0.318	0.279	0.043

In some real-world applications, graphs could be heterophilic, where connected nodes tend to have different labels [50]. To show the robustness of CDC on heterophily, we report the results on several popular heterophilic graphs, including Texas, Cornell, Wisconsin [51], and Squirrel [52]. As shown in Table IX, our proposed CDC outperforms the recent SOTA methods CDRS [53] and CGC [50]. Although the low-pass filter used here is considered less useful for heterophily, CDC still works well because of the high-quality anchors and the clustering-friendly graph. In fact, there are few graph clustering methods for heterophily. Further work, such as an omnipotent filter, could contribute a lot to clustering on heterophilic graphs.
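One common way to quantify how heterophilic a labeled graph is, not defined in this paper and used here only as an illustration, is the edge homophily ratio: the fraction of edges whose two endpoints share a label. Low values indicate heterophily. The function name `edge_homophily` and the edge-list input format are assumptions.

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges (u, v) with labels[u] == labels[v]."""
    edges = np.asarray(edges)      # shape (num_edges, 2), integer node indices
    labels = np.asarray(labels)    # shape (num_nodes,)
    same = labels[edges[:, 0]] == labels[edges[:, 1]]
    return float(same.mean())

# Toy path graph 0-1-2-3 with labels [0, 0, 1, 1]: two of the three edges are intra-class.
ratio = edge_homophily([(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1])
```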

VI Parameter Analysis

There are two trade-off parameters, $\alpha$ and $\beta$ , in our model. As shown in Fig. 4, although CDC works well for a wide range of $\alpha$ and $\beta$ , fine-tuning does enhance its performance. $\beta$ has less impact than $\alpha$ , which indicates that the similarity-preserving regularizer is more important. Since CDC has linear complexity, the fine-tuning procedure takes little time.

We also visualize the objective function value of CDC on ACM, Pubmed, and YTF-31 in Fig. 5. The loss converges quickly on all three datasets.

VII Conclusion

In this paper, we propose a simple framework for clustering complex data, which is readily applicable to graph and non-graph, multi-view and single-view data. The developed method has linear complexity and nice theoretical properties. With graph filtering, we integrate deep structural information and learn representations with cluster-ability. In particular, a similarity-preserving regularizer is designed to adaptively generate high-quality anchors, which alleviates the burden and randomness of anchor selection. CDC demonstrates its effectiveness and efficiency with impressive results on 14 complex datasets, and it even exceeds the performance of many complex GNN-based methods. In light of the simplicity of the proposed framework and its effectiveness on various types of data, this work could have a broad impact on the clustering community and has high potential for deployment in real applications. One potential limitation of CDC is that it might not handle high-dimensional data efficiently, since anchor generation has a complexity that is cubic in the feature dimension.

References

[1] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition letters, vol. 31, no. 8, pp. 651–666, 2010.
[2] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
[3] K. Zhan, F. Nie, J. Wang, and Y. Yang, “Multiview consensus graph clustering,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1261–1270, 2018.
[4] X. Yang, C. Deng, Z. Dang, and D. Tao, “Deep multiview collaborative clustering,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
[5] Z. Kang, W. Zhou, Z. Zhao, J. Shao, M. Han, and Z. Xu, “Large-scale multi-view subspace clustering in linear time,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 4412–4419.
[6] X. Li, H. Zhang, R. Wang, and F. Nie, “Multiview clustering: A scalable and parameter-free bipartite graph fusion method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 330–344, 2020.
[7] S. Liu, S. Wang, P. Zhang, K. Xu, X. Liu, C. Zhang, and F. Gao, “Efficient one-pass multi-view subspace clustering with consensus anchors,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 7, 2022, pp. 7576–7584.
[8] M. Sun, P. Zhang, S. Wang, S. Zhou, W. Tu, X. Liu, E. Zhu, and C. Wang, “Scalable multi-view subspace clustering with unified anchors,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3528–3536.
[9] Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao, “Binary multi-view clustering,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1774–1782, 2018.
[10] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community preserving network embedding,” in Proceedings of the AAAI conference on artificial intelligence, vol. 31, no. 1, 2017.
[11] E. Pan and Z. Kang, “Multi-view contrastive graph clustering,” Advances in neural information processing systems, vol. 34, pp. 2148–2159, 2021.
[12] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, 2017.
[13] X. Zhang, H. Liu, Q. Li, and X. Wu, “Attributed graph clustering via adaptive graph convolution,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, S. Kraus, Ed., 2019, pp. 4327–4333.
[14] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang, “Attributed graph clustering: A deep attentional embedding approach,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, 2019, pp. 3670–3676.
[15] S. Pan, R. Hu, S.-f. Fung, G. Long, J. Jiang, and C. Zhang, “Learning graph embedding with adversarial training methods,” IEEE transactions on cybernetics, vol. 50, no. 6, pp. 2475–2487, 2019.
[16] J. Cheng, Q. Wang, Z. Tao, D. Xie, and Q. Gao, “Multi-view attribute graph convolution networks for clustering,” in Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 2973–2979.
[17] S. Fan, X. Wang, C. Shi, E. Lu, K. Lin, and B. Wang, “One2multi graph autoencoder for multi-view graph clustering,” in Proceedings of The Web Conference 2020, 2020, pp. 3070–3076.
[18] F. Devvrit, A. Sinha, I. S. Dhillon, and P. Jain, “S3GC: Scalable self-supervised graph clustering,” in Advances in Neural Information Processing Systems, 2022.
[19] Y. Liu, K. Liang, J. Xia, S. Zhou, X. Yang, X. Liu, and S. Z. Li, “Dink-net: Neural clustering on large graphs,” in International Conference on Machine Learning, ICML 2023. PMLR, 2023.
[20] D. J. Trosten, S. Lokse, R. Jenssen, and M. Kampffmeyer, “Reconsidering representation alignment for multi-view clustering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1255–1265.
[21] R. Li, C. Zhang, Q. Hu, P. Zhu, and Z. Wang, “Flexible multi-view representation learning for subspace clustering.” pp. 2916–2922, 2019.
[22] S. Mitra, M. Hasanuzzaman, and S. Saha, “A unified multi-view clustering algorithm using multi-objective optimization coupled with generative model,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 14, no. 1, pp. 1–31, 2020.
[23] X. Peng, Z. Huang, J. Lv, H. Zhu, and J. T. Zhou, “Comic: Multi-view clustering without parameter selection,” in International conference on machine learning. PMLR, 2019, pp. 5092–5101.
[24] Y. Wang, D. Chang, Z. Fu, and Y. Zhao, “Consistent multiple graph embedding for multi-view clustering,” IEEE Transactions on Multimedia, 2021.
[25] Y. Lin, Y. Gou, X. Liu, J. Bai, J. Lv, and X. Peng, “Dual contrastive prediction for incomplete multi-view representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[26] Z. Kang, Z. Lin, X. Zhu, and W. Xu, “Structured graph learning for scalable subspace clustering: From single view to multiview,” IEEE Transactions on Cybernetics, vol. 52, no. 9, pp. 8976–8986, 2022.
[27] D. Huang, C.-D. Wang, and J.-H. Lai, “Fast multi-view clustering via ensembles: Towards scalability, superiority, and simplicity,” IEEE Transactions on Knowledge and Data Engineering, 2023.
[28] S. Liu, X. Liu, S. Wang, X. Niu, and E. Zhu, “Fast incomplete multi-view clustering with view-independent anchors,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[29] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
[30] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” NIPS Workshop on Bayesian Deep Learning, 2016.
[31] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in International Conference on Machine Learning. PMLR, 2020, pp. 4116–4126.
[32] S. Thakoor, C. Tallec, M. G. Azar, R. Munos, P. Veličković, and M. Valko, “Bootstrapped representation learning on graphs,” in ICLR 2021 Workshop on Geometrical and Topological Representation Learning, 2021.
[33] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Deep Graph Contrastive Representation Learning,” in ICML Workshop on Graph Representation Learning and Beyond, 2020.
[34] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, “Structural deep clustering network,” in Proceedings of The Web Conference 2020, 2020, pp. 1400–1410.
[35] W. Tu, S. Zhou, X. Liu, X. Guo, Z. Cai, E. Zhu, and J. Cheng, “Deep fusion clustering network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 9978–9987.
[36] Y. Liu, W. Tu, S. Zhou, X. Liu, L. Song, X. Yang, and E. Zhu, “Deep graph clustering via dual correlation reduction,” in Proc. of AAAI, 2022.
[37] Z. Lin and Z. Kang, “Graph filter-based multi-view attributed graph clustering.” in IJCAI, 2021, pp. 2723–2729.
[38] M. Hamidouche, C. Lassance, Y. Hu, L. Drumetz, B. Pasdeloup, and V. Gripon, “Improving classification accuracy with graph filtering,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 334–338.
[39] Z. Kang, W. Zhou, Z. Zhao, J. Shao, M. Han, and Z. Xu, “Large-scale multi-view subspace clustering in linear time,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 4412–4419.
[40] S. Wang, X. Liu, X. Zhu, P. Zhang, Y. Zhang, F. Gao, and E. Zhu, “Fast parameter-free multi-view subspace clustering with consensus anchor guidance,” IEEE Transactions on Image Processing, vol. 31, pp. 556–568, 2021.
[41] X. Li, B. Kao, C. Shan, D. Yin, and M. Ester, “CAST: A correlation-based adaptive spectral clustering algorithm on multi-scale data,” in The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2020, pp. 439–449.
[42] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
[43] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22118–22133, 2020.
[44] F. M. Bianchi, D. Grattarola, and C. Alippi, “Spectral clustering with graph neural networks for graph pooling,” in International Conference on Machine Learning. PMLR, 2020, pp. 874–883.
[45] G. Karypis and V. Kumar, “Metis: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices,” 1997.
[46] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax.” ICLR, vol. 2, no. 3, p. 4, 2019.
[47] A. Tsitsulin, J. Palowitch, B. Perozzi, and E. Müller, “Graph clustering with graph neural networks,” in Proceedings of the 16th International Workshop on Mining and Learning with Graphs (MLG), 2020.
[48] B. Jing, C. Park, and H. Tong, “Hdmi: High-order deep multiplex infomax,” in Proceedings of the Web Conference 2021, 2021, pp. 2414–2424.
[49] Y. Lin, Y. Gou, Z. Liu, B. Li, J. Lv, and X. Peng, “Completer: Incomplete multi-view clustering via contrastive prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11174–11183.
[50] X. Xie, W. Chen, Z. Kang, and C. Peng, “Contrastive graph clustering with adaptive filter,” Expert Systems with Applications, vol. 219, p. 119645, 2023.
[51] H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang, “Geom-gcn: Geometric graph convolutional networks,” in 8th International Conference on Learning Representations, ICLR 2020, 2020.
[52] B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” Journal of Complex Networks, vol. 9, no. 2, 2021.
[53] P. Zhu, J. Li, Y. Wang, B. Xiao, S. Zhao, and Q. Hu, “Collaborative decision-reinforced self-supervision for attributed graph clustering,” IEEE Transactions on Neural Networks and Learning Systems, 2022.

		$\displaystyle\|h_{i}-h_{j}\|^{2}$		(9)
		$\displaystyle=\left\|\frac{(A\sum_{i=0}^{k-1}{i\choose N}A^{i}X+X)_{i}-(A\sum_{i=0}^{k-1}{i\choose N}A^{i}X+X)_{j}}{2^{k}}\right\|^{2}$
		$\displaystyle=\frac{1}{2^{2k}}\left\|(A_{i}-A_{j})\sum_{i=0}^{k-1}{i\choose N}A^{i}X+(X_{i}-X_{j})\right\|^{2}$
		$\displaystyle\leq\frac{1}{2^{2k}}\left[\left\|(A_{i}-A_{j})\sum_{i=0}^{k-1}{i\choose N}A^{i}X\right\|^{2}+\|X_{i}-X_{j}\|^{2}\right]$
