A Survey on Deep Clustering: From the Prior Perspective

Yiding Lu, Haobin Li, Yunfan Li, Yijie Lin, Xi Peng*

These authors contributed equally to this work.

College of Computer Science, Sichuan University, Chengdu, Sichuan, China

Abstract

Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.

keywords:
Clustering, Deep Clustering, Unsupervised Learning

1 Introduction

As a fundamental problem in machine learning, clustering aims at grouping data instances into several clusters, where instances from the same cluster share similar semantics and instances from different clusters are dissimilar. Clustering can reveal the inherent semantic structure underlying the data, which benefits downstream analyses such as anomaly detection [86], person re-identification [115], community detection [96], and domain adaptation [111].

In the early stage, various classic clustering methods were developed, such as centroid-based clustering [64], density-based clustering [19], hierarchical clustering [71], and so on. These shallow methods are grounded in theory and enjoy high interpretability. Later, some works extended shallow clustering methods to diverse data types, such as multi-view [117, 75, 76, 98] and graph data [73, 87]. Other efforts have been made to improve the scalability [118] of shallow clustering methods.

However, shallow clustering methods partition instances based on the similarity [64] or density [19] of the given raw or linearly transformed data. Due to their limited feature extraction ability, shallow clustering methods achieve sub-optimal results when confronted with complex, high-dimensional, and non-linear real-world data. To tackle this challenge, deep clustering techniques have been proposed to incorporate neural networks into clustering methods. In other words, deep clustering simultaneously learns discriminative representations and performs clustering on the learned features, so that the two tasks progressively benefit each other.

Over the past few years, many efforts have been devoted to improving the clustering performance from various aspects, such as network architectures [8, 74], training strategies [69], and loss functions [40, 124]. However, we would like to highlight that the fundamental challenge of deep clustering is the absence of data annotations. Consequently, the key to deep clustering lies in introducing proper prior knowledge to construct the supervision signals. From the early data structure assumption to the recent data augmentation invariance, the development of deep clustering methods intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods from the perspective of prior knowledge.

Inspired by traditional clustering and dimensionality reduction approaches [85, 4], the early deep clustering methods [33, 79, 91] build upon the structure prior of data. Based on the assumption that the inherent data structure could reflect the semantic relation, these methods incorporate classic manifold [85] or subspace learning [101] objectives to optimize the neural network for feature extraction and clustering. The second type of prior knowledge is the distribution prior, which assumes that instances from different clusters follow distinct distributions. Based on such a prior, several generative deep clustering methods [40, 69] propose to learn the latent distribution of samples for the data partition. In the past few years, the success of contrastive learning has spawned a new category of prior knowledge, namely, augmentation invariance. Instead of mining data priors, researchers turn to constructing additional priors with the data augmentation technique. Leveraging the invariance across different augmented samples at the instance representation and cluster assignment levels, numerous contrastive clustering methods [39, 52] significantly improve the feature discriminability and clustering performance. Further, researchers find that instances of the same semantics are likely to be mapped into nearby points in the latent space, and accordingly propose the neighborhood consistency prior. Specifically, by encouraging neighboring samples to have similar cluster assignments, several works [97, 124] alleviate the false-negative problem in the contrastive clustering paradigm, thus advancing the clustering results. Another branch of progress is based on the pseudo label prior, namely, that cluster assignments with high confidence are likely to be correct. By selecting confident predictions as pseudo labels, several studies further boost the clustering performance through pseudo-labeling [53, 81] and semi-supervised learning [77].
Very recently, instead of pursuing internal priors from the data itself, some works [7, 54] attempt to introduce abundant external knowledge such as textual descriptions to guide clustering.

In summary, the essence of deep clustering lies in how to find and leverage effective prior knowledge, for both feature extraction and cluster assignment. To provide an overview of the development of deep clustering, in this paper, we categorize a series of state-of-the-art approaches according to the taxonomy of prior knowledge. We hope such a new perspective for deep clustering could inspire future research in the community. The rest of this paper is organized as follows: First, Section 2 introduces the preliminaries on deep clustering. Section 3 reviews the existing deep clustering methods from the prior knowledge perspective. Then, Section 4 provides experimental analyses of deep clustering methods. After that, Section 5 briefly introduces some applications of deep clustering in the vicinagearth security. Lastly, Section 6 summarizes some notable trends and challenges for deep clustering.

Related Surveys

We notice that several surveys on deep clustering have been published in recent years. Briefly, Min et al. [66] categorize deep clustering methods according to the network architecture. Dong et al. [18] focus on applications of deep clustering. Ren et al. [84] summarize existing methods from the view of data types, such as single- and multi-view. Zhou et al. [125] discuss various interactions between representation learning and clustering. Distinct from existing surveys, this work systematically provides a new perspective based on prior knowledge, which plays a more intrinsic and essential role in deep clustering.

2 Problem Definition

In this section, we introduce the pipeline of deep clustering, including the notation and problem definition. Unless otherwise specified, we use bold uppercase and bold lowercase letters to denote matrices and vectors, respectively. The commonly used notations are summarized in Table 1.

The deep clustering problem is formally defined as follows: given a set of instances $\mathcal{D}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{N}\in\mathcal{X}$ belonging to $C$ classes, deep clustering aims to learn discriminative features and group the instances into $C$ clusters according to their semantics. Specifically, deep clustering methods first learn a deep neural network $f:\mathcal{X}\rightarrow\mathcal{Z}$ for feature extraction, i.e., $\mathbf{z}_{i}=f(\mathbf{x}_{i})$. Given instance features in the latent space, clustering results can be obtained in two ways. The most straightforward way is to apply classic algorithms such as K-means [64] and DBSCAN [19] to the learned features. The other solution is to train an additional cluster head $h:\mathcal{Z}\rightarrow\mathbb{R}^{C}$ to produce the soft cluster assignment $\mathbf{p}_{i}=\operatorname{softmax}(h(\mathbf{z}_{i}))$, which satisfies $\sum_{j=1}^{C}\mathbf{p}_{ij}=1$. The hard cluster assignment for the $i$-th instance can then be computed by the $\arg\max$ operation, namely,

$$\tilde{y}_{i}=\arg\max_{j}\ \mathbf{p}_{ij},\quad 1\leq j\leq C. \tag{1}$$

The cluster assignments provide the inherent semantic structure underlying the data, which could be utilized in various downstream analyses.
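
As an illustration of the cluster-head route, the following minimal sketch turns cluster-head outputs $h(\mathbf{z}_i)$ into soft assignments and then into hard labels via Eq. 1. It is written in plain Python for readability; the helper names and the toy head outputs are hypothetical and do not come from any surveyed method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_assignments(head_outputs):
    """Map cluster-head outputs h(z_i) to hard labels as in Eq. 1.

    Labels are 1-based (1 <= j <= C) to match the notation in the text.
    """
    labels = []
    for logits in head_outputs:
        p = softmax(logits)  # soft assignment p_i, entries sum to 1
        j = max(range(len(p)), key=p.__getitem__)  # index of the largest p_ij
        labels.append(j + 1)
    return labels


# Hypothetical head outputs for N = 3 instances and C = 3 clusters.
head_outputs = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [-0.3, 1.7, 0.4]]
print(hard_assignments(head_outputs))  # [1, 3, 2]
```

Because the softmax is monotone, the hard label depends only on the largest head output; the soft assignment matters for the training objectives discussed in later sections.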

Figure 1: Six categories of prior knowledge for deep clustering. (a) Structure Prior: data structure could reflect the semantic relation between instances. (b) Distribution Prior: instances from different clusters follow distinct data distributions. (c) Augmentation Invariance: samples augmented from the same instance have similar features. (d) Neighborhood Consistency: neighboring samples have consistent cluster assignments. (e) Pseudo Label: cluster assignments with high confidence are likely to be correct. (f) External Knowledge: abundant knowledge favorable to clustering exists in open-world data and models.
Table 1: Commonly used mathematical notations.

| Notation | Explanation |
|---|---|
| $N$ | Number of data instances |
| $B$ | Size of a mini-batch |
| $C$ | Number of clusters |
| $f(\cdot)$ | Encoder network |
| $h(\cdot)$ | Cluster head |
| $\mathbf{x}_{i}$ | $i$-th data instance |
| $\mathbf{z}_{i}$ | Feature of the $i$-th instance |
| $\tilde{y}_{i}$ | Pseudo label of the $i$-th instance |
| $\lVert\cdot\rVert$ | L2-norm of a vector |
| $\langle\cdot,\cdot\rangle$ | Dot product operator |
| $\operatorname{s}(\mathbf{a},\mathbf{b})$ | Cosine similarity, i.e., $\operatorname{s}(\mathbf{a},\mathbf{b})=\frac{\langle\mathbf{a},\mathbf{b}\rangle}{\lVert\mathbf{a}\rVert\lVert\mathbf{b}\rVert}$ |
| $\mathbf{c}_{i}$ | Centroid of the $i$-th cluster |
| $H(\cdot)$ | Entropy, i.e., $H(X)=-\sum_{x\in X}p(x)\log p(x)$ |
| $H(\cdot\mid\cdot)$ | Conditional entropy, i.e., $H(Y\mid X)=-\sum_{x\in X,y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)}$ |
| $I(\cdot;\cdot)$ | Mutual information, i.e., $I(X;Y)=H(X)-H(X\mid Y)$ |
| $\tau$ | Temperature coefficient of contrastive loss |

3 Priors for Deep Clustering

In this section, we review existing deep clustering methods from the perspective of prior knowledge. The priors are illustrated in Figure 1 and the method categorization is summarized in Table 2.

3.1 Structure Prior

The structure prior is mostly inspired by traditional clustering methods. Traditional clustering is mainly rooted in assumptions about the structural characteristics of clusters in the data space. For example, K-means [64] aims to learn $k$ cluster centroids, assuming that the instances in each cluster form a spherical structure around its centroid. DBSCAN [19] is based on the assumption that a cluster in the data space is a contiguous region of high point density, separated from other such clusters by regions of low point density. Spectral clustering [4] assumes that data lie on a locally linear manifold, so that local neighborhood relations should be preserved in the latent space; such methods partition instances according to the graph Laplacian. Agglomerative clustering [25] considers the hierarchical structure of data and performs clustering with merging and splitting. Motivated by the success of classic clustering methods, the early exploration of deep clustering mainly focuses on adapting mature structure priors as objective functions to optimize neural networks.

Given well-structured data in the latent space, ABDC [95] iteratively optimizes the data representation and clustering centers motivated by K-means. As the deep extension of classic spectral clustering, DEN [33], SpectralNet [91], and MvLNet [34, 35] compute the graph Laplacian in the latent space learned by auto-encoder [5] and SiameseNets [28, 90], respectively. Likewise, DCC [89] extends the core idea of RCC [88] by performing a relation matching based on the similarity between latent features. The auto-encoder is then optimized by minimizing the distance of paired instances in the latent space. PARTY [79] is the first deep subspace clustering method, which introduces the sparsity prior and self-representation property in subspace learning to optimize neural networks. Motivated by the hierarchical structure of clusters, JULE [110] achieves agglomerative deep clustering by progressively merging clusters and optimizing the features.

3.2 Distribution Prior

The distribution prior refers to the assumption that instances of different semantics follow distinct data distributions. Such a prior gives rise to the generative deep clustering paradigm, which employs the variational autoencoder [43] (VAE) and the generative adversarial network [24] (GAN) to learn the underlying distribution. Instances generated from similar distributions are then grouped together to achieve clustering.

VaDE [40] is the first deep generative clustering method, which models different data distributions by fitting a Gaussian mixture model (GMM) in the latent space. To generate an instance, VaDE first samples a cluster $c$ from the prior $p(c)$, then draws a latent vector $\mathbf{z}$ from $p(\mathbf{z}\mid c)$, and finally reconstructs the instance in the input space via $p(\mathbf{x}\mid\mathbf{z})$. The cluster assignment and neural network are jointly optimized by maximizing the log-likelihood of each instance, i.e.,

$$\log p(\mathbf{x})=\log\int_{\mathbf{z}}\sum_{c}p(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z}\mid c)\,p(c)\,\mathrm{d}\mathbf{z}. \tag{2}$$

Since directly computing Eq. 2 is intractable, the objective is approximated by the evidence lower bound (ELBO) from variational inference, namely,

$$\mathcal{L}=\mathbb{E}_{q(\mathbf{z},c\mid\mathbf{x})}\left[\log\frac{p(\mathbf{x},\mathbf{z},c)}{q(\mathbf{z},c\mid\mathbf{x})}\right], \tag{3}$$

where $q(\mathbf{z},c\mid\mathbf{x})$ is the variational posterior, which approximates the true posterior. The reparameterization trick introduced in VAE [43] is adopted to make the sampling process differentiable.
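
For readers less familiar with variational inference, the step from Eq. 2 to Eq. 3 can be spelled out as follows. This is the standard Jensen's-inequality argument rather than anything specific to VaDE, and it uses the factorization $p(\mathbf{x},\mathbf{z},c)=p(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z}\mid c)\,p(c)$ implied by the generative process above:

$$
\begin{aligned}
\log p(\mathbf{x})
&=\log\int_{\mathbf{z}}\sum_{c}q(\mathbf{z},c\mid\mathbf{x})\,\frac{p(\mathbf{x},\mathbf{z},c)}{q(\mathbf{z},c\mid\mathbf{x})}\,\mathrm{d}\mathbf{z}\\
&\geq\int_{\mathbf{z}}\sum_{c}q(\mathbf{z},c\mid\mathbf{x})\,\log\frac{p(\mathbf{x},\mathbf{z},c)}{q(\mathbf{z},c\mid\mathbf{x})}\,\mathrm{d}\mathbf{z}
=\mathbb{E}_{q(\mathbf{z},c\mid\mathbf{x})}\left[\log\frac{p(\mathbf{x},\mathbf{z},c)}{q(\mathbf{z},c\mid\mathbf{x})}\right].
\end{aligned}
$$

The gap between the two sides equals the Kullback-Leibler divergence $\mathrm{KL}\left(q(\mathbf{z},c\mid\mathbf{x})\,\|\,p(\mathbf{z},c\mid\mathbf{x})\right)$, so the bound is tight exactly when the variational posterior matches the true posterior.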

Figure 2: The framework of distribution prior based methods. In addition to the standard continuous latent variable $\mathbf{z}_{n}$, generative deep clustering methods further introduce a discrete variable $\mathbf{z}_{c}$ to capture the cluster information.

Though GMM can effectively distinguish distributions, Gaussian components have been shown to be redundant, which harms the discriminability between different clusters [27]. As an improvement, ClusterGAN and DCGAN [69, 82] adopt GAN to implicitly learn the latent distributions. Specifically, in addition to the continuous latent variable $\mathbf{z}_{n}$, ClusterGAN introduces a one-hot encoding $\mathbf{z}_{c}$ to capture the cluster distribution during generation. The objective function of ClusterGAN is formulated as follows:

$$
\begin{aligned}
\mathcal{L}={}&\underset{\mathbf{x}\sim p_{X}(\mathbf{x})}{\mathbb{E}}q(\mathcal{D}(\mathbf{x}))+\underset{\mathbf{z}\sim\mathbb{P}_{\mathbf{z}}}{\mathbb{E}}q(1-\mathcal{D}(\mathcal{G}(\mathbf{z})))\\
&+\beta_{n}\underset{p_{\mathcal{Z}}(\mathbf{z})}{\mathbb{E}}\left\|\mathbf{z}_{n}-\mathcal{E}\left(\mathcal{G}\left(\mathbf{z}_{n}\right)\right)\right\|_{2}^{2}\\
&+\beta_{c}\underset{p_{\mathcal{Z}}(\mathbf{z})}{\mathbb{E}}\mathcal{H}\left(\mathbf{z}_{c},\mathcal{E}\left(\mathcal{G}\left(\mathbf{z}_{c}\right)\right)\right),
\end{aligned}
\tag{4}
$$

where $\mathbf{z}=(\mathbf{z}_{n},\mathbf{z}_{c})$ is the mixed latent variable, $\mathcal{G}$ and $\mathcal{D}$ here denote the generator and discriminator, $q(\cdot)$ is the quality function of the GAN objective (e.g., $\log$ in the standard GAN), $\mathcal{E}$ is the inverse network which maps data from the raw space to the latent space, $\mathcal{H}\left(\cdot,\cdot\right)$ denotes the cross-entropy, and $\beta_{n}$, $\beta_{c}$ are the weight parameters. The first two terms are consistent with the standard GAN. The last two clustering-specific terms encourage a more distinct cluster distribution and allow inputs to be mapped back to the latent space to achieve clustering.
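
To make the mixed latent variable concrete, the sketch below samples $\mathbf{z}=(\mathbf{z}_n,\mathbf{z}_c)$ as a Gaussian part concatenated with a one-hot cluster code, and shows how a cluster index could be read off the cluster part of an encoded input. This is only an illustrative sketch: the function names, the uniform choice of the cluster index, and the noise scale `sigma=0.1` are assumptions made for the example, not details given in the text above.

```python
import random


def sample_mixed_latent(dim_n, num_clusters, sigma=0.1, rng=random):
    """Draw z = (z_n, z_c): a Gaussian part plus a one-hot cluster code.

    sigma and the uniform cluster prior are assumed values for illustration.
    """
    z_n = [rng.gauss(0.0, sigma) for _ in range(dim_n)]
    c = rng.randrange(num_clusters)  # uniformly sampled cluster index
    z_c = [1.0 if k == c else 0.0 for k in range(num_clusters)]
    return z_n, z_c


def cluster_from_code(code):
    """Read a cluster index off a (soft) cluster code, e.g. the z_c part of E(x)."""
    return max(range(len(code)), key=code.__getitem__)


rng = random.Random(0)
z_n, z_c = sample_mixed_latent(dim_n=4, num_clusters=3, rng=rng)
z = z_n + z_c  # concatenated latent that would be fed to the generator
print(len(z), cluster_from_code(z_c))
```

In this reading, clustering a real instance amounts to encoding it with the inverse network and taking the largest entry of the recovered cluster code, which is what the cross-entropy term in Eq. 4 trains the inverse network to recover.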

Figure 3: The framework of augmentation invariance based methods. Diverse transformations are first applied to augment the input data $x$, after which the shared deep neural network is utilized to extract features. The augmented samples of the same instance are encouraged to have similar features and cluster assignments.

3.3 Augmentation Invariance

In recent years, image augmentation methods [93] have gained widespread attention, grounded in the prior that augmentations of the same instance could preserve consistent semantic information. This augmentation-invariance character inspires exploration of how to leverage the positive pairs (i.e., different augmentations of the same image) with similar semantic information. Notably, mutual-information-based methods and contrastive-learning-based methods have emerged as pioneers in the realm of deep clustering. In this section, we delve into the fundamental concepts and related works of both mutual-information-based and contrastive-learning-based methods.

First, mutual information is a measure of dependence between two continuous random variables $X$ and $Y$. Formally,

$$I(X;Y)=\int_{Y}\int_{X}p(x,y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right)\mathrm{d}x\,\mathrm{d}y, \tag{5}$$

where $p(x,y)$ is the joint probability density function of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability density functions of $X$ and $Y$, respectively. In the context of information theory, maximizing the mutual information between variables of positive instances can help the model capture clustering-related information.

IMSAT [31] is a typical information-theoretic approach to deep clustering. Its core idea is to enforce invariance on pair-wise augmented instances while achieving unambiguous and uniform cluster assignments. Specifically, IMSAT encourages the predictions for augmented instances to closely match those for the original instances, i.e.,

$$\mathcal{L}=-\sum_{i,k}\mathbf{p}_{ik}\log\mathbf{p}^{\prime}_{ik}, \tag{6}$$

where $\mathbf{p}^{\prime}$ denotes the predictions for the augmented instances. This term can be viewed as maximizing the mutual information between data and its augmentations. Besides, inspired by RIM [44], IMSAT implements regularized information maximization for deep clustering to keep the cluster assignments unambiguous and uniform. Specifically, IMSAT seeks to maximize the mutual information between instances and their cluster assignments, expressed as:

$$
\begin{aligned}
I(X;Y)&=H(Y)-H(Y\mid X)\\
&=-\sum_{k}\mathbf{p}_{\cdot k}\log\mathbf{p}_{\cdot k}+\frac{1}{N}\sum_{i,k}\mathbf{p}_{ik}\log\mathbf{p}_{ik},
\end{aligned}
\tag{7}
$$

where $H(\cdot)$ and $H(\cdot\mid\cdot)$ denote the entropy and conditional entropy, and $\mathbf{p}_{\cdot k}=\frac{1}{N}\sum_{i}\mathbf{p}_{ik}$. Increasing the first term (the marginal entropy $H(Y)$) encourages uniform cluster assignments, i.e., the number of instances in each cluster tends to be the same. Conversely, decreasing the conditional entropy $H(Y\mid X)$ encourages each instance to be unambiguously assigned to a certain cluster.
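
The two effects of Eq. 7 can be checked numerically. The sketch below evaluates $H(Y)-H(Y\mid X)$ from an $N\times C$ matrix of soft assignments; it is a minimal plain-Python illustration with hypothetical function names and toy matrices, not the IMSAT implementation.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def assignment_mutual_information(P):
    """Evaluate I(X;Y) = H(Y) - H(Y|X) (Eq. 7) from N x C soft assignments."""
    n = len(P)
    num_clusters = len(P[0])
    # Marginal p_{.k}: average assignment mass received by cluster k.
    marginal = [sum(row[k] for row in P) / n for k in range(num_clusters)]
    h_y = entropy(marginal)  # large when clusters are balanced
    h_y_given_x = sum(entropy(row) for row in P) / n  # small when rows are confident
    return h_y - h_y_given_x


confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
ambiguous = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

for name, P in [("confident_balanced", confident_balanced),
                ("ambiguous", ambiguous),
                ("collapsed", collapsed)]:
    print(name, abs(round(assignment_mutual_information(P), 4)))
```

Only the confident and balanced assignment reaches the maximum value $\ln 2$ for two clusters; the ambiguous assignment is penalized through $H(Y\mid X)$ and the collapsed one through $H(Y)$, so both evaluate to zero.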

IIC [39] and Completer [57, 58] take a further step in exploring the mutual information between instances and their augmentations. The fundamental concept is to maximize the mutual information between the cluster assignments of pair-wise augmented instances. Specifically, IIC achieves semantically meaningful clustering and avoids trivial solutions by maximizing the mutual information between the cluster assignments,

$$
\begin{aligned}
\mathcal{L}&=I\left(Z;Z^{\prime}\right)=\sum_{i=1}^{N}I\left(\mathbf{z}_{i};\mathbf{z}_{i}^{\prime}\right)=I(\mathbf{P})\\
&=\sum_{c=1}^{C}\sum_{c^{\prime}=1}^{C}\mathbf{P}_{cc^{\prime}}\cdot\ln\frac{\mathbf{P}_{cc^{\prime}}}{\mathbf{P}_{c}\cdot\mathbf{P}_{c^{\prime}}},
\end{aligned}
\tag{8}
$$

where $\mathbf{z}$ and $\mathbf{z}^{\prime}$ are the representations (here, $C$-dimensional soft cluster assignments) of the original instance $\mathbf{x}$ and its augmentation $\mathbf{x}^{\prime}$, respectively. The joint distribution of $\mathbf{z}$ and $\mathbf{z}^{\prime}$ is given by the matrix $\mathbf{P}\in\mathbb{R}^{C\times C}$, computed as

$$\mathbf{P}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_{i}\cdot\left(\mathbf{z}_{i}^{\prime}\right)^{\top}, \tag{9}$$

where $\mathbf{P}_{cc^{\prime}}=P\left(z=c,z^{\prime}=c^{\prime}\right)$ denotes the element in the $c$-th row and $c^{\prime}$-th column. Additionally, the marginals $\mathbf{P}_{c}=P(z=c)$ and $\mathbf{P}_{c^{\prime}}=P\left(z^{\prime}=c^{\prime}\right)$ can be obtained by summing over the rows and columns of this matrix. Notably, IIC is one of the earliest deep clustering frameworks designed entirely under the framework of information theory, which distinguishes it from IMSAT.
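
Eqs. 8 and 9 translate almost line by line into code. The sketch below builds the $C\times C$ joint matrix from paired soft assignments and evaluates its mutual information. It is a minimal plain-Python illustration under stated assumptions: the function names and toy inputs are hypothetical, the inputs are assumed to be rows of probabilities, and practical refinements (such as symmetrizing $\mathbf{P}$ or mini-batch estimation) are omitted.

```python
import math


def joint_matrix(Z, Z_aug):
    """Eq. 9: average outer product of paired soft assignments (C x C)."""
    n, c = len(Z), len(Z[0])
    P = [[0.0] * c for _ in range(c)]
    for z, z_aug in zip(Z, Z_aug):
        for a in range(c):
            for b in range(c):
                P[a][b] += z[a] * z_aug[b] / n
    return P


def mutual_information(P, eps=1e-12):
    """Eq. 8: sum over (c, c') of P_cc' * ln(P_cc' / (P_c * P_c'))."""
    row_marginal = [sum(r) for r in P]           # P_c  = P(z = c)
    col_marginal = [sum(col) for col in zip(*P)]  # P_c' = P(z' = c')
    mi = 0.0
    for a, r in enumerate(P):
        for b, p in enumerate(r):
            if p > eps:  # zero entries contribute nothing to the sum
                mi += p * math.log(p / (row_marginal[a] * col_marginal[b]))
    return mi


# Two instances assigned consistently to two different clusters.
Z_good = [[1.0, 0.0], [0.0, 1.0]]
# Trivial solution: every instance is assigned to the same cluster.
Z_trivial = [[1.0, 0.0], [1.0, 0.0]]

print(round(mutual_information(joint_matrix(Z_good, Z_good)), 4))
print(round(mutual_information(joint_matrix(Z_trivial, Z_trivial)), 4))
```

The consistent, balanced assignment attains $\ln 2$, whereas the collapsed assignment yields zero mutual information, which illustrates why maximizing Eq. 8 discourages the trivial solution.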

Similar to mutual-information-based methods, contrastive-learning-based methods treat instances augmented from the same instance as positive samples and the rest as negative samples. Let 𝐳2isubscript𝐳2𝑖\mathbf{z}_{2i}bold_z start_POSTSUBSCRIPT 2 italic_i end_POSTSUBSCRIPT and 𝐳2i1subscript𝐳2𝑖1\mathbf{z}_{2i-1}bold_z start_POSTSUBSCRIPT 2 italic_i - 1 end_POSTSUBSCRIPT represent two augmented representation of the i𝑖iitalic_i-th instance, the contrastive loss is formulated as:

\begin{align}
\mathcal{L} &= \sum_{i=1}^{N}\left(\ell\left(2i,2i-1\right)+\ell\left(2i-1,2i\right)\right), \tag{10}\\
\ell\left(i,j\right) &= -\log\frac{\exp\left(\operatorname{s}\left(\mathbf{z}_{i},\mathbf{z}_{j}\right)/\tau\right)}{\sum_{k=1}^{2N}\mathbf{1}_{[k\neq i]}\exp\left(\operatorname{s}\left(\mathbf{z}_{i},\mathbf{z}_{k}\right)/\tau\right)}, \notag
\end{align}

where $\ell\left(i,j\right)$ is the pairwise contrastive loss and $\tau$ is the temperature of the softmax. The function $\operatorname{s}\left(\mathbf{z}_{i},\mathbf{z}_{j}\right)$ measures the similarity between representations $\mathbf{z}_{i}$ and $\mathbf{z}_{j}$. This loss pulls the representations of positive instances together while pushing them away from negative instances, which encourages meaningful clustering patterns.
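The instance-level loss in Eq. 10 can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions: the similarity $\operatorname{s}$ is taken to be cosine similarity, the two views of each instance occupy adjacent rows, and the function name and default temperature are hypothetical.

```python
import numpy as np

def instance_contrastive_loss(z, tau=0.5):
    """z: (2N, d). Rows 2i and 2i+1 (0-based) hold the two views of instance i."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity (assumption)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # implements the indicator [k != i]
    # Numerically stable log of the denominator.
    m = sim.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    idx = np.arange(z.shape[0])
    partner = idx ^ 1                                  # 0<->1, 2<->3, ...
    per_anchor = log_denom - sim[idx, partner]         # l(i, j) for every anchor i
    return float(per_anchor.sum())                     # sum over both directions, as in Eq. 10

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
views = np.repeat(base, 2, axis=0) + 0.05 * rng.normal(size=(8, 16))
print(instance_contrastive_loss(views))
```

Summing over all $2N$ anchors is the same as summing $\ell(2i,2i-1)+\ell(2i-1,2i)$ over the $N$ instances.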

Notably, some theoretical works [78, 68, 59] have demonstrated that contrastive learning is equivalent to maximizing mutual information at the instance level. Motivated by this observation, researchers have further applied the contrastive loss at the cluster level, which proves beneficial for deep clustering. PICA [32] is one of the pioneering works in this direction. Its fundamental idea is to maximize the similarity between the cluster assignments of the original and augmented data, which can be viewed as conducting contrastive learning [60] at the cluster level. Motivated by PICA, CC [52] and DRC [123] conduct contrastive learning at both the instance level and the cluster level. Specifically, the cluster-level contrastive loss helps learn discriminative cluster assignments, which are the key to the clustering task. Formally, the cluster-level contrastive loss is

\begin{align}
\mathcal{L} &= \frac{1}{2C}\sum_{i=1}^{C}\left(\ell\left(2i-1,2i\right)+\ell\left(2i,2i-1\right)\right)-H(\mathbf{Y}), \tag{11}\\
\ell\left(i,j\right) &= -\log\frac{\exp\left(\operatorname{s}\left(\mathbf{y}_{i},\mathbf{y}_{j}\right)/\tau\right)}{\sum_{k=1}^{2C}\mathbf{1}_{[k\neq i]}\exp\left(\operatorname{s}\left(\mathbf{y}_{i},\mathbf{y}_{k}\right)/\tau\right)}, \notag
\end{align}

where $\mathbf{y}_{i}\in\mathbb{R}^{1\times N}$ is the cluster-level assignment, i.e., the soft assignment of the $N$ instances to one cluster under one augmentation, and $\tau$ is the cluster-level temperature parameter. $H(\mathbf{Y})=H(\mathbf{Y}^{1})+H(\mathbf{Y}^{2})$ is the entropy of the cluster assignment probabilities under the two augmentations. Including $H(\mathbf{Y})$ helps avoid the trivial solution in which most instances are assigned to the same cluster. Notably, the use of contrastive learning at the cluster level in CC and DRC has inspired subsequent works in the field.
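A hedged NumPy sketch of the cluster-level loss in Eq. 11 is given below. It assumes cosine similarity for $\operatorname{s}$, stacks the $C$ cluster vectors of the first view on top of those of the second view (instead of interleaving them as in the indexing of Eq. 11), and computes each entropy from the batch-averaged assignment; the function names are illustrative.

```python
import numpy as np

def entropy(prob):
    prob = np.clip(prob, 1e-12, 1.0)
    return float(-(prob * np.log(prob)).sum())

def cluster_contrastive_loss(p1, p2, tau=1.0):
    """p1, p2: (N, C) soft assignments of one batch under two augmentations."""
    C = p1.shape[1]
    y = np.concatenate([p1.T, p2.T], axis=0)           # (2C, N): one row per cluster and view
    y = y / np.linalg.norm(y, axis=1, keepdims=True)   # cosine similarity (assumption)
    sim = y @ y.T / tau
    np.fill_diagonal(sim, -np.inf)                     # indicator [k != i]
    m = sim.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    idx = np.arange(2 * C)
    partner = (idx + C) % (2 * C)                      # same cluster, other view
    contrast = (log_denom - sim[idx, partner]).sum() / (2 * C)
    # H(Y) = H(Y^1) + H(Y^2), entropy of the average assignment per view.
    h = entropy(p1.mean(axis=0)) + entropy(p2.mean(axis=0))
    return float(contrast - h)

balanced = np.eye(3)[[0, 0, 1, 1, 2, 2]]               # two instances per cluster
collapsed = np.tile([0.98, 0.01, 0.01], (6, 1))        # nearly everything in cluster 0
print(cluster_contrastive_loss(balanced, balanced))
print(cluster_contrastive_loss(collapsed, collapsed))
```

The collapsed assignment receives the larger loss, which illustrates how the entropy term penalizes the trivial solution.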

TCC [92] takes a step further by exploring the interaction between instance-level and cluster-level representations. The core idea is to build a unified representation that combines cluster semantics with instance features, so that cluster information enhances the representation and facilitates clustering. Formally, for an instance representation $\mathbf{z}_{i}$, the enhanced representation is given by:

\begin{equation}
\hat{\mathbf{z}}_{i}=\left(\mathbf{z}_{i}+\operatorname{NN}_{\theta}\left(\mathbf{c}_{i}\right)\right)/\left\|\mathbf{z}_{i}+\operatorname{NN}_{\theta}\left(\mathbf{c}_{i}\right)\right\|_{2}, \tag{12}
\end{equation}

where $\mathbf{c}_{i}$ represents the cluster assignment of the $i$-th instance after Gumbel Softmax. $\operatorname{NN}_{\theta}\left(\cdot\right)$ denotes a single fully connected network, which serves as the learnable cluster representation. Different from CC, which applies the contrastive loss to cluster assignments, TCC applies the contrastive loss to the unified representation to better capture cluster semantics. Inspired by TCC, some works [108, 50] explore the fusion of instance-level and cluster-level representations in various domains and then apply the contrastive loss to the unified representation, further exploring its effectiveness.
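The construction in Eq. 12 can be illustrated with the following NumPy sketch. It assumes that $\operatorname{NN}_{\theta}$ is a single affine layer with hypothetical parameters `W` and `b`, and it uses a plain (soft) Gumbel Softmax; gradients and training are out of scope here.

```python
import numpy as np

def gumbel_softmax(logits, rng, temperature=1.0):
    """Soft Gumbel Softmax sample for each row of logits."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                            # Gumbel(0, 1) noise
    y = (logits + g) / temperature
    e = np.exp(y - y.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def unified_representation(z, c, W, b):
    """Eq. 12 with NN_theta(c) = c @ W + b (a single linear layer, assumed)."""
    fused = z + c @ W + b
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 4))                            # instance representations
c = gumbel_softmax(rng.normal(size=(5, 3)), rng)       # cluster assignments, 3 clusters
W, b = rng.normal(size=(3, 4)), np.zeros(4)            # hypothetical layer parameters
z_hat = unified_representation(z, c, W, b)
print(np.linalg.norm(z_hat, axis=1))
```

Each row of `W` plays the role of a learnable cluster embedding, so `c @ W` mixes the cluster embeddings according to the assignment of each instance.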

Figure 4: The framework of neighborhood consistency-based methods. Such a paradigm encourages neighboring samples $z_{i}$ and $z_{p}$ in the latent space to have consistent features and cluster assignments, which improves the compactness of clusters.

3.4 Neighborhood Consistency

Thanks to the advancements in self-supervised representation learning, the features acquired through discriminative pretext tasks can unveil high-level semantics in the latent space. This provides a crucial prior for clustering, as instances and their neighborhoods in the latent space are likely to belong to the same semantic cluster. Leveraging neighborhood-consistent semantics can further enhance clustering.

SCAN [97] first observes that self-supervised pretext tasks map similar instances close to each other in the latent space. Motivated by this observation, SCAN trains a cluster head by enforcing consistent cluster assignments among neighbors. Specifically, SCAN first obtains an encoder $f$ through a pretext task [23, 119, 104, 30]. It then optimizes a cluster head $h$ by requiring it to make consistent predictions for instances and their nearest neighbors:

\begin{equation}
\mathcal{L}=-\frac{1}{B}\sum_{i=1}^{B}\sum_{j\in\mathcal{N}^{k}_{i}}\log\langle\mathbf{p}_{i},\mathbf{p}_{j}\rangle-\lambda H(Y). \tag{13}
\end{equation}

Here $\mathcal{N}^{k}_{i}$ denotes the $k$-nearest neighbors of the $i$-th instance, $\mathbf{p}_{i}$ is the cluster assignment predicted by $h$ for the $i$-th instance, and $B$ is the number of instances over which the loss is averaged. The second term in Eq. 13 is the entropy regularizer also used in Eq. 11, which prevents $h$ from assigning all instances to a single cluster.
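A minimal NumPy sketch of Eq. 13 follows. It assumes that the neighbor indices point into the same batch and that $H(Y)$ is the entropy of the batch-averaged prediction; the function name and the default weight `lam` are illustrative choices, not values from the SCAN paper.

```python
import numpy as np

def neighbor_consistency_loss(p, neighbors, lam=1.0, eps=1e-12):
    """p: (B, C) softmax outputs; neighbors: (B, k) integer indices into the batch."""
    dots = np.einsum("ic,ikc->ik", p, p[neighbors])    # <p_i, p_j> for each neighbor j
    consistency = -np.log(dots + eps).sum() / p.shape[0]
    mean_p = p.mean(axis=0)
    entropy = -(mean_p * np.log(mean_p + eps)).sum()   # H(Y) on the average prediction
    return float(consistency - lam * entropy)

p = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
same_cluster = np.array([[1], [0], [3], [2]])          # neighbors agree with the anchor
other_cluster = np.array([[2], [3], [0], [1]])         # neighbors disagree
print(neighbor_consistency_loss(p, same_cluster))
print(neighbor_consistency_loss(p, other_cluster))
```

When neighbors share the anchor's assignment, the first term vanishes and the loss reduces to the negative entropy of a balanced prediction.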

NNM [15] and GCC [124] incorporate neighbor information into the framework of contrastive learning to group instances within neighborhoods. In particular, NNM aligns the clustering assignment of an instance with its neighbors through cluster-level contrastive learning:

\begin{equation}
\mathcal{L}=-\frac{1}{C}\sum_{i=1}^{C}\log\frac{\exp(\operatorname{s}(\mathbf{q}_{i},\mathbf{q}_{\mathcal{N}i}))}{\sum_{j=1}^{C}\exp(\operatorname{s}(\mathbf{q}_{i},\mathbf{q}_{j}))}, \tag{14}
\end{equation}

where $\mathbf{q},~\mathbf{q}_{\mathcal{N}}\in\mathbb{R}^{C\times B}$ are the transposes of $\mathbf{p}$ and $\mathbf{p}_{\mathcal{N}}$, the assignment matrices of a batch of instances and of their neighbors, respectively. In contrast, GCC introduces the graph structure of the latent space to modify the vanilla instance-level contrastive loss. It constructs a normalized symmetric graph Laplacian $\mathbf{L}$ based on the $k$-NN graph:

\begin{align}
\mathbf{L} &= \mathbf{I}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}, \tag{15}\\
\text{with } \mathbf{A}_{ij} &= \begin{cases}1, & \text{if } j\in\mathcal{N}^{k}_{i}\text{ or } i\in\mathcal{N}^{k}_{j}\\ 0, & \text{otherwise}\end{cases}. \notag
\end{align}

Here $\mathbf{A}$ is the adjacency matrix of the graph and $\mathbf{D}$ is its diagonal degree matrix. The loss function is then given by:

\begin{equation}
\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{\sum_{\mathbf{L}_{ij}<0}-\mathbf{L}_{ij}\exp(\operatorname{s}(\mathbf{z}_{i},\mathbf{z}_{j})/\tau)}{\sum_{\mathbf{L}_{ij}=0}\exp(\operatorname{s}(\mathbf{z}_{i},\mathbf{z}_{j})/\tau)}, \tag{16}
\end{equation}

where $\tau$ is the temperature. The graph Laplacian guides the model to attract instances within neighborhoods rather than only augmentations of the same instance, so that the influence of potential false negative samples [114, 112] can be mitigated. As a result, GCC can better minimize the intra-cluster variance and maximize the inter-cluster variance. The success of this approach has inspired numerous contrastive learning methods [38, 63] in various domains to leverage neighbor relationships to address the false negative challenge.
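Eqs. 15 and 16 can be sketched in NumPy as follows, assuming L2-normalized features, dot-product similarity, and a batch in which every instance has at least one non-neighbor. The function names and the toy data are hypothetical and do not reproduce the GCC implementation.

```python
import numpy as np

def knn_graph_laplacian(z, k):
    """Eq. 15 for L2-normalized features z of shape (n, d)."""
    n = z.shape[0]
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                     # an instance is not its own neighbor
    nn = np.argsort(-sim, axis=1)[:, :k]               # indices of the k nearest neighbors
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)                             # j in N_i^k or i in N_j^k
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))          # D^{-1/2} as a vector
    return np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def graph_contrastive_loss(z, L, tau=0.5):
    """Eq. 16: neighbors (L_ij < 0) are positives, non-neighbors (L_ij = 0) negatives."""
    e = np.exp(z @ z.T / tau)
    pos = (np.where(L < 0, -L, 0.0) * e).sum(axis=1)
    neg = (np.where(L == 0, 1.0, 0.0) * e).sum(axis=1)
    return float(-np.log(pos / neg).mean())

# Two well-separated groups of unit vectors in the plane.
angles = np.deg2rad([0, 5, 10, 15, 90, 95, 100, 105])
z = np.stack([np.cos(angles), np.sin(angles)], axis=1)
L = knn_graph_laplacian(z, k=2)
print(graph_contrastive_loss(z, L))
```

The diagonal of `L` equals one, so an instance is excluded from both the numerator and the denominator of its own term.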

Figure 5: The framework of pseudo-labeling based methods. Given features in the latent space, clustering algorithms such as K-means are performed to get pseudo labels. The pseudo labels, usually filtered by confidence, are then used as supervision signals to guide clustering.

3.5 Pseudo-Labeling

As a prevalent paradigm of semi-supervised classification [48, 6, 94], pseudo-labeling has been extended to deep clustering in recent years. The fundamental assumption of pseudo-labeling is that the predictions on unlabeled data, especially the confident ones, can provide reliable supervision and guide model training. Motivated by this, recent deep clustering works leverage confident predictions to boost clustering performance.

DEC [106] is a pioneering work that utilizes labels generated by itself to simultaneously enhance feature representations and optimize clustering assignments. DEC is initialized with a pre-trained auto-encoder and $C$ learnable cluster centroids. The soft assignment is calculated using the Student's $t$-distribution, based on the distance between the representation $\mathbf{z}_{i}$ and the centroid $\mathbf{c}_{j}$:

\begin{equation}
\mathbf{q}_{ij}=\frac{(1+\|\mathbf{z}_{i}-\mathbf{c}_{j}\|^{2}/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{k}(1+\|\mathbf{z}_{i}-\mathbf{c}_{k}\|^{2}/\alpha)^{-\frac{\alpha+1}{2}}}, \tag{17}
\end{equation}

where $\alpha$ is a hyper-parameter (the degrees of freedom of the Student's $t$-distribution) and $\mathbf{q}_{ij}$ denotes the probability of assigning instance $i$ to cluster $j$. DEC refines the clusters by emphasizing high-confidence assignments and making predictions more confident. Specifically, DEC uses the second power of $\mathbf{q}_{i}$ as a sharpened assignment to guide the training, i.e.,

\begin{equation}
\mathbf{p}_{ij}=\frac{\mathbf{q}_{ij}^{2}/\operatorname{freq}_{j}}{\sum_{k}\mathbf{q}_{ik}^{2}/\operatorname{freq}_{k}}, \tag{18}
\end{equation}

where $\operatorname{freq}_{j}=\sum_{i}\mathbf{q}_{ij}$ is the soft cluster frequency; normalizing the sharpened assignment by $\operatorname{freq}_{j}$ prevents feature collapse. Finally, DEC minimizes the KL divergence between the two distributions, i.e., $\mathcal{L}=\operatorname{KL}(\mathbf{p}\,\|\,\mathbf{q})$.
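The three DEC quantities (Eqs. 17 and 18 and the KL objective) can be sketched in NumPy as below. The function names and the toy points are illustrative; in DEC the centroids and the encoder are updated by gradient descent on this loss, which is not shown.

```python
import numpy as np

def soft_assignment(z, centroids, alpha=1.0):
    """Eq. 17: Student's t kernel between representations and centroids."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Eq. 18: square q, divide by the soft cluster frequency, renormalize rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(p || q) summed over all instances and clusters."""
    return float((p * np.log(p / q)).sum())

z = np.array([[0.0, 0.0], [0.1, 0.0], [4.0, 0.0], [4.1, 0.0]])
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
q = soft_assignment(z, centroids)
p = target_distribution(q)
print(q.max(axis=1), p.max(axis=1), kl_divergence(p, q))
```

In this toy example the target `p` is more peaked than `q` for every instance, which is the sharpening effect the text describes.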

Another notable pseudo-labeling method is DeepCluster [8]. This approach applies K-means to the learned representations to obtain cluster assignments as pseudo-labels. DeepCluster alternates between representation learning and clustering so that the two steps bootstrap each other. However, its performance is limited, primarily because of the restricted semantics of the initial representation. Similar to DeepCluster, ProPos [36] proposes an EM framework for pseudo-labeling, which alternates between performing K-means to obtain pseudo-labels (E step) and updating the representation (M step). Notably, ProPos significantly outperforms DeepCluster and other methods because it performs K-means on features learned by BYOL [26], a state-of-the-art self-supervised paradigm. This observation demonstrates that the semantics of the representation are vital to pseudo-label generation and clustering. Low-quality features introduce noise into the pseudo-labels, which degrades subsequent pseudo-label generation and misleads representation learning, so that errors accumulate over the iterations.

In addition to the progression of self-supervised paradigms, researchers are actively investigating strategies to alleviate error accumulation in pseudo-labeling. Specifically, pseudo-labeling-based deep clustering faces two challenges: improving the accuracy of the generated pseudo-labels, since inaccurate pseudo-labels risk degrading clustering performance, and making effective use of these pseudo-labels for clustering.

The first challenge has been addressed by many works through carefully designed selection methods. For instance, SCAN [97] empirically observes that instances with highly confident predictions (i.e., $\max(\mathbf{p}_{i})\approx 1$) tend to be correctly clustered by the cluster head. Building on this insight, SCAN selects the instances with the most confident predictions as labeled data and fine-tunes the model with the cross-entropy loss,

\begin{align}
\mathcal{L} &= \frac{1}{|Y|}\sum_{i\in Y}-\tilde{y}_{i}\log(\mathbf{p}_{i}), \tag{19}\\
Y &= \left\{i\mid\text{conf}_{i}\geq\eta\right\},\ \text{with } \text{conf}_{i}=\max(\mathbf{p}_{i}), \notag
\end{align}

where $\eta$ is the threshold hyper-parameter used to filter out uncertain instances and $\tilde{y}_{i}$ is the pseudo-label of the $i$-th instance. TCL [53] and SPICE [77] have devised more effective selection strategies to improve the accuracy of pseudo-labels. Specifically, TCL selects the most confident predictions from each cluster $c$ as pseudo-labels:

\begin{align}
Y^{c} &= \left\{\operatorname{topK}(\text{conf}_{i})\mid\tilde{y}_{i}=c\right\}, \tag{20}\\
Y &= \bigcup_{c=1}^{C}Y^{c}, \notag
\end{align}

where $\operatorname{topK}(\cdot)$ returns the indices of the top $K$ confident instances and $\bigcup$ denotes the union of the pseudo-labels from all clusters. Here $K=\gamma N/C$ and $\gamma$ is the selection ratio. Cluster-wise selection yields more class-balanced pseudo-labels than threshold-based criteria, which improves the clustering performance, especially for challenging classes.
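The two selection rules (Eqs. 19 and 20) can be contrasted with a short NumPy sketch. The function names, the default values of `eta` and `gamma`, and the toy predictions are assumptions for illustration; for simplicity the cross-entropy is evaluated on the same predictions that produced the pseudo-labels, whereas the methods above evaluate it on augmented views.

```python
import numpy as np

def select_by_threshold(p, eta=0.99):
    """Eq. 19: indices whose confidence max(p_i) reaches the threshold eta."""
    return np.flatnonzero(p.max(axis=1) >= eta)

def select_top_k_per_cluster(p, gamma=0.5):
    """Eq. 20: the K = gamma * N / C most confident instances of each cluster."""
    n, C = p.shape
    k = int(gamma * n / C)
    conf, labels = p.max(axis=1), p.argmax(axis=1)
    selected = []
    for c in range(C):
        members = np.flatnonzero(labels == c)
        selected.append(members[np.argsort(-conf[members])[:k]])
    return np.concatenate(selected)

def pseudo_label_cross_entropy(p, idx, eps=1e-12):
    """Cross-entropy between one-hot pseudo-labels and predictions on the selected set."""
    labels = p[idx].argmax(axis=1)
    return float(-np.log(p[idx, labels] + eps).mean())

p = np.array([[0.995, 0.005], [0.9, 0.1], [0.7, 0.3], [0.6, 0.4],
              [0.4, 0.6], [0.3, 0.7], [0.2, 0.8], [0.45, 0.55]])
print(select_by_threshold(p))          # only cluster 0 passes the threshold
print(select_top_k_per_cluster(p))     # two instances from each cluster
```

In this toy batch the threshold rule selects nothing from the less confident cluster, while the cluster-wise rule keeps the selection balanced.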

SPICE introduces a prototype-based pseudo-labeling approach. Specifically, it first re-computes the centroid of each cluster using only the instances with confident predictions, and then re-assigns a new pseudo-label to each instance according to its similarity to the new centroids. Formally:

\begin{align}
\mathbf{c}_{c}^{\prime} &= \frac{1}{|Y^{c}|}\sum_{i\in Y^{c}}\mathbf{z}_{i}, \tag{21}\\
\tilde{y}_{i}^{\prime} &= \arg\max_{j}\operatorname{s}(\mathbf{z}_{i},\mathbf{c}^{\prime}_{j}). \notag
\end{align}

This operation helps mitigate the influence of potentially incorrect pseudo labels used in calculating centroids, which might accumulate errors in the iterative self-training process.
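A minimal NumPy sketch of the prototype-based re-labeling in Eq. 21 is shown below. It assumes that the confident set of each cluster consists of the `k` instances with the highest predicted probability for that cluster and that similarity is the dot product of L2-normalized features; both choices, as well as the function name, are illustrative.

```python
import numpy as np

def prototype_pseudo_labels(z, p, k):
    """z: (n, d) features; p: (n, C) predictions; k: confident instances per cluster."""
    C = p.shape[1]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # Centroid of cluster c from its k most confident instances (assumed selection rule).
    centroids = np.stack([z[np.argsort(-p[:, c])[:k]].mean(axis=0) for c in range(C)])
    # New pseudo-label: the most similar centroid.
    return (z @ centroids.T).argmax(axis=1)

z = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3],
              [0.0, 1.0], [0.1, 0.9], [0.3, 0.8]])
# Instances 2 and 5 are predicted into the wrong cluster with low confidence.
p = np.array([[0.99, 0.01], [0.9, 0.1], [0.4, 0.6],
              [0.01, 0.99], [0.1, 0.9], [0.6, 0.4]])
print(p.argmax(axis=1), prototype_pseudo_labels(z, p, k=2))
```

The two low-confidence mistakes do not contribute to the centroids, so both instances are re-assigned to the cluster whose prototype they resemble.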

To address the second challenge, i.e., making better use of the confident labels, TCL removes negative pairs that share the same pseudo-label from the contrastive loss, which prevents intra-class instances from being pushed apart (the false negative issue). Meanwhile, SPICE and TCL adopt semi-supervised classification techniques such as FixMatch [94], which enforce pseudo-label consistency on strong augmentations of the same instance. The strong results achieved by these works show the effectiveness of combining reliable pseudo-labeling methods with semi-supervised paradigms in clustering.

Figure 6: The framework of external knowledge based methods. Instead of mining internal priors from the samples themselves, such a paradigm seeks external information like textual semantics to help distinguish the given samples.

3.6 External Knowledge

Most clustering approaches focus on grouping data based on inherent characteristics such as structural priors, distribution priors, and augmentation invariance priors. Instead of pursuing internal priors from the data itself, some recent works [7, 54] attempt to introduce abundant external knowledge such as textual descriptions to guide clustering. These methods prove effective because the utilization of semantic information from natural language offers valuable supervisory signals that enhance the quality of clustering.

SIC [7] is one of the first works to incorporate external knowledge into clustering. Its fundamental idea is to generate image pseudo-labels from a textual space pre-trained by CLIP [83]. The process involves three main steps: i) Construction of semantic space: SIC selects meaningful texts resembling category names to build a semantic space. ii) Pseudo-labeling: pseudo-labels are generated using the text semantic centers $\mathbf{h}$ and image representations $\mathbf{z}_{i}$. Formally,

\begin{equation}
\mathbf{q}_{i}=\text{one-hot}\left(c,\arg\max_{l}\frac{\exp\left(\mathbf{z}_{i}^{T}\mathbf{h}_{l}\right)}{\sum_{l^{\prime}=1}^{c}\exp\left(\mathbf{z}_{i}^{T}\mathbf{h}_{l^{\prime}}\right)}\right), \tag{22}
\end{equation}

where $c$ is the number of semantic centers, $\mathbf{h}_{l}$ is the $l$-th semantic center, and the one-hot operator generates a $c$-bit one-hot vector. The pseudo-labels are used to guide the clustering, similar to SCAN [97],

\begin{equation}
\mathcal{L}=\frac{1}{n}\sum_{i=1}^{n}CE\left(\mathbf{q}_{i},\mathbf{p}_{i}\right), \tag{23}
\end{equation}

where $CE\left(\cdot\right)$ is the cross-entropy function. iii) Consistency learning: the clustering is further improved by enforcing consistency between the images and their neighbors in the image space,

\begin{equation}
\mathcal{L}=-\frac{1}{n}\sum_{i=1}^{n}\log\mathbf{p}_{i}^{T}\mathbf{p}_{j}, \tag{24}
\end{equation}

where $j$ is an instance index randomly selected from the nearest neighbors $\mathcal{N}_{k}\left(\mathbf{z}_{i}\right)$ of the $i$-th instance. Note that SIC essentially pulls image embeddings closer to embeddings in the semantic space, while ignoring the improvement of the text semantic embeddings.
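The pseudo-labeling and consistency steps (Eqs. 22 to 24) can be sketched in NumPy as follows. The embeddings here are toy vectors standing in for CLIP features, and the function names are hypothetical; this is not the SIC implementation.

```python
import numpy as np

def semantic_pseudo_labels(z, h):
    """Eq. 22: one-hot label of the most probable text semantic center."""
    logits = z @ h.T
    logits = logits - logits.max(axis=1, keepdims=True)
    prob = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return np.eye(h.shape[0])[prob.argmax(axis=1)]

def cross_entropy(q, p, eps=1e-12):
    """Eq. 23: mean cross-entropy between pseudo-labels q and predictions p."""
    return float(-(q * np.log(p + eps)).sum(axis=1).mean())

def neighbor_consistency(p, neighbors, rng, eps=1e-12):
    """Eq. 24: j is drawn at random from the nearest neighbors of instance i."""
    n = p.shape[0]
    j = neighbors[np.arange(n), rng.integers(neighbors.shape[1], size=n)]
    return float(-np.log((p * p[j]).sum(axis=1) + eps).mean())

z = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]])     # toy image embeddings
h = np.array([[1.0, 0.0], [0.0, 1.0]])                 # toy text semantic centers
q = semantic_pseudo_labels(z, h)
p = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])     # toy cluster-head predictions
neighbors = np.array([[2], [1], [0]])                  # one neighbor index per image
rng = np.random.default_rng(0)
print(q, cross_entropy(q, p), neighbor_consistency(p, neighbors, rng))
```

The softmax in `semantic_pseudo_labels` does not change the arg max; it is kept only to mirror the form of Eq. 22.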

Table 2: The summary of deep clustering methods from the perspective of prior knowledge.
\begin{tabular}{p{0.30\linewidth} p{0.22\linewidth} p{0.42\linewidth}}
\hline
Prior Knowledge & Method & Major Contribution \\
\hline
Structure Prior: inherent data structure reflects semantic relation & ABDC (2013) & optimize features and clustering assignment in an EM manner \\
 & DEN (2014), SpectralNet (2018) & extend spectral clustering from shallow to deep \\
 & PARTY (2016) & introduce the sparsity prior from subspace learning to deep clustering \\
 & JULE (2016) & extend agglomerative clustering from shallow to deep \\
 & DCC (2018) & propose relation matching to achieve non-parametric deep clustering \\
\hline
Distribution Prior: instances of different semantics follow distinct data distributions & VaDE (2016) & learn distinct cluster distributions by Gaussian mixture model \\
 & ClusterGAN (2019), DCGAN (2015) & implicitly learn cluster distribution with GAN \\
\hline
Augmentation Invariance: instance features are invariant to data augmentation & IMSAT (2017) & propose the invariance between pair-wise augmented samples \\
 & IIC (2019), Completer (2021) & propose the mutual information framework with respect to augmentation invariance \\
Augmentation Invariance: cluster assignments are invariant to data augmentation & PICA (2020) & explore invariance between cluster assignments of augmented samples \\
 & CC (2021), DRC (2020) & simultaneously explore augmentation invariance at instance and cluster level \\
 & TCC (2021) & leverage a unified representation that combines cluster semantics and instances \\
\hline
Neighborhood Consistency: neighboring instances have similar semantics & SCAN (2020) & impose consistent cluster assignments between neighboring instances \\
 & NNM (2021) & perform cluster-level contrastive learning between neighbors \\
 & GCC (2021) & perform instance- and cluster-level contrastive learning between neighbors \\
\hline
Pseudo Label: cluster assignments with high confidence are reliable & DEC (2016) & construct target cluster distribution via sharpening \\
 & DeepCluster (2018) & generate pseudo labels with K-means \\
 & SCAN (2020) & select high-confidence predictions and fine-tune the model with strongly augmented samples \\
 & SPICE (2022) & select pseudo labels with the help of prototypes and adopt semi-supervised learning to fine-tune the model \\
 & TCL (2022) & use pseudo labels to mitigate false negative pairs in contrastive learning \\
 & ProPos (2022) & use pseudo labels from K-means to increase cluster compactness \\
\hline
External Knowledge: abundant clustering-favorable knowledge exists in the open world & SIC (2023) & generate image pseudo labels from the textual space of a pre-trained vision-language model \\
 & TAC (2023b) & construct more discriminative text counterparts and perform cross-modal distillation to improve clustering \\
\hline
\end{tabular}

TAC [54] focuses on leveraging textual semantics to enhance the feature discriminability. Specifically, it retrieves a text counterpart among representative nouns for each image, which improves K-means performance without any additional training. Besides, TAC proposes a mutual distillation paradigm to incorporate the image and text modalities, which further improves the clustering performance. The cross-modal mutual distillation strategy is formulated as follows:

$$
\begin{aligned}
\mathcal{L} &= \sum_{i=1}^{C}\left(\mathcal{L}_{i}^{v \rightarrow t}+\mathcal{L}_{i}^{t \rightarrow v}\right), \\
\mathcal{L}_{i}^{v \rightarrow t} &= -\log \frac{\exp \left(\operatorname{sim}\left(\hat{\mathbf{q}}_{i}, \hat{\mathbf{p}}_{i}^{\mathcal{N}}\right) / \tau\right)}{\sum_{k=1}^{C} \exp \left(\operatorname{sim}\left(\hat{\mathbf{q}}_{i}, \hat{\mathbf{p}}_{k}^{\mathcal{N}}\right) / \tau\right)}, \\
\mathcal{L}_{i}^{t \rightarrow v} &= -\log \frac{\exp \left(\operatorname{sim}\left(\hat{\mathbf{p}}_{i}, \hat{\mathbf{q}}_{i}^{\mathcal{N}}\right) / \tau\right)}{\sum_{k=1}^{C} \exp \left(\operatorname{sim}\left(\hat{\mathbf{p}}_{i}, \hat{\mathbf{q}}_{k}^{\mathcal{N}}\right) / \tau\right)},
\end{aligned}
\tag{25}
$$

where $\tau$ is the softmax temperature parameter, $\hat{\mathbf{p}}_{i}, \hat{\mathbf{q}}_{i} \in \mathbb{R}^{1 \times N}$ are the $i$-th columns of the image and text assignment matrices, and $\hat{\mathbf{p}}_{i}^{\mathcal{N}}, \hat{\mathbf{q}}_{i}^{\mathcal{N}} \in \mathbb{R}^{1 \times N}$ are the $i$-th columns of the image and text random nearest neighbor matrices. The mutual distillation strategy has two advantages. On the one hand, it generates more discriminative clusters through the cluster-level contrastive loss. On the other hand, it encourages consistent clustering assignments between each sample and its cross-modal neighbors, which bootstraps the clustering performance in both modalities.
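The cluster-level loss above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the TAC implementation: the function names, the use of cosine similarity for $\operatorname{sim}$, and the default temperature `tau=0.5` are all hypothetical choices. Each assignment matrix is given as a list of $N$ rows (one soft assignment vector per sample), and the loss contrasts each cluster column in one modality against the neighbor columns of the other modality.

```python
import math


def column(matrix, k):
    """Return the k-th column of a row-major matrix (list of rows)."""
    return [row[k] for row in matrix]


def cosine(u, v):
    """Cosine similarity; assumed here as the sim(.,.) function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def directed_loss(anchor, neighbor, tau):
    """Sum over clusters i of -log softmax_i(sim(anchor_i, neighbor_k) / tau)."""
    num_clusters = len(anchor[0])
    total = 0.0
    for i in range(num_clusters):
        a = column(anchor, i)
        logits = [cosine(a, column(neighbor, k)) / tau for k in range(num_clusters)]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total


def mutual_distillation_loss(P, Q, P_nb, Q_nb, tau=0.5):
    """P, Q: image/text assignments; P_nb, Q_nb: assignments of random neighbors.

    tau=0.5 is an illustrative default, not a value taken from the paper.
    """
    loss_v_to_t = directed_loss(Q, P_nb, tau)  # text columns vs. image neighbors
    loss_t_to_v = directed_loss(P, Q_nb, tau)  # image columns vs. text neighbors
    return loss_v_to_t + loss_t_to_v
```

When both modalities and their neighbors agree on the assignments, every positive pair has similarity 1 and the loss is small; swapping the cluster columns of one neighbor matrix makes the positives the least similar pairs and the loss grows accordingly.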

4 Experiment

In this section, we introduce the evaluation of deep clustering. Briefly, we first present the evaluation metrics and common benchmarks. Then we analyze the results of the existing deep clustering methods.

4.1 Evaluation Metrics

For clustering evaluation, three metrics are commonly used to measure how well the predicted cluster assignments $\tilde{y}$ match the ground truth labels $y$: accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). Higher values correspond to better clustering performance. The three metrics are defined as follows:

  • ACC [1] is the fraction of instances whose predicted cluster matches the ground truth label:

    $$\operatorname{ACC}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left\{y_{i}=\tilde{y}_{i}\right\}, \tag{26}$$

    where Hungarian matching [46] is first applied to align the predicted cluster indices with the labels.

  • NMI [65] quantifies the mutual information between the predicted labels $\tilde{\mathbf{Y}}$ and the ground truth labels $\mathbf{Y}$:

    $$\operatorname{NMI}=\frac{I(\tilde{\mathbf{Y}} ; \mathbf{Y})}{\frac{1}{2}[H(\tilde{\mathbf{Y}})+H(\mathbf{Y})]}, \tag{27}$$

    where $H(\mathbf{Y})$ denotes the entropy of $\mathbf{Y}$ and $I(\tilde{\mathbf{Y}} ; \mathbf{Y})$ denotes the mutual information between $\tilde{\mathbf{Y}}$ and $\mathbf{Y}$.

  • ARI [37] is a chance-corrected normalization of the Rand index (RI), which measures the fraction of instance pairs on which the prediction and the ground truth agree, i.e., pairs placed in the same cluster by both or in different clusters by both:

    $$\operatorname{RI}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{C}_{N}^{2}}, \tag{28}$$

    where $\mathrm{TP}$ and $\mathrm{TN}$ refer to the number of true positive pairs and true negative pairs, and $\mathrm{C}_{N}^{2}=N(N-1)/2$ is the number of possible instance pairs. ARI is computed by applying the following normalization:

    $$\operatorname{ARI}=\frac{\operatorname{RI}-\mathbb{E}(\operatorname{RI})}{\max (\operatorname{RI})-\mathbb{E}(\operatorname{RI})}, \tag{29}$$

    where $\mathbb{E}(\operatorname{RI})$ denotes the expectation of RI.
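The three metrics can be computed directly from label counts. The following is a minimal standard-library sketch, with hypothetical function names, intended only to make the definitions concrete. Two simplifications are assumed: the label alignment for ACC is found by brute force over permutations, which stands in for Hungarian matching and is only feasible for a small number of clusters (in practice a solver such as `scipy.optimize.linear_sum_assignment` would typically be used), and ARI is computed in the pair-counting form of Hubert and Arabie [37] rather than by evaluating RI and its expectation separately.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one mapping from predicted to true labels."""
    true_ids = sorted(set(y_true))
    pred_ids = sorted(set(y_pred))
    # Pad with None so every predicted cluster can be mapped to something.
    true_ids += [None] * (max(len(true_ids), len(pred_ids)) - len(true_ids))
    counts = Counter(zip(y_pred, y_true))
    best = 0
    # Brute-force stand-in for Hungarian matching (small cluster counts only).
    for mapping in permutations(true_ids, len(pred_ids)):
        hits = sum(counts[(p, t)] for p, t in zip(pred_ids, mapping))
        best = max(best, hits)
    return best / len(y_true)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(y_true, y_pred):
    """NMI with the arithmetic mean of the two entropies as normalizer."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    n_true, n_pred = Counter(y_true), Counter(y_pred)
    mi = sum(
        c / n * math.log(c * n / (n_true[t] * n_pred[p]))
        for (t, p), c in joint.items()
    )
    normalizer = 0.5 * (entropy(y_true) + entropy(y_pred))
    return mi / normalizer if normalizer > 0 else 1.0


def adjusted_rand_index(y_true, y_pred):
    """ARI from pair counts of the contingency table."""
    def pair_count(counter):
        return sum(math.comb(c, 2) for c in counter.values())

    together_in_both = pair_count(Counter(zip(y_true, y_pred)))
    together_true = pair_count(Counter(y_true))
    together_pred = pair_count(Counter(y_pred))
    expected = together_true * together_pred / math.comb(len(y_true), 2)
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (together_in_both - expected) / (maximum - expected)
```

All three functions are invariant to how the predicted clusters are numbered, which is why a prediction that merely permutes the cluster indices of the ground truth still attains the maximum score.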

Table 3: A summary of datasets commonly used for deep clustering.
Dataset Split Samples Classes Image Size
CIFAR-10 Train+Test 60,000 10 32×32
CIFAR-100 Train+Test 60,000 20 32×32
STL-10 Train+Test 13,000 10 96×96
ImageNet-10 Train 13,000 10 96×96
ImageNet-Dogs Train 19,500 15 96×96
Tiny-ImageNet Train 100,000 200 64×64
ImageNet-1K Train 1,281,167 1000 224×224
Table 4: Clustering performance on five widely-used image clustering datasets. The SCAN entry in the neighborhood consistency group reports the clustering results using only the neighborhood consistency loss without the self-labeling step. † denotes using the train and test split for training and testing respectively, instead of using both splits for training and testing. Horizontal lines separate methods with different priors. From top to bottom are structure prior, distribution prior, augmentation invariance, neighborhood consistency, pseudo-labeling, and external knowledge.
Method CIFAR-10 CIFAR-100 STL-10 ImageNet-10 ImageNet-Dogs
ACC NMI ARI ACC NMI ARI ACC NMI ARI ACC NMI ARI ACC NMI ARI
K-means (1967) 22.9 8.7 4.9 13.0 8.4 2.8 19.2 12.5 6.1 24.1 11.9 5.7 - - -
JULE (2016) 27.2 19.2 13.8 13.7 10.3 3.3 27.7 18.2 16.4 30.0 17.5 13.8 13.8 5.4 2.8
DCGAN (2015) 31.5 26.5 17.6 15.1 12.0 4.5 29.9 22.7 16.2 31.3 18.6 14.2 17.8 9.8 7.3
IIC (2019) 61.7 51.3 41.1 25.7 22.5 11.7 59.6 49.6 39.7 - - - - - -
PICA (2020) 69.6 59.1 51.2 33.7 31.0 17.1 71.3 61.1 53.1 87.0 80.2 76.1 35.3 35.2 20.1
CC (2021) 79.0 70.5 63.7 42.9 43.1 26.6 85.0 76.4 72.6 89.3 85.9 82.2 42.9 44.5 27.4
TCC (2021) 90.6 79.0 73.3 49.1 47.9 31.2 81.4 73.2 68.9 89.7 84.8 82.5 59.5 55.4 41.7
SCAN (2020) 81.8 71.2 66.5 42.2 44.1 26.7 75.5 65.4 59.0 - - - - - -
NNM (2021) 83.7 73.7 69.4 45.9 48.0 30.2 76.8 66.3 59.6 - - - 58.6 60.4 44.9
GCC (2021) 85.6 76.4 72.8 47.2 47.2 30.5 78.8 68.4 63.1 90.1 84.2 82.2 52.6 49.0 36.2
DEC (2016) 30.1 25.7 16.1 18.5 13.6 5.0 35.9 27.6 18.6 38.1 28.2 20.3 19.5 12.2 7.9
DeepCluster (2018) 37.4 - - - - - 33.4 - - 18.9 - - - - -
SCAN (2020) 87.6 78.7 75.8 48.3 48.5 31.4 81.8 70.3 66.1 - - - 59.3 61.2 45.7
SPICE (2022) 83.8 73.4 70.5 46.8 44.8 29.4 90.8 81.7 81.2 92.1 82.8 83.6 64.6 57.2 47.9
TCL (2022) 88.7 81.9 78.0 53.1 52.9 35.7 86.8 79.9 75.7 89.5 87.5 83.7 64.4 62.3 51.6
ProPos (2022) 94.3 88.6 88.4 61.4 60.6 45.1 86.7 75.8 73.7 95.6 89.6 90.6 74.5 69.2 62.7
SIC (2023) 92.6 84.7 84.4 58.3 59.3 43.9 98.1 95.3 95.9 98.2 97.0 96.1 69.7 69.0 55.8
TAC (2023b) 92.3 84.1 83.9 60.7 61.1 44.8 98.2 95.5 96.1 99.2 98.5 98.3 83.0 80.6 72.2

4.2 Datasets

In the early stage, deep clustering methods were evaluated on relatively small and low-dimensional datasets (e.g., COIL-20 [72], YaleB [22]). Recently, with the rapid development of deep clustering methods, it has become more common to evaluate clustering performance on more complex and challenging datasets. There are five widely used benchmark datasets:

  • CIFAR-10 [45] consists of 60,000 colored images from 10 different classes including airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

  • CIFAR-100 [45] contains 100 classes grouped into 20 superclasses. Each image comes with a “fine” class label and a “coarse” superclass label.

  • STL-10 [13] contains 13,000 labeled images from 10 object classes. Besides, it provides 100,000 unlabeled images for self-supervised learning to enhance the clustering performance.

  • ImageNet-10 [9] is a subset of the ImageNet dataset [17]. It contains 10 classes, each with 1,300 high-resolution images.

  • ImageNet-Dogs [9] is another subset of ImageNet. It consists of images belonging to 15 dog breeds, making it suitable for fine-grained clustering tasks.

Apart from these, some recent works employ two more challenging large-scale datasets, Tiny-ImageNet [49] and ImageNet-1K [17], to evaluate effectiveness and efficiency. All of these datasets are summarized in Table 3.

4.3 Performance Comparisons

The clustering performance on five widely used datasets is shown in Table 4. Thanks to the feature extraction ability of deep neural networks, early deep clustering methods based on structure and distribution priors achieve better performance than the classic K-means (e.g., JULE reaches 27.2 ACC on CIFAR-10 versus 22.9 for K-means). Then, a series of contrastive clustering methods significantly improve the performance by introducing additional priors through data augmentation. After that, more advanced methods boost the performance by further considering neighborhood consistency (GCC compared with CC) and utilizing pseudo labels (SCAN with the self-labeling step compared with SCAN trained with only the neighborhood consistency loss). Notably, the performance gains of different priors are independent. For example, ProPos remarkably outperforms DEC and CC by additionally utilizing the augmentation invariance or pseudo-labeling priors, respectively. Very recently, external-knowledge-based methods have achieved state-of-the-art performance, which demonstrates the promise of this new deep clustering paradigm. In addition, clustering becomes more challenging when the number of categories grows (from CIFAR-10 to CIFAR-100) or the semantics become more complex (from CIFAR-10 to ImageNet-Dogs). Such results indicate that more challenging datasets, such as the full ImageNet-1K, should be used for benchmarking in future works.

5 Application in Vicinagearth

In this section, we explore some typical applications of deep clustering within the domain of Vicinagearth, a term crafted from the fusion of “Vicinage” and “Earth”. Vicinagearth represents the critical spatial expanse ranging from 1,000 meters below sea level (the depth at which sunlight ceases to penetrate) to 10,000 meters above sea level (the typical cruising altitude of commercial aircraft). This zone is of great importance as it encompasses the core regions of human activity, including areas of habitation and production. Recently, deep clustering has emerged as an indispensable analytical tool within Vicinagearth, instrumental in unveiling complex patterns and structures of data within the vicinal space. The diverse applications of deep clustering in this zone include anomaly detection, environmental monitoring, community detection, person re-identification, and more.

Anomaly Detection, also known as Outlier Detection [14] or Novelty Detection [20], attempts to identify abnormal instances or patterns. In the context of Vicinagearth, deep clustering proves valuable for analyzing sensor data obtained from diverse sources like underwater monitoring systems, aerial sensors, or ground-based sensors [10]. Through the analysis of the patterns and typical behaviors from the sensor data, the system becomes adept at detecting anomalies, which may signal security threats or irregular activities.

Environmental Monitoring involves the analysis of data collected from environmental sensors [105], such as monitoring air quality, water conditions, and geological factors. The primary goal is to ensure the health of ecosystems [103] and detect potential environmental threats, such as pollution events or natural disasters. Deep clustering techniques play a crucial role in grouping similar environmental patterns, facilitating the identification of abnormalities. This application contributes to real-time environmental monitoring [47], enhancing the ability to respond promptly to environmental challenges.

Community Detection [21, 41] involves evaluating how groups of nodes are clustered or partitioned and their tendency to strengthen or break apart within a network. In the context of Vicinagearth, this technique is applied to identify groups of species [70] that interact closely or share similar ecological niches. Deep clustering plays a pivotal role in the analysis of complex ecological networks [67], contributing to a deeper understanding of ecological communities and their dynamics.

Person Re-identification [102, 115] is a crucial task that involves recognizing and matching individuals across different camera views [113]. This technology plays a significant role in public safety and law enforcement initiatives, as it helps to monitor densely populated areas for potential threats or subjects on a watchlist. The integration of deep clustering algorithms has markedly improved the scalability and efficiency [109] of person re-identification systems, making it possible to handle large and dynamically changing crowds. Furthermore, the adaptability of deep clustering techniques broadens their use to include the monitoring of natural habitats and the tracking of wildlife in diverse and uncontrolled settings.

6 Future Challenges

Although existing works achieve remarkable performance, some practical challenges and emerging requirements have yet to be fully addressed. In this section, we delve into some future directions of modern deep clustering.

6.1 Fine-grained Clustering

The objective of fine-grained clustering is to discern subtle and intricate variations within data, which is particularly advantageous in research such as the identification of biological subspecies [55, 56]. The primary challenge is that fine-grained classes exhibit a high degree of similarity, where distinctions often lie in coloration, markings, shape, or other subtle characteristics. In such scenarios, traditional coarse-grained clustering priors frequently prove inadequate. For instance, the color and shape augmentations used by the augmentation invariance prior become ineffective, since they may alter exactly the cues that separate the classes. Recently, C3-GAN [42] employed contrastive learning within adversarial training to generate lifelike images, enabling the nuanced capture of fine-grained details and ensuring the separability between clusters.

6.2 Non-parametric Clustering

Many clustering methods require a predefined and fixed number of clusters. However, the number of clusters in real-world datasets is often unknown. Only a few works [11, 89, 122, 100] have been devoted to solving this problem. These methods often rely on calculating global similarity, which introduces huge computational costs, especially on large-scale datasets. Therefore, efficiently determining the optimal cluster number $C$ remains an open challenge, often involving the incorporation of human priors. Among existing works, DeepDPM introduces Dirichlet Process Gaussian Mixture Models (DPGMM) [3], which use the Dirichlet Process as the prior distribution over mixture components. DeepDPM dynamically adjusts the number of clusters $C$ through split and merge operations guided by the Metropolis-Hastings framework [29].
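To make the split and merge mechanism concrete, the sketch below shows only the generic Metropolis-Hastings acceptance rule applied to a cluster count: a proposal with Hastings ratio $H$ is accepted with probability $\min(1, H)$. It is an illustration under assumptions, not DeepDPM's procedure. The function names are hypothetical, and the Hastings ratios are taken as given inputs, since how they are derived from the model likelihood and the Dirichlet Process prior is model-specific and not covered here.

```python
import random


def accept(hastings_ratio, rng):
    """Metropolis-Hastings rule: accept a proposal with probability min(1, H)."""
    return rng.random() < min(1.0, hastings_ratio)


def adjust_cluster_count(num_clusters, split_ratios, merge_ratios, rng):
    """Apply accepted split (+1) and merge (-1) proposals to the cluster count.

    split_ratios and merge_ratios are hypothetical, precomputed Hastings
    ratios, one per proposal; computing them is outside this sketch.
    """
    for ratio in split_ratios:
        if accept(ratio, rng):
            num_clusters += 1
    for ratio in merge_ratios:
        # Never merge below a single cluster.
        if num_clusters > 1 and accept(ratio, rng):
            num_clusters -= 1
    return num_clusters
```

A proposal with ratio at least 1 is always accepted and one with ratio 0 is always rejected, so the cluster count drifts toward configurations that the ratios favor without the number of clusters being fixed in advance.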

6.3 Fair Clustering

Collecting real-world datasets from diverse sources with various acquisition methods can enhance the generalization of machine learning models. However, these datasets frequently manifest inherent biases, notably in sensitive attributes such as gender, race, and ethnicity. These biases can introduce disparities among individuals and minority groups, leading to cluster partitions that deviate from the underlying objective characteristics of the data. The pursuit of fairness is particularly pertinent in applications where unbiased and equitable analyses are crucial, such as employment, healthcare, and education. To tackle this challenge, fair clustering seeks to mitigate the influence of these biases given the sensitive attributes of each sample.

To address this daunting task, Chierichetti et al [12] first introduced a data pre-processing method known as fairlet decomposition. Recent advancements address this issue on large-scale data through adversarial training [51] and mutual information maximization [116]. Notably, Zeng et al design a novel metric that assesses both clustering quality and fairness from the perspective of information theory. Despite these developments, there is still room for improvement, and establishing better evaluation metrics remains an active research direction.
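As one concrete example of a fairness measure, the sketch below computes the balance of a clustering for a binary sensitive attribute, the notion used in the fairlet line of work [12]: the balance of a cluster is the smaller of the two group-count ratios, and the balance of a clustering is the minimum over its clusters. This is a minimal illustration with a hypothetical function name, restricted to exactly two groups by assumption, and it is not the information-theoretic metric of Zeng et al.

```python
from collections import Counter


def clustering_balance(cluster_ids, group_ids):
    """Minimum over clusters of min(#a / #b, #b / #a) for two groups a and b.

    Returns a value in [0, 1]; 1 means every cluster holds both groups in
    equal numbers, 0 means some cluster contains only one group.
    """
    groups = sorted(set(group_ids))
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two sensitive groups")
    group_a, group_b = groups
    counts = Counter(zip(cluster_ids, group_ids))
    balances = []
    for cluster in set(cluster_ids):
        a, b = counts[(cluster, group_a)], counts[(cluster, group_b)]
        balances.append(0.0 if min(a, b) == 0 else min(a / b, b / a))
    return min(balances)
```

Because the score is a minimum over clusters, a single cluster dominated by one group is enough to pull the balance of the whole partition down, regardless of how well mixed the remaining clusters are.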

6.4 Multi-view Clustering

Multi-view data [107, 62] is common in real-world situations where information is captured from a variety of sensors or observed from multiple angles. This data is inherently rich, offering diverse yet consistent information. For example, an RGB view provides color details while a depth view reveals spatial information, which illustrates the complementary aspects of the views. At the same time, the views are consistent to some degree, as the same object possesses common attributes across different views. To deal with multi-view data, multi-view clustering [16, 61] is proposed to exploit both the complementary and consistent characteristics. The goal is to integrate information from all views to produce a unified and insightful clustering result.

Over recent years, several deep learning approaches [2, 99, 121, 80] have been developed to address this challenge. Binary multi-view clustering [120] simultaneously refines binary cluster structures alongside discrete data representations, ensuring cohesive clustering. In pursuit of view consistency, Lin et al [57, 58] maximize mutual information across views, thus aligning common properties. SURE [114] aims to strengthen the consistency of shared features between views by utilizing a robust contrastive loss. Recently, Li et al [50] adopt a bound contrastive loss to preserve view complementarity at the cluster level. These methods demonstrate the significant strides made in multi-view analysis, where clustering continues to play a pivotal role in exploiting multi-view data.

7 Conclusion

The key to deep clustering, and to unsupervised learning in general, is to seek effective supervision to guide representation learning. Different from traditional taxonomies based on network structure or data type, this survey offers a comprehensive review from the perspective of prior knowledge. With the evolution of clustering technologies, there is a discernible trend shifting from exploring priors within the data itself to exploiting external knowledge such as natural language guidance. The exploration of external pre-trained models like ChatGPT or GPT-4V(ision) might emerge as a promising avenue. We hope this survey provides some valuable insight and inspires further exploration and advancements in deep clustering.

References

  • Amigó et al [2009] Amigó E, Gonzalo J, Artiles J, et al (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval 12:461–486
  • Andrew et al [2013] Andrew G, Arora R, Bilmes J, et al (2013) Deep canonical correlation analysis. In: International conference on machine learning, PMLR, pp 1247–1255
  • Antoniak [1974] Antoniak CE (1974) Mixtures of dirichlet processes with applications to bayesian nonparametric problems. The annals of statistics pp 1152–1174
  • Belkin and Niyogi [2001] Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in neural information processing systems 14
  • Bengio et al [2013] Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828
  • Berthelot et al [2019] Berthelot D, Carlini N, Goodfellow I, et al (2019) Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems 32
  • Cai et al [2023] Cai S, Qiu L, Chen X, et al (2023) Semantic-enhanced image clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 6869–6878
  • Caron et al [2018] Caron M, Bojanowski P, Joulin A, et al (2018) Deep clustering for unsupervised learning of visual features. In: Proceedings of the European conference on computer vision (ECCV), pp 132–149
  • Chang et al [2017] Chang J, Wang L, Meng G, et al (2017) Deep adaptive image clustering. In: Proceedings of the IEEE international conference on computer vision, pp 5879–5887
  • Chatterjee and Ahmed [2022] Chatterjee A, Ahmed BS (2022) Iot anomaly detection methods and applications: A survey. Internet of Things 19:100568
  • Chen [2015] Chen G (2015) Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084
  • Chierichetti et al [2017] Chierichetti F, Kumar R, Lattanzi S, et al (2017) Fair clustering through fairlets. Advances in neural information processing systems 30
  • Coates et al [2011] Coates A, Ng A, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, pp 215–223
  • Comaniciu and Meer [2002] Comaniciu D, Meer P (2002) Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence 24(5):603–619
  • Dang et al [2021] Dang Z, Deng C, Yang X, et al (2021) Nearest neighbor matching for deep clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13693–13702
  • Deng et al [2015] Deng C, Lv Z, Liu W, et al (2015) Multi-view matrix decomposition: A new scheme for exploring discriminative information. In: Twenty-Fourth International Joint Conference on Artificial Intelligence
  • Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, Ieee, pp 248–255
  • Dong et al [2021] Dong S, Wang P, Abbas K (2021) A survey on deep learning and its applications. Computer Science Review 40:100379
  • Ester et al [1996a] Ester M, Kriegel HP, Sander J, et al (1996a) A density-based algorithm for discovering clusters in large spatial databases with noise. In: kdd, pp 226–231
  • Ester et al [1996b] Ester M, Kriegel HP, Sander J, et al (1996b) A density-based algorithm for discovering clusters in large spatial databases with noise. In: kdd, pp 226–231
  • Fortunato [2010] Fortunato S (2010) Community detection in graphs. Physics reports 486(3-5):75–174
  • Georghiades et al [2001] Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE transactions on pattern analysis and machine intelligence 23(6):643–660
  • Gidaris et al [2018] Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728
  • Goodfellow et al [2014] Goodfellow I, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. Advances in neural information processing systems 27
  • Gowda and Krishna [1978] Gowda KC, Krishna G (1978) Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern recognition 10(2):105–112
  • Grill et al [2020] Grill JB, Strub F, Altché F, et al (2020) Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33:21271–21284
  • Gurumurthy et al [2017] Gurumurthy S, Kiran Sarvadevabhatla R, Venkatesh Babu R (2017) Deligan: Generative adversarial networks for diverse and limited data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 166–174
  • Hadsell et al [2006] Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), IEEE, pp 1735–1742
  • Hastings [1970] Hastings WK (1970) Monte carlo sampling methods using markov chains and their applications
  • He et al [2020] He K, Fan H, Wu Y, et al (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738
  • Hu et al [2017] Hu W, Miyato T, Tokui S, et al (2017) Learning discrete representations via information maximizing self-augmented training. In: International conference on machine learning, PMLR, pp 1558–1567
  • Huang et al [2020] Huang J, Gong S, Zhu X (2020) Deep semantic clustering by partition confidence maximisation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8849–8858
  • Huang et al [2014] Huang P, Huang Y, Wang W, et al (2014) Deep embedding network for clustering. In: 2014 22nd International conference on pattern recognition, IEEE, pp 1532–1537
  • Huang et al [2019] Huang Z, Zhou JT, Peng X, et al (2019) Multi-view spectral clustering network. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, pp 2563–2569, 10.24963/ijcai.2019/356
  • Huang et al [2021] Huang Z, Zhou JT, Zhu H, et al (2021) Deep spectral representation learning from multi-view data. IEEE Transactions on Image Processing 30:5352–5362
  • Huang et al [2022] Huang Z, Chen J, Zhang J, et al (2022) Learning representation for clustering via prototype scattering and positive sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Hubert and Arabie [1985] Hubert L, Arabie P (1985) Comparing partitions. Journal of classification 2:193–218
  • Huynh et al [2022] Huynh T, Kornblith S, Walter MR, et al (2022) Boosting contrastive self-supervised learning with false negative cancellation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2785–2795
  • Ji et al [2019] Ji X, Henriques JF, Vedaldi A (2019) Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9865–9874
  • Jiang et al [2016] Jiang Z, Zheng Y, Tan H, et al (2016) Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148
  • Jin et al [2021] Jin D, Yu Z, Jiao P, et al (2021) A survey of community detection approaches: From statistical modeling to deep learning. IEEE Transactions on Knowledge and Data Engineering 35(2):1149–1170
  • Kim and Ha [2021] Kim Y, Ha JW (2021) Contrastive fine-grained class clustering via generative adversarial networks
  • Kingma and Welling [2013] Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114
  • Krause et al [2010] Krause A, Perona P, Gomes R (2010) Discriminative clustering by regularized information maximization. Advances in neural information processing systems 23
  • Krizhevsky et al [2009] Krizhevsky A, Hinton G, et al (2009) Learning multiple layers of features from tiny images
  • Kuhn [1955] Kuhn HW (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2(1-2):83–97
  • Kumar et al [2012] Kumar A, Kim H, Hancke GP (2012) Environmental monitoring systems: A review. IEEE Sensors Journal 13(4):1329–1339
  • Laine and Aila [2016] Laine S, Aila T (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242
  • Le and Yang [2015] Le Y, Yang X (2015) Tiny imagenet visual recognition challenge. CS 231N 7(7):3
  • Li et al [2023a] Li H, Li Y, Yang M, et al (2023a) Incomplete multi-view clustering via prototype-based imputation. arXiv preprint arXiv:2301.11045
  • Li et al [2020] Li P, Zhao H, Liu H (2020) Deep fair clustering for visual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9070–9079
  • Li et al [2021] Li Y, Hu P, Liu Z, et al (2021) Contrastive clustering. In: Proceedings of the AAAI conference on artificial intelligence, pp 8547–8555
  • Li et al [2022] Li Y, Yang M, Peng D, et al (2022) Twin contrastive learning for online clustering. International Journal of Computer Vision 130(9):2205–2221
  • Li et al [2023b] Li Y, Hu P, Peng D, et al (2023b) Image clustering with external guidance. arXiv preprint arXiv:2310.11989
  • Li et al [2023c] Li Y, Lin Y, Hu P, et al (2023c) Single-cell rna-seq debiased clustering via batch effect disentanglement. IEEE Transactions on Neural Networks and Learning Systems
  • Li et al [2023d] Li Y, Zhang D, Yang M, et al (2023d) scbridge embraces cell heterogeneity in single-cell rna-seq and atac-seq data integration. Nature Communications 14(1):6045
  • Lin et al [2021] Lin Y, Gou Y, Liu Z, et al (2021) Completer: Incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11174–11183
  • Lin et al [2022] Lin Y, Gou Y, Liu X, et al (2022) Dual contrastive prediction for incomplete multi-view representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence pp 1–14. 10.1109/TPAMI.2022.3197238
  • Lin et al [2023] Lin Y, Yang M, Yu J, et al (2023) Graph matching with bi-level noisy correspondence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 23362–23371
  • Liu et al [2022] Liu J, Lin Y, Jiang L, et al (2022) Improve interpretability of neural networks via sparse contrastive coding. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp 460–470
  • Liu et al [2019a] Liu X, Zhu X, Li M, et al (2019a) Multiple kernel k-means with incomplete kernels. IEEE transactions on pattern analysis and machine intelligence 42(5):1191–1204
  • Liu et al [2019b] Liu X, Zhu X, Li M, et al (2019b) Multiple kernel k-means with incomplete kernels. IEEE transactions on pattern analysis and machine intelligence 42(5):1191–1204
  • Lu et al [2023] Lu Y, Lin Y, Yang M, et al (2023) Decoupled contrastive multi-view clustering with high-order random walks. arXiv preprint arXiv:2308.11164
  • MacQueen et al [1967] MacQueen J, et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, pp 281–297
  • McDaid et al [2011] McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515
  • Min et al [2018] Min E, Guo X, Liu Q, et al (2018) A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access 6:39501–39514
  • Montoya et al [2006] Montoya JM, Pimm SL, Solé RV (2006) Ecological networks and their fragility. Nature 442(7100):259–264
  • Moskalev et al [2022] Moskalev A, Sosnovik I, Fischer V, et al (2022) Contrasting quadratic assignments for set-based representation learning. In: European Conference on Computer Vision, Springer, pp 88–104
  • Mukherjee et al [2019] Mukherjee S, Asnani H, Lin E, et al (2019) Clustergan: Latent space clustering in generative adversarial networks. In: Proceedings of the AAAI conference on artificial intelligence, pp 4610–4617
  • Murdock and Yaeger [2011] Murdock J, Yaeger LS (2011) Identifying species by genetic clustering. In: ECAL, pp 564–572
  • Murtagh and Contreras [2012] Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1):86–97
  • Nene et al [1996] Nene SA, Nayar SK, Murase H, et al (1996) Columbia object image library (coil-20)
  • Newman and Girvan [2004] Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Physical review E 69(2):026113
  • Nguyen et al [2021] Nguyen XB, Bui DT, Duong CN, et al (2021) Clusformer: A transformer based clustering approach to unsupervised large-scale face and visual landmark recognition. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10842–10851, 10.1109/CVPR46437.2021.01070
  • Nie et al [2016] Nie F, Li J, Li X, et al (2016) Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In: IJCAI
  • Nie et al [2017] Nie F, Li J, Li X, et al (2017) Self-weighted multiview clustering with multiple graphs. In: IJCAI, pp 2564–2570
  • Niu et al [2022] Niu C, Shan H, Wang G (2022) Spice: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing 31:7264–7278
  • Oord et al [2018] Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
  • Peng et al [2016] Peng X, Xiao S, Feng J, et al (2016) Deep subspace clustering with sparsity prior. In: IJCAI, pp 1925–1931
  • Peng et al [2019] Peng X, Huang Z, Lv J, et al (2019) Comic: Multi-view clustering without parameter selection. In: International conference on machine learning, PMLR, pp 5092–5101
  • Qian [2023] Qian Q (2023) Stable cluster discrimination for deep clustering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 16645–16654
  • Radford et al [2015] Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
  • Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, PMLR, pp 8748–8763
  • Ren et al [2022] Ren Y, Pu J, Yang Z, et al (2022) Deep clustering: A comprehensive survey. arXiv preprint arXiv:2210.04142
  • Roweis and Saul [2000] Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290(5500):2323–2326
  • Saeedi Emadi and Mazinani [2018] Saeedi Emadi H, Mazinani SM (2018) A novel anomaly detection algorithm using dbscan and svm in wireless sensor networks. Wireless Personal Communications 98:2025–2035
  • Schaeffer [2007] Schaeffer SE (2007) Graph clustering. Computer science review 1(1):27–64
  • Shah and Koltun [2017] Shah SA, Koltun V (2017) Robust continuous clustering. Proceedings of the National Academy of Sciences 114(37):9814–9819
  • Shah and Koltun [2018] Shah SA, Koltun V (2018) Deep continuous clustering. arXiv preprint arXiv:1803.01449
  • Shaham and Lederman [2018] Shaham U, Lederman RR (2018) Learning by coincidence: Siamese networks and common variable learning. Pattern Recognition 74:52–63
  • Shaham et al [2018] Shaham U, Stanton K, Li H, et al (2018) Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587
  • Shen et al [2021] Shen Y, Shen Z, Wang M, et al (2021) You never cluster alone. Advances in Neural Information Processing Systems 34:27734–27746
  • Shorten and Khoshgoftaar [2019] Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. Journal of big data 6(1):1–48
  • Sohn et al [2020] Sohn K, Berthelot D, Li CL, et al (2020) Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685
  • Song et al [2013] Song C, Liu F, Huang Y, et al (2013) Auto-encoder based data clustering. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18, Springer, pp 117–124
  • Su et al [2022] Su X, Xue S, Liu F, et al (2022) A comprehensive survey on community detection with deep learning. IEEE Transactions on Neural Networks and Learning Systems pp 1–21. 10.1109/TNNLS.2021.3137396
  • Van Gansbeke et al [2020] Van Gansbeke W, Vandenhende S, Georgoulis S, et al (2020) Scan: Learning to classify images without labels. In: European conference on computer vision, Springer, pp 268–285
  • Wang et al [2018] Wang Q, Chen M, Nie F, et al (2018) Detecting coherent groups in crowd scenes by multiview clustering. IEEE transactions on pattern analysis and machine intelligence 42(1):46–58
  • Wang et al [2016] Wang W, Yan X, Lee H, et al (2016) Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454
  • Wang et al [2021] Wang Z, Ni Y, Jing B, et al (2021) Dnb: A joint learning framework for deep bayesian nonparametric clustering. IEEE Transactions on Neural Networks and Learning Systems 33(12):7610–7620
  • Wright et al [2010] Wright J, Ma Y, Mairal J, et al (2010) Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 98(6):1031–1044
  • Wu et al [2019] Wu D, Zheng SJ, Zhang XP, et al (2019) Deep learning-based methods for person re-identification: A comprehensive review. Neurocomputing 337:354–371
  • Wu et al [2016] Wu M, Tan L, Xiong N (2016) Data prediction, compression, and recovery in clustered wireless sensor networks for environmental monitoring applications. Information Sciences 329:800–818
  • Wu et al [2018] Wu Z, Xiong Y, Yu SX, et al (2018) Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3733–3742
  • Xia and Vlajic [2007] Xia D, Vlajic N (2007) Near-optimal node clustering in wireless sensor networks for environment monitoring. In: 21st international conference on advanced information networking and applications (AINA’07), IEEE, pp 632–641
  • Xie et al [2016] Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International conference on machine learning, PMLR, pp 478–487
  • Xu et al [2013] Xu C, Tao D, Xu C (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634
  • Xu et al [2022] Xu J, De Mello S, Liu S, et al (2022) Groupvit: Semantic segmentation emerges from text supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18134–18144
  • Yan et al [2023] Yan Y, Li J, Qin J, et al (2023) Efficient person search: An anchor-free approach. International Journal of Computer Vision pp 1–20
  • Yang et al [2016] Yang J, Parikh D, Batra D (2016) Joint unsupervised learning of deep representations and image clusters. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5147–5156
  • Yang et al [2023] Yang J, Liu J, Xu N, et al (2023) Tvt: Transferable vision transformer for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 520–530
  • Yang et al [2021] Yang M, Li Y, Huang Z, et al (2021) Partially view-aligned representation learning with noise-robust contrastive loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • Yang et al [2022a] Yang M, Huang Z, Hu P, et al (2022a) Learning with twin noisy labels for visible-infrared person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • Yang et al [2022b] Yang M, Li Y, Hu P, et al (2022b) Robust multi-view clustering with incomplete information. IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Ye et al [2021] Ye M, Shen J, Lin G, et al (2021) Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence 44(6):2872–2893
  • Zeng et al [2023] Zeng P, Li Y, Hu P, et al (2023) Deep fair clustering via maximizing and minimizing mutual information: Theory, algorithm and metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 23986–23995
  • Zhang et al [2015] Zhang C, Fu H, Liu S, et al (2015) Low-rank tensor constrained multiview subspace clustering. In: Proceedings of the IEEE international conference on computer vision, pp 1582–1590
  • Zhang et al [2023] Zhang H, Nie F, Li X (2023) Large-scale clustering with structured optimal bipartite graph. IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Zhang et al [2019] Zhang L, Qi GJ, Wang L, et al (2019) Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2547–2555
  • Zhang et al [2018] Zhang Z, Liu L, Shen F, et al (2018) Binary multi-view clustering. IEEE transactions on pattern analysis and machine intelligence 41(7):1774–1782
  • Zhao et al [2016] Zhao H, Liu H, Fu Y (2016) Incomplete multi-modal visual data grouping. In: IJCAI, pp 2392–2398
  • Zhao et al [2019] Zhao T, Wang Z, Masoomi A, et al (2019) Streaming adaptive nonparametric variational autoencoder. arXiv preprint arXiv:1906.03288
  • Zhong et al [2020] Zhong H, Chen C, Jin Z, et al (2020) Deep robust clustering by contrastive learning. arXiv preprint arXiv:2008.03030
  • Zhong et al [2021] Zhong H, Wu J, Chen C, et al (2021) Graph contrastive clustering. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9224–9233
  • Zhou et al [2022] Zhou S, Xu H, Zheng Z, et al (2022) A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions. arXiv preprint arXiv:2206.07579

Author Contributions

All authors contributed to the core insights presented in this paper. Xi Peng supervised this survey and provided guidance throughout the process. Yiding Lu, Haobin Li, Yunfan Li, and Yijie Lin jointly wrote the section Priors for Deep Clustering. Yiding Lu took the lead in writing the Introduction, Application, and Future Challenges sections. Haobin Li collected and analyzed the experimental results, created the figures, and compiled the summary tables. Yunfan Li and Yijie Lin designed the outline, wrote the Abstract, and refined the manuscript.

Data Availability

The datasets utilized in this survey are publicly available and can be accessed from the following sources: