Graph sketching-based Space-efficient Data Clustering

Morvan, Anne; Choromanski, Krzysztof; Gouy-Pailler, Cédric; Atif, Jamal

doi:10.1137/1.9781611975321.2


arXiv:1703.02375 (cs)

[Submitted on 7 Mar 2017 (v1), last revised 27 May 2018 (this version, v5)]



Abstract: In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets under tight space constraints, as is the case, for instance, in many real-world applications where analysis algorithms are deployed directly on the resource-limited mobile devices that collect the data. We present DBMSTClu, a new space-efficient, density-based, non-parametric method working on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements, i.e. a sketched version of the dissimilarity graph $\mathcal{G}$ between the $N$ objects to cluster. Unlike the $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular shapes, thanks to the ability of the MST to express the underlying structure of a graph. No input parameter is needed, in contrast to DBSCAN or the Spectral Clustering method. An approximate MST is retrieved by following the dynamic semi-streaming model: the dissimilarity graph $\mathcal{G}$ is handled as a stream of edge weight updates, which is sketched in one pass over the data into a compact structure requiring $O(N \operatorname{polylog}(N))$ space, far less than the theoretical $O(N^2)$ memory cost of $\mathcal{G}$. Taking the recovered approximate MST $\mathcal{T}$ as input, DBMSTClu then detects the right number of nonconvex clusters by performing relevant cuts on $\mathcal{T}$ in time linear in $N$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets.
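The idea that cuts on an MST can separate nonconvex clusters can be illustrated with a minimal sketch. This is not DBMSTClu: the abstract does not spell out the cut criterion, so the sketch uses the classical "remove the heaviest MST edges" rule with a user-chosen number of cuts, and it builds an exact MST from the full dissimilarity graph rather than an approximate one from a sketch. The function names `prim_mst` and `cut_heaviest`, and the two-ring toy dataset, are assumptions made for illustration only.

```python
import math

def prim_mst(points):
    """Return the MST edges (weight, u, v) of the complete Euclidean graph on points."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cut_heaviest(n, edges, n_cuts):
    """Drop the n_cuts heaviest MST edges and label the connected components."""
    kept = sorted(edges)[: len(edges) - n_cuts]
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]   # path halving
            x = root[x]
        return x

    for _, u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(n)]

def ring(radius, count):
    """Evenly spaced points on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * i / count),
             radius * math.sin(2 * math.pi * i / count)) for i in range(count)]

# Toy data (hypothetical): two concentric rings, a shape centroid-based methods split badly.
points = ring(1.0, 20) + ring(4.0, 40)
labels = cut_heaviest(len(points), prim_mst(points), n_cuts=1)
```

In this toy case the only long MST edge is the one bridging the two rings, so a single cut recovers each ring as its own cluster. The method described in the abstract differs in that it chooses the cuts, and hence the number of clusters, without any input parameter.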

Comments:	Proceedings of the 2018 SIAM International Conference on Data Mining
Subjects:	Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1703.02375 [cs.LG]
	(or arXiv:1703.02375v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1703.02375
Related DOI:	https://doi.org/10.1137/1.9781611975321.2

Submission history

From: Anne Morvan [view email]
[v1] Tue, 7 Mar 2017 13:43:45 UTC (1,467 KB)
[v2] Thu, 23 Mar 2017 13:11:52 UTC (1,624 KB)
[v3] Mon, 4 Sep 2017 07:58:04 UTC (5,526 KB)
[v4] Wed, 20 Dec 2017 20:37:29 UTC (6,178 KB)
[v5] Sun, 27 May 2018 17:17:46 UTC (6,178 KB)
