Graph sketching-based Space-efficient Data Clustering

Morvan, Anne; Choromanski, Krzysztof; Gouy-Pailler, Cédric; Atif, Jamal


arXiv:1703.02375v3 (cs)

[Submitted on 7 Mar 2017 (v1), revised 4 Sep 2017 (this version, v3), latest version 27 May 2018 (v5)]

Abstract: In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets under tight space constraints, as is the case, for instance, in the Internet of Things environment, where analysis algorithms are deployed directly on resource-limited mobile devices collecting the data. We present DBMSTClu, a new density-based \emph{non-parametric} method working on a limited number of linear measurements, i.e. a \emph{sketched} version of the dissimilarity graph $G$ between the $N$ objects to cluster. Unlike the $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular structures. No input parameter is needed, in contrast to DBSCAN or the Spectral Clustering method. As a graph-based technique, DBMSTClu relies on the dissimilarity graph $G$, which theoretically costs $O(N^2)$ in memory. However, our algorithm follows the dynamic semi-streaming model: it handles $G$ as a stream of edge weight updates and sketches it in one pass over the data into a compact structure requiring $O(N \operatorname{polylog}(N))$ space. Because the Minimum Spanning Tree (MST) expresses the underlying structure of a graph, our algorithm successfully detects the right number of non-convex clusters by recovering an approximate MST from the graph sketch of $G$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets.
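
The idea of reading clusters off a Minimum Spanning Tree can be illustrated with a minimal sketch. It is not DBMSTClu itself: it assumes Euclidean dissimilarities, builds an exact MST with Kruskal's algorithm from the full $O(N^2)$ edge list (precisely the cost the sketching step is meant to avoid), and takes the number of cuts as an input, whereas the method in the abstract is parameter-free. The function names `mst_edges` and `cut_heaviest` are hypothetical.

```python
import math
from itertools import combinations


def find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Exact MST (Kruskal) of the complete Euclidean dissimilarity graph.

    Assumption: this materialises all N*(N-1)/2 edges, so it uses O(N^2)
    memory; it stands in for the approximate MST recovered from a sketch.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # edge joins two components, so it belongs to the MST
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def cut_heaviest(tree, n, n_cuts):
    """Drop the n_cuts heaviest MST edges and label the connected components.

    Assumption: a fixed number of cuts replaces DBMSTClu's own cut criterion,
    which the abstract does not spell out.
    """
    kept = sorted(tree)[: len(tree) - n_cuts]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(parent, i)] = find(parent, j)
    roots = {}
    # Labels are assigned in order of first appearance: 0, 1, 2, ...
    return [roots.setdefault(find(parent, i), len(roots)) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = cut_heaviest(mst_edges(points), len(points), n_cuts=1)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With two well-separated groups, the single heaviest MST edge is the one bridging them, so removing it leaves one component per group. Because the cut follows tree edges rather than distances to a centroid, the same mechanism can separate non-convex shapes.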

Subjects:	Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1703.02375 [cs.LG]
	(or arXiv:1703.02375v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1703.02375

Submission history

From: Anne Morvan
[v1] Tue, 7 Mar 2017 13:43:45 UTC (1,467 KB)
[v2] Thu, 23 Mar 2017 13:11:52 UTC (1,624 KB)
[v3] Mon, 4 Sep 2017 07:58:04 UTC (5,526 KB)
[v4] Wed, 20 Dec 2017 20:37:29 UTC (6,178 KB)
[v5] Sun, 27 May 2018 17:17:46 UTC (6,178 KB)
