-
On the adversarial robustness of Locality-Sensitive Hashing in Hamming space
Authors:
Michael Kapralov,
Mikhail Makarov,
Christian Sohler
Abstract:
Locality-sensitive hashing~[Indyk,Motwani'98] is a classical data structure for approximate nearest neighbor search. After preprocessing the input dataset in close to linear time, it finds an approximate nearest neighbor of any fixed query in time sublinear in the dataset size. The resulting data structure is randomized and succeeds with high probability for every fixed query.
In many modern applications of nearest neighbor search the queries are chosen adaptively. In this paper, we study the robustness of locality-sensitive hashing to adaptive queries in Hamming space. We present a simple adversary that can, under mild assumptions on the initial point set, provably find a query on which the approximate near neighbor search data structure fails. Crucially, our adaptive algorithm finds the hard query exponentially faster than random sampling.
Submitted 17 June, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
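For readers unfamiliar with the data structure discussed in the abstract above, the following is a minimal sketch of bit-sampling locality-sensitive hashing in Hamming space. It is a textbook-style illustration, not the construction or the adversary analyzed in the paper; the class name `BitSamplingLSH` and the parameters `num_tables` and `bits_per_table` are hypothetical names chosen for this example.

```python
import random


def hamming(a, b):
    """Number of coordinates in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    """Toy bit-sampling LSH index over points in {0,1}^d (illustrative only)."""

    def __init__(self, points, num_tables, bits_per_table, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        # Each table hashes a point to its values on a random list of coordinates.
        self.coords = [
            [rng.randrange(d) for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = []
        for cs in self.coords:
            table = {}
            for idx, p in enumerate(points):
                table.setdefault(tuple(p[c] for c in cs), []).append(idx)
            self.tables.append(table)

    def query(self, q, radius):
        """Return the index of a stored point within `radius` of q, or None.

        None can be returned even when such a point exists: the structure is
        randomized and only succeeds with some probability per fixed query.
        """
        for cs, table in zip(self.coords, self.tables):
            for idx in table.get(tuple(q[c] for c in cs), []):
                if hamming(self.points[idx], q) <= radius:
                    return idx
        return None
```

The random coordinates are fixed once at construction time, so the same hash functions answer every later query. That is the setting in which adaptively chosen queries, as studied in the abstract, become a concern.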
-
Constant Approximation for Normalized Modularity and Associations Clustering
Authors:
Jakub Łącki,
Vahab Mirrokni,
Christian Sohler
Abstract:
We study the problem of graph clustering under a broad class of objectives in which the quality of a cluster is defined based on the ratio between the number of edges in the cluster, and the total weight of vertices in the cluster. We show that our definition is closely related to popular clustering measures, namely normalized associations, which is a dual of the normalized cut objective, and normalized modularity. We give a linear time constant-approximate algorithm for our objective, which implies the first constant-factor approximation algorithms for normalized modularity and normalized associations.
Submitted 29 December, 2022;
originally announced December 2022.
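The abstract above describes cluster quality as a ratio between the number of edges inside a cluster and the total weight of its vertices. The sketch below computes only that ratio for a single cluster. The paper's exact objective is not given in the abstract, so the function name `cluster_ratio` and the use of vertex degrees as weights in the usage note are assumptions made for illustration.

```python
def cluster_ratio(edges, weight, cluster):
    """Ratio of edges inside `cluster` to the total vertex weight of `cluster`.

    edges   -- iterable of (u, v) pairs of an undirected graph
    weight  -- dict mapping each vertex to a positive weight (assumed input)
    cluster -- iterable of vertices forming the cluster
    """
    members = set(cluster)
    internal = sum(1 for u, v in edges if u in members and v in members)
    total_weight = sum(weight[v] for v in members)
    if total_weight <= 0:
        raise ValueError("cluster must have positive total weight")
    return internal / total_weight
```

For example, on a triangle `0-1-2` with a pendant edge `2-3` and degrees as weights, the cluster `{0, 1, 2}` contains 3 edges and has weight 2 + 2 + 3 = 7, giving a ratio of 3/7.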
-
Motif Cut Sparsifiers
Authors:
Michael Kapralov,
Mikhail Makarov,
Sandeep Silwal,
Christian Sohler,
Jakab Tardos
Abstract:
A motif is a frequently occurring subgraph of a given directed or undirected graph $G$. Motifs capture higher order organizational structure of $G$ beyond edge relationships, and, therefore, have found wide applications such as in graph clustering, community detection, and analysis of biological and physical networks to name a few. In these applications, the cut structure of motifs plays a crucial role as vertices are partitioned into clusters by cuts whose conductance is based on the number of instances of a particular motif, as opposed to just the number of edges, crossing the cuts.
In this paper, we introduce the concept of a motif cut sparsifier. We show that one can compute in polynomial time a sparse weighted subgraph $G'$ with only $\widetilde{O}(n/ε^2)$ edges such that for every cut, the weighted number of copies of $M$ crossing the cut in $G'$ is within a $1+ε$ factor of the number of copies of $M$ crossing the cut in $G$, for every constant size motif $M$.
Our work carefully combines the viewpoints of both graph sparsification and hypergraph sparsification. We sample edges, which requires us to extend and strengthen the concept of cut sparsifiers, introduced in seminal earlier work, to the motif setting. We adapt the importance sampling framework through the viewpoint of hypergraph sparsification by deriving the edge sampling probabilities from the strong connectivity values of a hypergraph whose hyperedges represent motif instances. Finally, an iterative sparsification primitive inspired by both viewpoints is used to reduce the number of edges in $G$ to nearly linear.
In addition, we present a strong lower bound ruling out a similar result for sparsification with respect to induced occurrences of motifs.
Submitted 12 September, 2022; v1 submitted 21 April, 2022;
originally announced April 2022.
-
Spectral Clustering Oracles in Sublinear Time
Authors:
Grzegorz Gluch,
Michael Kapralov,
Silvio Lattanzi,
Aida Mousavifar,
Christian Sohler
Abstract:
Given a graph $G$ that can be partitioned into $k$ disjoint expanders with outer conductance upper bounded by $ε\ll 1$, can we efficiently construct a small space data structure that allows quickly classifying vertices of $G$ according to the expander (cluster) they belong to? Formally, we would like an efficient local computation algorithm that misclassifies at most an $O(ε)$ fraction of vertices in every expander. We refer to such a data structure as a \textit{spectral clustering oracle}. Our main result is a spectral clustering oracle with query time $O^*(n^{1/2+O(ε)})$ and preprocessing time $2^{O(\frac{1}ε k^4 \log^2(k))} n^{1/2+O(ε)}$ that provides misclassification error $O(ε\log k)$ per cluster for any $ε\ll 1/\log k$. More generally, query time can be reduced at the expense of increasing the preprocessing time appropriately (as long as the product is about $n^{1+O(ε)}$) -- this in particular gives a nearly linear time spectral clustering primitive. The main technical contribution is a sublinear time oracle that provides dot product access to the spectral embedding of $G$ by estimating distributions of short random walks from vertices in $G$. The distributions themselves provide a poor approximation to the spectral embedding, but we show that an appropriate linear transformation can be used to achieve high precision dot product access. We then show that dot product access to the spectral embedding is sufficient to design a clustering oracle. At a high level our approach amounts to hyperplane partitioning in the spectral embedding of $G$, but crucially operates on a nested sequence of carefully defined subspaces in the spectral embedding to achieve per cluster recovery guarantees.
Submitted 19 October, 2021; v1 submitted 14 January, 2021;
originally announced January 2021.
-
Fast and Accurate $k$-means++ via Rejection Sampling
Authors:
Vincent Cohen-Addad,
Silvio Lattanzi,
Ashkan Norouzi-Fard,
Christian Sohler,
Ola Svensson
Abstract:
$k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees and strong empirical performance. Despite its wide adoption, $k$-means++ sometimes suffers from being slow on large data-sets so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present a near linear time algorithm for $k$-means++ seeding. Interestingly our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.
Submitted 22 December, 2020;
originally announced December 2020.
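As background for the abstract above, the following is a minimal sketch of the standard $k$-means++ seeding step ($D^2$ sampling), which the paper sets out to accelerate. It is not the rejection-sampling algorithm of the paper; the function names and the `seed` parameter are hypothetical choices for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeding(points, k, seed=0):
    """Pick k initial centers by standard D^2 sampling (baseline, not the paper's method)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform sampling.
            nxt = rng.choice(points)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers
```

Each round recomputes distances for all $n$ points, so this baseline makes $k$ full passes over the data. That repeated pass is the cost that faster seeding methods aim to avoid.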
-
A characterization of graph properties testable for general planar graphs with one-sided error (It is all about forbidden subgraphs)
Authors:
Artur Czumaj,
Christian Sohler
Abstract:
The problem of characterizing testable graph properties (properties that can be tested with a number of queries independent of the input size) is a fundamental problem in the area of property testing. While there has been extensive prior research characterizing testable graph properties in the dense graphs model and we have a good understanding of the bounded degree graphs model, no similar characterization has been known for general graphs, with no degree bounds. In this paper we take on this major challenge and consider the problem of characterizing all testable graph properties in general planar graphs.
We consider the model in which a general planar graph can be accessed by the random neighbor oracle that allows access to any given vertex and access to a random neighbor of a given vertex. We show that, informally, a graph property $P$ is testable with one-sided error for general planar graphs if and only if testing $P$ can be reduced to testing for a finite family of finite forbidden subgraphs. While our presentation focuses on planar graphs, our approach extends easily to general minor-free graphs.
Our analysis of the necessary condition relies on a recent construction of canonical testers in the random neighbor oracle model that is applied here to the one-sided error model for testing in planar graphs. The sufficient condition in the characterization reduces the problem to the task of testing $H$-freeness in planar graphs, and is the main and most challenging technical contribution of the paper: we show that for planar graphs (with arbitrary degrees), the property of being $H$-free is testable with one-sided error for every finite graph $H$, in the random neighbor oracle model.
Submitted 23 September, 2019;
originally announced September 2019.
-
Fully dynamic hierarchical diameter k-clustering and k-center
Authors:
Melanie Schmidt,
Christian Sohler
Abstract:
We develop dynamic data structures for maintaining a hierarchical k-center clustering when the points come from a discrete space $\{1,\ldots,Δ\}^d$. Our first data structure is for the low dimensional setting, i.e., d is a constant, and processes insertions, deletions and cluster representative queries in $\log^{O(1)} (Δn)$ time, where $n$ is the current size of the point set. For the high dimensional case and an integer parameter $\ell > 1$, we provide a randomized data structure that maintains an $O(d \ell)$-approximation. The amortized expected insertion time is $O(d^2 \ell \log n \log Δ)$. The amortized expected deletion time is $O(d^2 n^{1/\ell} \log^2 n \log Δ)$. At any point of time, with probability at least $1-1/n$, the data structure can correctly answer all queries for cluster representatives in $O(d \ell \log n \log Δ)$ time per query.
Submitted 7 August, 2019;
originally announced August 2019.
-
Testable Properties in General Graphs and Random Order Streaming
Authors:
Artur Czumaj,
Hendrik Fichtenberger,
Pan Peng,
Christian Sohler
Abstract:
We present a novel framework closely linking the areas of property testing and data streaming algorithms in the setting of general graphs. It has been recently shown (Monemizadeh et al. 2017) that for bounded-degree graphs, any constant-query tester can be emulated in the random order streaming model by a streaming algorithm that uses only space required to store a constant number of words. However, in a more natural setting of general graphs, with no restriction on the maximum degree, no such results were known because of our lack of understanding of constant-query testers in general graphs and lack of techniques to appropriately emulate in the streaming setting off-line algorithms allowing many high-degree vertices.
In this work we advance our understanding of both of these challenges. First, we provide canonical testers for all constant-query testers for general graphs, for both one-sided and two-sided errors. Such canonizations were only known before (in the adjacency matrix model) for dense graphs (Goldreich and Trevisan 2003) and (in the adjacency list model) for bounded degree (di-)graphs (Goldreich and Ron 2011, Czumaj et al. 2016). Using the concept of canonical testers, we then prove that every property of general graphs that is constant-query testable with one-sided error can also be tested in constant-space with one-sided error in the random order streaming model.
Our results imply, among others, that properties like $(s,t)$ disconnectivity, $k$-path-freeness, etc. are constant-space testable in random order streams.
Submitted 5 May, 2019;
originally announced May 2019.
-
Fair Coresets and Streaming Algorithms for Fair k-Means Clustering
Authors:
Melanie Schmidt,
Chris Schwiegelshohn,
Christian Sohler
Abstract:
We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well.
We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.
Submitted 9 March, 2021; v1 submitted 27 December, 2018;
originally announced December 2018.
-
Every Testable (Infinite) Property of Bounded-Degree Graphs Contains an Infinite Hyperfinite Subproperty
Authors:
Hendrik Fichtenberger,
Pan Peng,
Christian Sohler
Abstract:
One of the most fundamental questions in graph property testing is to characterize the combinatorial structure of properties that are testable with a constant number of queries. We work towards an answer to this question for the bounded-degree graph model introduced in [Goldreich, Ron, 2002], where the input graphs have maximum degree bounded by a constant $d$. In this model, it is known (among other results) that every \emph{hyperfinite} property is constant-query testable [Newman, Sohler, 2013], where, informally, a graph property is hyperfinite, if for every $δ>0$ every graph in the property can be partitioned into small connected components by removing $δn$ edges.
In this paper we show that hyperfiniteness plays a role in \emph{every} testable property, i.e. we show that every testable property is either finite (which trivially implies hyperfiniteness and testability) or contains an infinite hyperfinite subproperty. A simple consequence of our result is that no infinite graph property that only consists of expander graphs is constant-query testable.
Based on the above findings, one could ask if every infinite testable non-hyperfinite property might contain an infinite family of expander (or near-expander) graphs. We show that this is not true. Motivated by our counter-example we develop a theorem that shows that we can partition the set of vertices of every bounded degree graph into a constant number of subsets and a separator set, such that the separator set is small and the distribution of $k$-disks on every subset of a partition class is roughly the same as that of the partition class if the subset has small expansion.
Submitted 7 November, 2018;
originally announced November 2018.
-
Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension
Authors:
Christian Sohler,
David P. Woodruff
Abstract:
We obtain the first strong coresets for the $k$-median and subspace approximation problems with sum of distances objective function, on $n$ points in $d$ dimensions, with a number of weighted points that is independent of both $n$ and $d$; namely, our coresets have size $\text{poly}(k/ε)$. A strong coreset $(1+ε)$-approximates the cost function for all possible sets of centers simultaneously. We also give efficient $\text{nnz}(A) + (n+d)\text{poly}(k/ε) + \exp(\text{poly}(k/ε))$ time algorithms for computing these coresets.
We obtain the result by introducing a new dimensionality reduction technique for coresets that significantly generalizes an earlier result of Feldman, Sohler and Schmidt \cite{FSS13} for squared Euclidean distances to sums of $p$-th powers of Euclidean distances for constant $p\ge1$.
Submitted 14 April, 2022; v1 submitted 9 September, 2018;
originally announced September 2018.
-
Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
Authors:
Dan Feldman,
Melanie Schmidt,
Christian Sohler
Abstract:
We develop and analyze a method to reduce the size of a very large set of data points in a high dimensional Euclidean space $\mathbb{R}^d$ to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set or computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composability, see Indyk et al., PODS 2014). It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation the coreset sizes are also independent of the number of input points. Our method is based on projecting the points on a low dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis and subspace clustering. The main conceptual contribution is a new coreset definition that allows charging costs that appear for every solution to an additive constant.
Submitted 12 July, 2018;
originally announced July 2018.
-
On Coresets for Logistic Regression
Authors:
Alexander Munteanu,
Chris Schwiegelshohn,
Christian Sohler,
David P. Woodruff
Abstract:
Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $μ(X)$, which quantifies the hardness of compressing a data set for logistic regression. $μ(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $μ(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.
Submitted 8 March, 2021; v1 submitted 22 May, 2018;
originally announced May 2018.
-
Approximating the Spectrum of a Graph
Authors:
David Cohen-Steiner,
Weihao Kong,
Christian Sohler,
Gregory Valiant
Abstract:
The spectrum of a network or graph $G=(V,E)$ with adjacency matrix $A$ consists of the eigenvalues of the normalized Laplacian $L= I - D^{-1/2} A D^{-1/2}$. This set of eigenvalues encapsulates many aspects of the structure of the graph, including the extent to which the graph possesses community structures at multiple scales. We study the problem of approximating the spectrum $λ= (λ_1,\dots,λ_{|V|})$, $0 \le λ_1 \le \dots \le λ_{|V|}\le 2$, of $G$ in the regime where the graph is too large to explicitly calculate the spectrum. We present a sublinear time algorithm that, given the ability to query a random node in the graph and select a random neighbor of a given node, computes a succinct representation of an approximation $\widetilde λ= (\widetilde λ_1,\dots,\widetilde λ_{|V|})$, $0 \le \widetilde λ_1 \le \dots \le \widetilde λ_{|V|}\le 2$, such that $\|\widetilde λ- λ\|_1 \le ε|V|$. Our algorithm has query complexity and running time $\exp(O(1/ε))$, independent of the size of the graph, $|V|$. We demonstrate the practical viability of our algorithm on 15 different real-world graphs from the Stanford Large Network Dataset Collection, including social networks, academic collaboration graphs, and road networks. For the smallest of these graphs, we are able to validate the accuracy of our algorithm by explicitly calculating the true spectrum; for the larger graphs, such a calculation is computationally prohibitive.
In addition, we study the implications of our algorithm for property testing in the bounded degree graph model.
Submitted 5 December, 2017;
originally announced December 2017.
-
Estimating Graph Parameters from Random Order Streams
Authors:
Pan Peng,
Christian Sohler
Abstract:
We develop a new algorithmic technique that allows transferring some constant time approximation algorithms for general graphs into random order streaming algorithms. We illustrate our technique by proving that in random order streams with probability at least $2/3$,
$\bullet$ the number of connected components of $G$ can be approximated up to an additive error of $\varepsilon n$ using $(\frac{1}{\varepsilon})^{O(1/\varepsilon^3)}$ space,
$\bullet$ the weight of a minimum spanning tree of a connected input graph with integer edge weights from $\{1,\dots,W\}$ can be approximated within a multiplicative factor of $1+\varepsilon$ using $\big(\frac{1}{\varepsilon}\big)^{\tilde O(W^3/\varepsilon^3)}$ space,
$\bullet$ the size of a maximum independent set in planar graphs can be approximated within a multiplicative factor of $1+\varepsilon$ using space $2^{(1/\varepsilon)^{(1/\varepsilon)^{\log^{O(1)} (1/\varepsilon)}}}$.
Submitted 13 November, 2017;
originally announced November 2017.
-
Testable Bounded Degree Graph Properties Are Random Order Streamable
Authors:
Morteza Monemizadeh,
S. Muthukrishnan,
Pan Peng,
Christian Sohler
Abstract:
We study which property testing and sublinear time algorithms can be transformed into graph streaming algorithms for random order streams. Our main result is that for bounded degree graphs, any property that is constant-query testable in the adjacency list model can be tested with constant space in a single-pass in random order streams. Our result is obtained by estimating the distribution of local neighborhoods of the vertices on a random order graph stream using constant space.
We then show that our approach can also be applied to constant time approximation algorithms for bounded degree graphs in the adjacency list model: As an example, we obtain a constant-space single-pass random order streaming algorithm for approximating the size of a maximum matching with additive error $εn$ ($n$ is the number of nodes).
Our result establishes for the first time that a large class of sublinear algorithms can be simulated in random order streams, while $Ω(n)$ space is needed for many graph streaming problems for adversarial orders.
Submitted 23 July, 2017;
originally announced July 2017.
-
Clustering High Dimensional Dynamic Data Streams
Authors:
Vladimir Braverman,
Gereon Frahling,
Harry Lang,
Christian Sohler,
Lin F. Yang
Abstract:
We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots, Δ\}^d$. Our algorithms use $k ε^{-2} \text{poly}(d \log Δ)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+ε)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset with additional logarithmic factors in the space and time complexities. We can use this positively-weighted coreset to compute a $(1+ε)$-approximation for the $k$-median problem by any efficient offline $k$-median algorithm. All previous algorithms for computing a $(1+ε)$-approximation for the $k$-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.
Submitted 12 June, 2017;
originally announced June 2017.
-
Theoretical Analysis of the $k$-Means Algorithm - A Survey
Authors:
Johannes Blömer,
Christiane Lammersen,
Melanie Schmidt,
Christian Sohler
Abstract:
The $k$-means algorithm is one of the most widely used clustering heuristics. Despite its simplicity, analyzing its running time and quality of approximation is surprisingly difficult and can lead to deep insights that can be used to improve the algorithm. In this paper we survey the recent results in this direction as well as several extensions of the basic $k$-means method.
Submitted 26 February, 2016;
originally announced February 2016.
-
Clustering time series under the Fréchet distance
Authors:
Anne Driemel,
Amer Krivošija,
Christian Sohler
Abstract:
The Fréchet distance is a popular distance measure for curves. We study the problem of clustering time series under the Fréchet distance. In particular, we give $(1+\varepsilon)$-approximation algorithms for variations of the following problem with parameters $k$ and $\ell$. Given $n$ univariate time series $P$, each of complexity at most $m$, we find $k$ time series, not necessarily from $P$, which we call \emph{cluster centers} and which each have complexity at most $\ell$, such that (a) the maximum distance of an element of $P$ to its nearest cluster center or (b) the sum of these distances is minimized. Our algorithms have running time near-linear in the input size for constant $\varepsilon$, $k$ and $\ell$. To the best of our knowledge, our algorithms are the first clustering algorithms for the Fréchet distance which achieve an approximation factor of $(1+\varepsilon)$ or better.
Keywords: time series, longitudinal data, functional data, clustering, Fréchet distance, dynamic time warping, approximation algorithms.
Submitted 14 December, 2015;
originally announced December 2015.
-
Random projections for Bayesian regression
Authors:
Leo N. Geppert,
Katja Ickstadt,
Alexander Munteanu,
Jens Quedenfeld,
Christian Sohler
Abstract:
This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire $d$-dimensional distribution is approximately preserved under random projections by reducing the number of data points from $n$ to $k\in O(\operatorname{poly}(d/\varepsilon))$ in the case $n\gg d$. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a $(1+O(\varepsilon))$-approximation in terms of the $\ell_2$ Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an $\varepsilon$-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over $\mathbb{R}^d$ for $β$. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time.
Submitted 30 November, 2015; v1 submitted 23 April, 2015;
originally announced April 2015.
-
Testing Cluster Structure of Graphs
Authors:
Artur Czumaj,
Pan Peng,
Christian Sohler
Abstract:
We study the problem of recognizing the cluster structure of a graph in the framework of property testing in the bounded degree model. Given a parameter $\varepsilon$, a $d$-bounded degree graph is defined to be $(k, φ)$-clusterable, if it can be partitioned into no more than $k$ parts, such that the (inner) conductance of the induced subgraph on each part is at least $φ$ and the (outer) conductance of each part is at most $c_{d,k}\varepsilon^4φ^2$, where $c_{d,k}$ depends only on $d,k$. Our main result is a sublinear algorithm with the running time $\widetilde{O}(\sqrt{n}\cdot\mathrm{poly}(φ,k,1/\varepsilon))$ that takes as input a graph with maximum degree bounded by $d$, parameters $k$, $φ$, $\varepsilon$, and with probability at least $\frac23$, accepts the graph if it is $(k,φ)$-clusterable and rejects the graph if it is $\varepsilon$-far from $(k, φ^*)$-clusterable for $φ^* = c'_{d,k}\frac{φ^2 \varepsilon^4}{\log n}$, where $c'_{d,k}$ depends only on $d,k$. By the lower bound of $Ω(\sqrt{n})$ on the number of queries needed for testing graph expansion, which corresponds to $k=1$ in our problem, our algorithm is asymptotically optimal up to polylogarithmic factors.
Submitted 13 April, 2015;
originally announced April 2015.
-
Asymptotically exact streaming algorithms
Authors:
Marc Heinrich,
Alexander Munteanu,
Christian Sohler
Abstract:
We introduce a new computational model for data streams: asymptotically exact streaming algorithms. These algorithms have an approximation ratio that tends to one as the length of the stream goes to infinity while the memory used by the algorithm is restricted to polylog(n) size. Thus, the output of the algorithm is optimal in the limit. We show positive results in our model for a series of important problems that have been discussed in the streaming literature. These include computing the frequency moments, clustering problems and least squares regression. Our results also include lower bounds for problems, which have streaming algorithms in the ordinary setting but do not allow for sublinear space algorithms in our model.
Submitted 8 August, 2014;
originally announced August 2014.
-
Planar Graphs: Random Walks and Bipartiteness Testing
Authors:
Artur Czumaj,
Morteza Monemizadeh,
Krzysztof Onak,
Christian Sohler
Abstract:
We initiate the study of property testing in arbitrary planar graphs. We prove that bipartiteness can be tested in constant time, improving on the previous bound of $\tilde{O}(\sqrt{n})$ for graphs on $n$ vertices. The constant-time testability was only known for planar graphs with bounded degree.
Our algorithm is based on random walks. Since planar graphs have good separators, i.e., bad expansion, our analysis diverges from standard techniques that involve the fast convergence of random walks on expanders. We reduce the problem to the task of detecting an odd-parity cycle in a multigraph induced by constant-length cycles. We iteratively reduce the length of cycles while preserving the detection probability, until the multigraph collapses to a collection of easily discoverable self-loops.
Our approach extends to arbitrary minor-free graphs. We also believe that our techniques will find applications to testing other properties in arbitrary minor-free graphs.
Submitted 21 December, 2018; v1 submitted 8 July, 2014;
originally announced July 2014.
-
Property-Testing in Sparse Directed Graphs: 3-Star-Freeness and Connectivity
Authors:
Frank Hellweg,
Christian Sohler
Abstract:
We study property testing in directed graphs in the bounded degree model, where we assume that an algorithm may only query the outgoing edges of a vertex, a model proposed by Bender and Ron in 2002. As our first main result, we present a property testing algorithm for strong connectivity in this model, having a query complexity of $\mathcal{O}(n^{1-ε/(3+α)})$ for arbitrary $α>0$; it is based on a reduction to estimating the vertex indegree distribution. For subgraph-freeness we give a property testing algorithm with a query complexity of $\mathcal{O}(n^{1-1/k})$, where $k$ is the number of connected components in the queried subgraph which have no incoming edge. We furthermore take a look at the problem of testing whether a weakly connected graph contains vertices with a degree of at least $3$, which can be viewed as testing for freeness of all orientations of $3$-stars; as our second main result, we show that this property can be tested with a query complexity of $\mathcal{O}(\sqrt{n})$ instead of, what would be expected, $Ω(n^{2/3})$.
Submitted 2 December, 2013;
originally announced December 2013.
-
Analysis of Agglomerative Clustering
Authors:
Marcel R. Ackermann,
Johannes Blömer,
Daniel Kuntze,
Christian Sohler
Abstract:
The diameter $k$-clustering problem is the problem of partitioning a finite subset of $\mathbb{R}^d$ into $k$ subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem (for all values of $k$) is the agglomerative clustering algorithm with the complete linkage strategy. For decades, this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper, we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension $d$ is a constant, we show that for any $k$ the solution computed by this algorithm is an $O(\log k)$-approximation to the diameter $k$-clustering problem. Our analysis does not only hold for the Euclidean distance but for any metric that is based on a norm. Furthermore, we analyze the closely related $k$-center and discrete $k$-center problem. For the corresponding agglomerative algorithms, we deduce an approximation factor of $O(\log k)$ as well.
Submitted 7 March, 2014; v1 submitted 16 December, 2010;
originally announced December 2010.
-
Finding Cycles and Trees in Sublinear Time
Authors:
Artur Czumaj,
Oded Goldreich,
Dana Ron,
C. Seshadhri,
Asaf Shapira,
Christian Sohler
Abstract:
We present sublinear-time (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and tree-minors in bounded-degree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$-minor-free (resp., free from having the corresponding tree-minor). In particular, if the graph is far (i.e., $Ω(1)$-far) from being cycle-free, i.e. if one has to delete a constant fraction of edges to make it cycle-free, then the algorithm finds a cycle of polylogarithmic length in time $\tilde{O}(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors.
The foregoing results are the outcome of our study of the complexity of \emph{one-sided error} property testing algorithms in the bounded-degree graphs model. For example, we show that cycle-freeness of $N$-vertex graphs can be tested with one-sided error within time complexity $\tilde{O}(\mathrm{poly}(1/\epsilon)\cdot\sqrt{N})$. This matches the known $Ω(\sqrt{N})$ query lower bound, and contrasts with the fact that any minor-free property admits a \emph{two-sided error} tester of query complexity that only depends on the proximity parameter $\epsilon$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least $k$. On the other hand, for any fixed tree $T$, we show that $T$-minor-freeness has a one-sided error tester of query complexity that only depends on the proximity parameter $\epsilon$.
Our algorithm for finding cycles in bounded-degree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding tree-minors in $o(\sqrt{N})$ complexity.
Submitted 3 April, 2012; v1 submitted 23 July, 2010;
originally announced July 2010.