Search | arXiv e-print repository

A Simple and Optimal Sublinear Algorithm for Mean Estimation

Authors: Beatrice Bertolotti, Matteo Russo, Chris Schwiegelshohn

Abstract: We study the sublinear mean estimation problem. Specifically, we aim to output a point minimizing the sum of squared Euclidean distances. We show that a multiplicative $(1+\varepsilon)$ approximation can be found with probability $1-δ$ using $O(\varepsilon^{-1}\log δ^{-1})$ many independent random samples. We also provide a matching lower bound. We study the sublinear mean estimation problem. Specifically, we aim to output a point minimizing the sum of squared Euclidean distances. We show that a multiplicative $(1+\varepsilon)$ approximation can be found with probability $1-δ$ using $O(\varepsilon^{-1}\log δ^{-1})$ many independent random samples. We also provide a matching lower bound. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.01339 [pdf, other]

Sensitivity Sampling for $k$-Means: Worst Case and Stability Optimal Coreset Bounds

Authors: Nikhil Bansal, Vincent Cohen-Addad, Milind Prabhu, David Saulpic, Chris Schwiegelshohn

Abstract: Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $Ω$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A ver… ▽ More Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $Ω$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size $\tilde{O}(k/\varepsilon^2\min(\sqrt{k},\varepsilon^{-2}))$ for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with $Ω(1)$ cost stability, Sensitivity Sampling gives coresets of size $\tilde{O}(k/\varepsilon^2)$, improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: It is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that any coreset for stable instances consisting of only input points must have size $Ω(k/\varepsilon^2)$. Our results for Sensitivity Sampling also extend to the $k$-median problem, and more general metric spaces. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 57 pages

arXiv:2404.01936 [pdf, other]

Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

Authors: Andrew Draganov, David Saulpic, Chris Schwiegelshohn

Abstract: We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of poi… ▽ More We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2402.04035 [pdf, ps, other]

Low-Distortion Clustering with Ordinal and Limited Cardinal Information

Authors: Jakob Burkhardt, Ioannis Caragiannis, Karl Fehrs, Matteo Russo, Chris Schwiegelshohn, Sudarshan Shyam

Abstract: Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. Given a set of $n$ agents located in an underlying metric space, our goal is to partition them into $k$ clusters, optimizing some social cost objective. The metric space is defined by a distance function $d$ between the agent locations. Information about $d$ is available only… ▽ More Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. Given a set of $n$ agents located in an underlying metric space, our goal is to partition them into $k$ clusters, optimizing some social cost objective. The metric space is defined by a distance function $d$ between the agent locations. Information about $d$ is available only implicitly via $n$ rankings, through which each agent ranks all other agents in terms of their distance from her. Still, we would like to evaluate clustering algorithms in terms of social cost objectives that are defined using $d$. This is done using the notion of distortion, which measures how far from optimality a clustering can be, taking into account all underlying metrics that are consistent with the ordinal information available. Unfortunately, the most important clustering objectives do not admit algorithms with finite distortion. To sidestep this disappointing fact, we follow two alternative approaches: We first explore whether resource augmentation can be beneficial. We consider algorithms that use more than $k$ clusters but compare their social cost to that of the optimal $k$-clusterings. We show that using exponentially (in terms of $k$) many clusters, we can get low (constant or logarithmic) distortion for the $k$-center and $k$-median objectives. Interestingly, such an exponential blowup is shown to be necessary. More importantly, we explore whether limited cardinal information can be used to obtain better results. Somewhat surprisingly, for $k$-median and $k$-center, we show that a number of queries that is polynomial in $k$ and only logarithmic in $n$ (i.e., only sublinear in the number of agents for the most relevant scenarios in practice) is enough to get constant distortion. △ Less

Submitted 6 February, 2024; originally announced February 2024.

Comments: to appear in AAAI 2024

arXiv:2310.18146 [pdf, other]

Adaptive Out-Orientations with Applications

Authors: Chandra Chekuri, Aleksander Bjørn Christiansen, Jacob Holm, Ivor van der Hoog, Kent Quanrud, Eva Rotenberg, Chris Schwiegelshohn

Abstract: We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the out-degree of each vertex is bounded. On one hand, we show how to orient the edges such that the out-degree of each vertex is proportional to the arboricity $α$ of the graph, in, either, an amortised update time of $O(\log^2 n \log α)$, or a worst-case update time of $O(\log^3 n \log α)$. On the o… ▽ More We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the out-degree of each vertex is bounded. On one hand, we show how to orient the edges such that the out-degree of each vertex is proportional to the arboricity $α$ of the graph, in, either, an amortised update time of $O(\log^2 n \log α)$, or a worst-case update time of $O(\log^3 n \log α)$. On the other hand, motivated by applications including dynamic maximal matching, we obtain a different trade-off, namely either $O(\log n \log α)$, amortised, or $O(\log ^2 n \log α)$, worst-case time, for the problem of maintaining an edge-orientation with at most $O(α+ \log n)$ out-edges per vertex. Since our algorithms have update times with worst-case guarantees, the number of changes to the solution (i.e. the recourse) is naturally limited. Our algorithms adapt to the current arboricity of the graph, and yield improvements over previous work: Firstly, we obtain an $O(\varepsilon^{-6}\log^3 n \log ρ)$ worst-case update time algorithm for maintaining a $(1+\varepsilon)$ approximation of the maximum subgraph density, $ρ$. Secondly, we obtain an $O(\varepsilon^{-6}\log^3 n \log α)$ worst-case update time algorithm for maintaining a $(1 + \varepsilon) \cdot OPT + 2$ approximation of the optimal out-orientation of a graph with adaptive arboricity $α$. This yields the first worst-case polylogarithmic dynamic algorithm for decomposing into $O(α)$ forests.Thirdly, we obtain arboricity-adaptive fully-dynamic deterministic algorithms for a variety, of problems including maximal matching, $Δ+1$ coloring, and matrix vector multiplication. All update times are worst-case $O(α+\log^2n \log α)$, where $α$ is the current arboricity of the graph. △ Less

Submitted 4 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: To appear at SODA24

arXiv:2310.09127 [pdf, other]

On Generalization Bounds for Projective Clustering

Authors: Maria Sofia Bucarelli, Matilde Fjeldsø Larsen, Chris Schwiegelshohn, Mads Bech Toftrup

Abstract: Given a set of points, clustering consists of finding a partition of a point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$ dimensional subspaces, which gives rise to subspace clustering. In this paper,… ▽ More Given a set of points, clustering consists of finding a partition of a point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$ dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown, but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near optimal results. In particular, For center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends it to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{\frac{kj^2}{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show a convergence rate of $Ω\left(\sqrt{\frac{kj}{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] are essentially optimal. △ Less

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.04076 [pdf, other]

Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

Abstract: In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the preci… ▽ More In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension are $Ω(1)$ for both $k$-median and $k$-means, even when allowing a complexity FPT in the number of clusters $k$. This stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization. In this paper, we provide deterministic sketches constructions for clustering, whose size bounds are close to the best-known randomized ones. We also construct a deterministic algorithm for computing $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2/\varepsilon^{O(1)}} poly(nd)$, close to the best randomized complexity. Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS '22] by a factor $k$. △ Less

Submitted 6 October, 2023; originally announced October 2023.

Comments: FOCS 2023. Abstract reduced for arxiv requirements

arXiv:2304.02261 [pdf, other]

Optimal Sketching Bounds for Sparse Linear Regression

Authors: Tung Mai, Alexander Munteanu, Cameron Musco, Anup B. Rao, Chris Schwiegelshohn, David P. Woodruff

Abstract: We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $Θ(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This ex… ▽ More We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $Θ(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression. For this problem, under the $\ell_2$ norm, we observe an upper bound of $O(k \log (d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions including sparse logistic and sparse ReLU regression, we give the first known sketching bounds that achieve $o(d)$ rows showing that $O(μ^2 k\log(μn d/\varepsilon)/\varepsilon^2)$ rows suffice, where $μ$ is a natural complexity parameter needed to obtain relative error bounds for these loss functions. We again show that this dimension is tight, up to lower order terms and the dependence on $μ$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, where one aims to minimize $\|Ax-b\|_2^2+λ\|x\|_1$ over $x\in\mathbb{R}^d$. We show that sketching dimension $O(\log(d)/(λ\varepsilon)^2)$ suffices and that the dependence on $d$ and $λ$ is tight. △ Less

Submitted 5 April, 2023; originally announced April 2023.

Comments: AISTATS 2023

arXiv:2302.06165 [pdf, ps, other]

Sparse Dimensionality Reduction Revisited

Authors: Mikael Møller Høgsgaard, Lion Kamma, Kasper Green Larsen, Jelani Nelson, Chris Schwiegelshohn

Abstract: The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of $n$ points in $\mathbb{R}^d$ into $m=O(\varepsilon^{-2} \lg n)$ dimensions while preserving all pairwise distances to within $1 \pm \varepsilon$. Each input point $x$ is embedded to $Ax$, where $A$ is an $m \times d$ matrix having $s$ non-zeros per column, allowin… ▽ More The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of $n$ points in $\mathbb{R}^d$ into $m=O(\varepsilon^{-2} \lg n)$ dimensions while preserving all pairwise distances to within $1 \pm \varepsilon$. Each input point $x$ is embedded to $Ax$, where $A$ is an $m \times d$ matrix having $s$ non-zeros per column, allowing for an embedding time of $O(s \|x\|_0)$. Since the sparsity of $A$ governs the embedding time, much work has gone into improving the sparsity $s$. The current state-of-the-art by Kane and Nelson (JACM'14) shows that $s = O(\varepsilon ^{-1} \lg n)$ suffices. This is almost matched by a lower bound of $s = Ω(\varepsilon ^{-1} \lg n/\lg(1/\varepsilon))$ by Nelson and Nguyen (STOC'13). Previous work thus suggests that we have near-optimal embeddings. In this work, we revisit sparse embeddings and identify a loophole in the lower bound. Concretely, it requires $d \geq n$, which in many applications is unrealistic. We exploit this loophole to give a sparser embedding when $d = o(n)$, achieving $s = O(\varepsilon^{-1}(\lg n/\lg(1/\varepsilon)+\lg^{2/3}n \lg^{1/3} d))$. We also complement our analysis by strengthening the lower bound of Nelson and Nguyen to hold also when $d \ll n$, thereby matching the first term in our new sparsity upper bound. Finally, we also improve the sparsity of the best oblivious subspace embeddings for optimal embedding dimensionality. △ Less

Submitted 13 February, 2023; originally announced February 2023.

arXiv:2302.03071 [pdf, ps, other]

Optimally Interpolating between Ex-Ante Fairness and Welfare

Authors: Mikael Høgsgaard, Panagiotis Karras, Wenyue Ma, Nidhi Rathi, Chris Schwiegelshohn

Abstract: For the fundamental problem of allocating a set of resources among individuals with varied preferences, the quality of an allocation relates to the degree of fairness and the collective welfare achieved. Unfortunately, in many resource-allocation settings, it is computationally hard to maximize welfare while achieving fairness goals. In this work, we consider ex-ante notions of fairness; popular… ▽ More For the fundamental problem of allocating a set of resources among individuals with varied preferences, the quality of an allocation relates to the degree of fairness and the collective welfare achieved. Unfortunately, in many resource-allocation settings, it is computationally hard to maximize welfare while achieving fairness goals. In this work, we consider ex-ante notions of fairness; popular examples include the \emph{randomized round-robin algorithm} and \emph{sortition mechanism}. We propose a general framework to systematically study the \emph{interpolation} between fairness and welfare goals in a multi-criteria setting. We develop two efficient algorithms ($\varepsilon-Mix$ and $Simple-Mix$) that achieve different trade-off guarantees with respect to fairness and welfare. $\varepsilon-Mix$ achieves an optimal multi-criteria approximation with respect to fairness and welfare, while $Simple-Mix$ achieves optimality up to a constant factor with zero computational overhead beyond the underlying \emph{welfare-maximizing mechanism} and the \emph{ex-ante fair mechanism}. Our framework makes no assumptions on either of the two underlying mechanisms, other than that the fair mechanism produces a distribution over the set of all allocations. Indeed, if these mechanisms are themselves approximation algorithms, our framework will retain the approximation factor, guaranteeing sensitivity to the quality of the underlying mechanisms, while being \emph{oblivious} to them. We also give an extensive experimental analysis for the aforementioned ex-ante fair mechanisms on real data sets, confirming our theoretical analysis. △ Less

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2211.08184 [pdf, other]

Improved Coresets for Euclidean $k$-Means

Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, Omar Ali Sheikh-Omar

Abstract: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weigh… ▽ More Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a $(1\pm \varepsilon)$ factor. The current state of the art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $Ω(k \varepsilon^{-2})$. In this paper, we improve the upper bounds $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$. △ Less

Submitted 16 November, 2022; v1 submitted 15 November, 2022; originally announced November 2022.

arXiv:2209.14087 [pdf, ps, other]

Adaptive Out-Orientations with Applications

Authors: Aleksander B. G. Christiansen, Jacob Holm, Ivor van der Hoog, Eva Rotenberg, Chris Schwiegelshohn

Abstract: We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the out-degree of each vertex is bounded. On one hand, we show how to orient the edges such that the out-degree of each vertex is proportional to the arboricity $α$ of the graph, in a worst-case update time of $O(\log^3 n \log α)$. On the other hand, motivated by applications including dynamic maximal… ▽ More We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the out-degree of each vertex is bounded. On one hand, we show how to orient the edges such that the out-degree of each vertex is proportional to the arboricity $α$ of the graph, in a worst-case update time of $O(\log^3 n \log α)$. On the other hand, motivated by applications including dynamic maximal matching, we obtain a different trade-off, namely the improved worst case update time of $O(\log ^2 n \log α)$ for the problem of maintaining an edge-orientation with at most $O(α+ \log n)$ out-edges per vertex. Since our algorithms have update times with worst-case guarantees, the number of changes to the solution (i.e. the recourse) is naturally limited. Our algorithms adapt to the current arboricity of the graph, and yield improvements over previous work: Firstly, we obtain an $O(\varepsilon^{-6}\log^3 n \log ρ)$ worst-case update time algorithm for maintaining a $(1+\varepsilon)$ approximation of the maximum subgraph density, $ρ$. Secondly, we obtain an $O(\varepsilon^{-6}\log^3 n \log α)$ worst-case update time algorithm for maintaining a $(1 + \varepsilon) \cdot OPT + 2$ approximation of the optimal out-orientation of a graph with adaptive arboricity $α$. This yields the first worst-case polylogarithmic dynamic algorithm for decomposing into $O(α)$ forests.Thirdly, we obtain arboricity-adaptive fully-dynamic deterministic algorithms for a variety, of problems including maximal matching, $Δ+1$ coloring, and matrix vector multiplication. All update times are worst-case $O(α+\log^2n \log α)$, where $α$ is the current arboricity of the graph. △ Less

Submitted 15 February, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

arXiv:2209.01901 [pdf, ps, other]

The Power of Uniform Sampling for Coresets

Authors: Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H. -C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, Xuan Wu

Abstract: Motivated by practical generalizations of the classic $k$-median and $k$-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive er… ▽ More Motivated by practical generalizations of the classic $k$-median and $k$-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of $n$, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graph, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique yields also smaller coresets for $1$-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^2$ and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^3$. △ Less

Submitted 17 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

arXiv:2207.05150 [pdf, ps, other]

Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median

Authors: Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn

Abstract: The Uncapacitated Facility Location (UFL) problem is one of the most fundamental clustering problems: Given a set of clients $C$ and a set of facilities $F$ in a metric space $(C \cup F, dist)$ with facility costs $open : F \to \mathbb{R}^+$, the goal is to find a set of facilities $S \subseteq F$ to minimize the sum of the opening cost $open(S)$ and the connection cost… ▽ More The Uncapacitated Facility Location (UFL) problem is one of the most fundamental clustering problems: Given a set of clients $C$ and a set of facilities $F$ in a metric space $(C \cup F, dist)$ with facility costs $open : F \to \mathbb{R}^+$, the goal is to find a set of facilities $S \subseteq F$ to minimize the sum of the opening cost $open(S)$ and the connection cost $d(S) := \sum_{p \in C} \min_{c \in S} dist(p, c)$. An algorithm for UFL is called a Lagrangian Multiplier Preserving (LMP) $α$ approximation if it outputs a solution $S\subseteq F$ satisfying $open(S) + d(S) \leq open(S^*) + αd(S^*)$ for any $S^* \subseteq F$. The best-known LMP approximation ratio for UFL is at most $2$ by the JMS algorithm of Jain, Mahdian, and Saberi based on the Dual-Fitting technique. We present a (slightly) improved LMP approximation algorithm for UFL. This is achieved by combining the Dual-Fitting technique with Local Search, another popular technique to address clustering problems. From a conceptual viewpoint, our result gives a theoretical evidence that local search can be enhanced so as to avoid bad local optima by choosing the initial feasible solution with LP-based techniques. Using the framework of bipoint solutions, our result directly implies a (slightly) improved approximation for the $k$-Median problem from 2.6742 to 2.67059. △ Less

Submitted 11 July, 2022; originally announced July 2022.

Comments: 55 pages

arXiv:2207.00966 [pdf, other]

An Empirical Evaluation of $k$-Means Coresets

Authors: Chris Schwiegelshohn, Omar Ali Sheikh-Omar

Abstract: Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high performance coresets for clustering problems such as $k$-means in both theory and practice. Curiously, there exists no work on comparing the quality of available $k$-means coresets. In this paper we perform such an evaluation. There currently is no algorithm known to measure the distortion of… ▽ More Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high performance coresets for clustering problems such as $k$-means in both theory and practice. Curiously, there exists no work on comparing the quality of available $k$-means coresets. In this paper we perform such an evaluation. There currently is no algorithm known to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows us an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice. △ Less

Submitted 3 July, 2022; originally announced July 2022.

arXiv:2206.08646 [pdf, other]

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Authors: Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Andres Munoz, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii

Abstract: We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / ε^2)$, where $ε$ is the priv… ▽ More We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / ε^2)$, where $ε$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines. △ Less

Submitted 17 June, 2022; originally announced June 2022.

Comments: To appear at KDD'22

arXiv:2202.12793 [pdf, other]

Towards Optimal Lower Bounds for k-median and k-means Coresets

Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn

Abstract: Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern da… ▽ More Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$ clustering must consist of at least $Ω(k \varepsilon^{-2} \log n)$ and $Ω(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$ clustering must consists of at least $Ω(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points. △ Less

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2108.08825 [pdf, other]

Maintaining an EDCS in General Graphs: Simpler, Density-Sensitive and with Worst-Case Time Bounds

Authors: Fabrizio Grandoni, Chris Schwiegelshohn, Shay Solomon, Amitai Uzrad

Abstract: In their breakthrough ICALP'15 paper, Bernstein and Stein presented an algorithm for maintaining a $(3/2+ε)$-approximate maximum matching in fully dynamic {\em bipartite} graphs with a {\em worst-case} update time of $O_ε(m^{1/4})$; we use the $O_ε$ notation to suppress the $ε$-dependence. Their main technical contribution was in presenting a new type of bounded-degree subgraph, which they named a… ▽ More In their breakthrough ICALP'15 paper, Bernstein and Stein presented an algorithm for maintaining a $(3/2+ε)$-approximate maximum matching in fully dynamic {\em bipartite} graphs with a {\em worst-case} update time of $O_ε(m^{1/4})$; we use the $O_ε$ notation to suppress the $ε$-dependence. Their main technical contribution was in presenting a new type of bounded-degree subgraph, which they named an {\em edge degree constrained subgraph (EDCS)}, which contains a large matching -- of size that is smaller than the maximum matching size of the entire graph by at most a factor of $3/2+ε$. They demonstrate that the EDCS can be maintained with a worst-case update time of $O_ε(m^{1/4})$, and their main result follows as a direct corollary. In their followup SODA'16 paper, Bernstein and Stein generalized their result for general graphs, achieving the same update time of $O_ε(m^{1/4})$, albeit with an amortized rather than worst-case bound. To date, the best {\em deterministic} worst-case update time bound for {\em any} better-than-2 approximate matching is $O(\sqrt{m})$ [Neiman and Solomon, STOC'13], [Gupta and Peng, FOCS'13]; allowing randomization (against an oblivious adversary) one can achieve a much better (still polynomial) update time for approximation slightly below 2 [Behnezhad, Lacki and Mirrokni, SODA'20]. In this work we\footnote{\em quasi nanos, gigantium humeris insidentes} simplify the approach of Bernstein and Stein for bipartite graphs, which allows us to generalize it for general graphs while maintaining the same bound of $O_ε(m^{1/4})$ on the {\em worst-case} update time. Moreover, our approach is {\em density-sensitive}: If the {\em arboricity} of the dynamic graph is bounded by $α$ at all times, then the worst-case update time of the algorithm is $O_ε(\sqrtα)$. △ Less

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2104.06133 [pdf, other]

doi 10.1145/3406325.3451022

A New Coreset Framework for Clustering

Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

Abstract: Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known a… ▽ More Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known as \emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metric, minor-free metric, and the general metric cases. △ Less

Submitted 29 July, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: Improved presentation. Adds a simpler suboptimal proof for interesting points, and an improved analysis for planar graphs. Corrects errors in the construction of centroid sets

arXiv:2002.11621 [pdf, ps, other]

doi 10.1145/3308560.3317587

Algorithms for Fair Team Formation in Online Labour Marketplaces

Authors: Giorgio Barnabò, Adriano Fazzone, Stefano Leonardi, Chris Schwiegelshohn

Abstract: As freelancing work keeps on growing almost everywhere due to a sharp decrease in communication costs and to the widespread of Internet-based labour marketplaces (e.g., guru.com, feelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing. Since employers often use these platforms to find a group of workers to compl… ▽ More As freelancing work keeps on growing almost everywhere due to a sharp decrease in communication costs and to the widespread of Internet-based labour marketplaces (e.g., guru.com, feelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing. Since employers often use these platforms to find a group of workers to complete a specific task, researchers have focused their efforts on the study of team formation and matching algorithms and on the design of effective incentive schemes. Nevertheless, just recently, several concerns have been raised on possibly unfair biases introduced through the algorithms used to carry out these selection and matching procedures. For this reason, researchers have started studying the fairness of algorithms related to these online marketplaces, looking for intelligent ways to overcome the algorithmic bias that frequently arises. Broadly speaking, the aim is to guarantee that, for example, the process of hiring workers through the use of machine learning and algorithmic data analysis tools does not discriminate, even unintentionally, on grounds of nationality or gender. In this short paper, we define the Fair Team Formation problem in the following way: given an online labour marketplace where each worker possesses one or more skills, and where all workers are divided into two or more not overlap** classes (for examples, men and women), we want to design an algorithm that is able to find a team with all the skills needed to complete a given task, and that has the same number of people from all classes. We provide inapproximability results for the Fair Team Formation problem together with four algorithms for the problem itself. We also tested the effectiveness of our algorithmic solutions by performing experiments using real data from an online labor marketplace. △ Less

Submitted 14 February, 2020; originally announced February 2020.

Comments: Accepted at "FATES 2019 : 1st Workshop on Fairness, Accountability, Transparency, Ethics, and Society on the Web" (http://fates19.isti.cnr.it)

Journal ref: "Companion Proceedings of The 2019 World Wide Web Conference", 2019, pages 484-490

arXiv:2002.07892 [pdf, other]

Fair Clustering with Multiple Colors

Authors: Matteo Böhm, Adriano Fazzone, Stefano Leonardi, Chris Schwiegelshohn

Abstract: A fair clustering instance is given a data set $A$ in which every point is assigned some color. Colors correspond to various protected attributes such as sex, ethnicity, or age. A fair clustering is an instance where membership of points in a cluster is uncorrelated with the coloring of the points. Of particular interest is the case where all colors are equally represented. If we have exactly tw… ▽ More A fair clustering instance is given a data set $A$ in which every point is assigned some color. Colors correspond to various protected attributes such as sex, ethnicity, or age. A fair clustering is an instance where membership of points in a cluster is uncorrelated with the coloring of the points. Of particular interest is the case where all colors are equally represented. If we have exactly two colors, Chierrichetti, Kumar, Lattanzi and Vassilvitskii (NIPS 2017) showed that various $k$-clustering objectives admit a constant factor approximation. Since then, a number of follow up work has attempted to extend this result to a multi-color case, though so far, the only known results either result in no-constant factor approximation, apply only to special clustering objectives such as $k$-center, yield bicrititeria approximations, or require $k$ to be constant. In this paper, we present a simple reduction from unconstrained $k$-clustering to fair $k$-clustering for a large range of clustering objectives including $k$-median, $k$-means, and $k$-center. The reduction loses only a constant factor in the approximation guarantee, marking the first true constant factor approximation for many of these problems. △ Less

Submitted 5 March, 2021; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: Partially supported by the ERC Advanced Grant 788893 AMDROMA "Algorithmic and Mechanism Design Research in Online Markets" and MIUR PRIN project ALGADIMAR "Algorithms, Games, and Digital Markets"

arXiv:1905.13651 [pdf, other]

Principal Fairness: Removing Bias via Projections

Authors: Aris Anagnostopoulos, Luca Becchetti, Adriano Fazzone, Cristina Menghini, Chris Schwiegelshohn

Abstract: Reducing hidden bias in the data and ensuring fairness in algorithmic data analysis has recently received significant attention. We complement several recent papers in this line of research by introducing a general method to reduce bias in the data through random projections in a "fair" subspace. We apply this method to densest subgraph problem. For densest subgraph, our approach based on fair p… ▽ More Reducing hidden bias in the data and ensuring fairness in algorithmic data analysis has recently received significant attention. We complement several recent papers in this line of research by introducing a general method to reduce bias in the data through random projections in a "fair" subspace. We apply this method to densest subgraph problem. For densest subgraph, our approach based on fair projections allows to recover both theoretically and empirically an almost optimal, fair, dense subgraph hidden in the input data. We also show that, under the small set expansion hypothesis, approximating this problem beyond a factor of 2 is NP-hard and we show a polynomial time algorithm with a matching approximation bound. △ Less

Submitted 5 March, 2021; v1 submitted 31 May, 2019; originally announced May 2019.

Comments: Partially supported by the ERC Advanced Grant 788893 AMDROMA "Algorithmic and Mechanism Design Research in Online Markets" and MIUR PRIN project ALGADIMAR "Algorithms, Games, and Digital Markets"

arXiv:1904.06150 [pdf, ps, other]

Maximizing Online Utilization with Commitment

Authors: Chris Schwiegelshohn, Uwe Schwiegelshohn

Abstract: We investigate online scheduling with commitment for parallel identical machines. Our objective is to maximize the total processing time of accepted jobs. As soon as a job has been submitted, the commitment constraint forces us to decide immediately whether we accept or reject the job. Upon acceptance of a job, we must complete it before its deadline $d$ that satisfies $d \geq (1+ε)\cdot p + r$, w… ▽ More We investigate online scheduling with commitment for parallel identical machines. Our objective is to maximize the total processing time of accepted jobs. As soon as a job has been submitted, the commitment constraint forces us to decide immediately whether we accept or reject the job. Upon acceptance of a job, we must complete it before its deadline $d$ that satisfies $d \geq (1+ε)\cdot p + r$, with $p$ and $r$ being the processing time and the submission time of the job, respectively while $ε>0$ is the slack of the system. Since the hard case typically arises for near-tight deadlines, we consider $\varepsilon\leq 1$. We use competitive analysis to evaluate our algorithms. Our first main contribution is a deterministic preemptive online algorithm with an almost tight competitive ratio on any number of machines. For a single machine, the competitive factor matches the optimal bound $\frac{1+ε}ε$ of the greedy acceptance policy. Then the competitive ratio improves with an increasing number of machines and approaches $(1+ε)\cdot\ln \frac{1+ε}ε$ as the number of machines converges to infinity. This is an exponential improvement over the greedy acceptance policy for small $ε$. In the non-preemptive case, we present a deterministic algorithm on $m$ machines with a competitive ratio of $1+m\cdot \left(\frac{1+ε}ε\right)^{\frac{1}{m}}$. This matches the optimal bound of $2+\frac{1}ε$ of the greedy acceptance policy for a single machine while it again guarantees an exponential improvement over the greedy acceptance policy for small $ε$ and large $m$. In addition, we determine an almost tight lower bound that approaches $m\cdot \left(\frac{1}ε\right)^{\frac{1}{m}}$ for large $m$ and small $ε$. △ Less

Submitted 12 April, 2019; originally announced April 2019.

Comments: 13 pages

MSC Class: 68W27; 68W40

arXiv:1812.10854 [pdf, ps, other]

Fair Coresets and Streaming Algorithms for Fair k-Means Clustering

Authors: Melanie Schmidt, Chris Schwiegelshohn, Christian Sohler

Abstract: We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which c… ▽ More We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering. △ Less

Submitted 9 March, 2021; v1 submitted 27 December, 2018; originally announced December 2018.

arXiv:1805.08571 [pdf, other]

On Coresets for Logistic Regression

Authors: Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, David P. Woodruff

Abstract: Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $μ(X)$, which quantifies… ▽ More Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $μ(X)$, which quantifies the hardness of compressing a data set for logistic regression. $μ(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $μ(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression. △ Less

Submitted 8 March, 2021; v1 submitted 22 May, 2018; originally announced May 2018.

arXiv:1701.08423 [pdf, other]

On the Local Structure of Stable Clustering Instances

Authors: Vincent Cohen-Addad, Chris Schwiegelshohn

Abstract: We study the classic $k$-median and $k$-means clustering objectives in the beyond-worst-case scenario. We consider three well-studied notions of structured data that aim at characterizing real-world inputs: Distribution Stability (introduced by Awasthi, Blum, and Sheffet, FOCS 2010), Spectral Separability (introduced by Kumar and Kannan, FOCS 2010), Perturbation Resilience (introduced by Bilu and… ▽ More We study the classic $k$-median and $k$-means clustering objectives in the beyond-worst-case scenario. We consider three well-studied notions of structured data that aim at characterizing real-world inputs: Distribution Stability (introduced by Awasthi, Blum, and Sheffet, FOCS 2010), Spectral Separability (introduced by Kumar and Kannan, FOCS 2010), Perturbation Resilience (introduced by Bilu and Linial, ICS 2010). We prove structural results showing that inputs satisfying at least one of the conditions are inherently "local". Namely, for any such input, any local optimum is close both in term of structure and in term of objective value to the global optima. As a corollary we obtain that the widely-used Local Search algorithm has strong performance guarantees for both the tasks of recovering the underlying optimal clustering and obtaining a clustering of small cost. This is a significant step toward understanding the success of local search heuristics in clustering applications. △ Less

Submitted 10 August, 2017; v1 submitted 29 January, 2017; originally announced January 2017.

arXiv:1605.03949 [pdf, other]

Efficient Similarity Search in Dynamic Data Streams

Authors: Marc Bury, Chris Schwiegelshohn, Mara Sorella

Abstract: The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A\cup B\vert}$ of two item sets $A$ and… ▽ More The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A\cup B\vert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015] who showed any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we can show that there exists a space efficient summary maintaining a $(1\pm \varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields a $\varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply a LSH scheme in dynamic data streams. Our second and main contribution now lies in the design of such a LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement and to the best of our knowledge the first of its kind able to process deletions. △ Less

Submitted 8 March, 2021; v1 submitted 12 May, 2016; originally announced May 2016.

arXiv:1505.02019 [pdf, other]

Sublinear Estimation of Weighted Matchings in Dynamic Data Streams

Authors: Marc Bury, Chris Schwiegelshohn

Abstract: This paper presents an algorithm for estimating the weight of a maximum weighted matching by augmenting any estimation routine for the size of an unweighted matching. The algorithm is implementable in any streaming model including dynamic graph streams. We also give the first constant estimation for the maximum matching size in a dynamic graph stream for planar graphs (or any graph with bounded ar… ▽ More This paper presents an algorithm for estimating the weight of a maximum weighted matching by augmenting any estimation routine for the size of an unweighted matching. The algorithm is implementable in any streaming model including dynamic graph streams. We also give the first constant estimation for the maximum matching size in a dynamic graph stream for planar graphs (or any graph with bounded arboricity) using $\tilde{O}(n^{4/5})$ space which also extends to weighted matching. Using previous results by Kapralov, Khanna, and Sudan (2014) we obtain a $\mathrm{polylog}(n)$ approximation for general graphs using $\mathrm{polylog}(n)$ space in random order streams, respectively. In addition, we give a space lower bound of $Ω(n^{1-\varepsilon})$ for any randomized algorithm estimating the size of a maximum matching up to a $1+O(\varepsilon)$ factor for adversarial streams. △ Less

Submitted 9 July, 2015; v1 submitted 8 May, 2015; originally announced May 2015.

arXiv:1504.01584

Random Projections for k-Means: Maintaining Coresets Beyond Merge & Reduce

Authors: Marc Bury, Chris Schwiegelshohn

Abstract: We give a new construction for a small space summary satisfying the coreset guarantee of a data set with respect to the $k$-means objective function. The number of points required in an offline construction is in $\tilde{O}(k ε^{-2}\min(d,kε^{-2}))$ which is minimal among all available constructions. Aside from two constructions with exponential dependence on the dimension, all known coresets ar… ▽ More We give a new construction for a small space summary satisfying the coreset guarantee of a data set with respect to the $k$-means objective function. The number of points required in an offline construction is in $\tilde{O}(k ε^{-2}\min(d,kε^{-2}))$ which is minimal among all available constructions. Aside from two constructions with exponential dependence on the dimension, all known coresets are maintained in data streams via the merge and reduce framework, which incurs are large space dependency on $\log n$. Instead, our construction crucially relies on Johnson-Lindenstrauss type embeddings which combined with results from online algorithms give us a new technique for efficiently maintaining coresets in data streams without relying on merge and reduce. The final number of points stored by our algorithm in a data stream is in $\tilde{O}(k^2 ε^{-2} \log^2 n \min(d,kε^{-2}))$. △ Less

Submitted 18 February, 2020; v1 submitted 7 April, 2015; originally announced April 2015.

Comments: This paper has been withdrawn due to an error in Theorem 1

Showing 1–29 of 29 results for author: Schwiegelshohn, C