-
Moderate Dimension Reduction for $k$-Center Clustering
Authors:
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Shay Sapir
Abstract:
The Johnson-Lindenstrauss (JL) Lemma introduced the concept of dimension reduction via a random linear map, which has become a fundamental technique in many computational settings. For a set of $n$ points in $\mathbb{R}^d$ and any fixed $ε>0$, it reduces the dimension $d$ to $O(\log n)$ while preserving, with high probability, all the pairwise Euclidean distances within factor $1+ε$. Perhaps surprisingly, the target dimension can be lower if one only wishes to preserve the optimal value of a certain problem on the pointset, e.g., Euclidean max-cut or $k$-means. However, for some notorious problems, like diameter (aka furthest pair), dimension reduction via the JL map to below $O(\log n)$ does not preserve the optimal value within factor $1+ε$.
We propose to focus on another regime, of \emph{moderate dimension reduction}, where a problem's value is preserved within factor $α>1$ using target dimension $\tfrac{\log n}{\mathrm{poly}(α)}$. We establish the viability of this approach and show that the famous $k$-center problem is $α$-approximated when reducing to dimension $O(\tfrac{\log n}{α^2}+\log k)$. Along the way, we address the diameter problem via the special case $k=1$. Our result extends to several important variants of $k$-center (with outliers, capacities, or fairness constraints), and the bound improves further with the input's doubling dimension.
While our $\mathrm{poly}(α)$-factor improvement in the dimension may seem small, it actually has significant implications for streaming algorithms, and easily yields an algorithm for $k$-center in dynamic geometric streams that achieves $O(α)$-approximation using space $\mathrm{poly}(kdn^{1/α^2})$. This is the first algorithm to beat $O(n)$ space in high dimension $d$, as all previous algorithms require space at least $\exp(d)$. Furthermore, it extends to the $k$-center variants mentioned above.
Submitted 16 June, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
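The pipeline described in this abstract (project with a random linear map, then solve $k$-center in the lower dimension) can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian map is one standard JL construction, the farthest-first traversal of Gonzalez is a classical 2-approximation swapped in as the low-dimensional solver, and the names `jl_project` and `greedy_k_center` are hypothetical, not taken from the paper.

```python
import math
import random


def jl_project(points, target_dim, seed=0):
    """Apply one random Gaussian linear map R^d -> R^target_dim to all points."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    # Entries are i.i.d. N(0, 1/target_dim), so squared norms are preserved in expectation.
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(target_dim)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def greedy_k_center(points, k):
    """Farthest-first traversal (Gonzalez); returns (center indices, covering radius)."""
    centers = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(far)
        # Each point keeps the distance to its nearest chosen center.
        dist = [min(old, math.dist(p, points[far])) for old, p in zip(dist, points)]
    return centers, max(dist)
```

Running `greedy_k_center` on both `points` and `jl_project(points, target_dim)` gives two radii whose ratio can be compared empirically for different target dimensions; the sketch makes no claim about which ratio to expect.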
-
Fully Dynamic Algorithms for Euclidean Steiner Tree
Authors:
T-H. Hubert Chan,
Gramoz Goranci,
Shaofeng H. -C. Jiang,
Bo Wang,
Quan Xue
Abstract:
The Euclidean Steiner tree problem asks to find a min-cost metric graph that connects a given set of \emph{terminal} points $X$ in $\mathbb{R}^d$, possibly using points not in $X$ which are called Steiner points. Even though near-linear time $(1 + ε)$-approximation was obtained in the offline setting in seminal works of Arora and Mitchell, efficient dynamic algorithms for Steiner tree are still open. We give the first algorithm that (implicitly) maintains a $(1 + ε)$-approximate solution which is accessed via a set of tree traversal queries, subject to point insertions and deletions, with amortized update and query time $O(\mathrm{poly}\log n)$ with high probability. Our approach is based on an Arora-style geometric dynamic programming, and our main technical contribution is to maintain the DP subproblems in the dynamic setting efficiently. We also need to augment the DP subproblems to support the tree traversal queries.
Submitted 30 November, 2023;
originally announced November 2023.
-
Fully Scalable MPC Algorithms for Clustering in High Dimension
Authors:
Artur Czumaj,
Guichen Gao,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Pavel Veselý
Abstract:
We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine may be $n^σ$ for arbitrarily small fixed $σ>0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds.
We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully-scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in a general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, like low dimension or a small number of clusters $k$; e.g. [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters of cost within $O(1/\varepsilon^2)$-factor of the optimum for $k$ clusters.
A primary technical tool that we introduce, and may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
Submitted 6 July, 2024; v1 submitted 15 July, 2023;
originally announced July 2023.
-
Near-Optimal Quantum Coreset Construction Algorithms for Clustering
Authors:
Yecheng Xue,
Xiaoyu Chen,
Tongyang Li,
Shaofeng H. -C. Jiang
Abstract:
$k$-Clustering in $\mathbb{R}^d$ (e.g., $k$-median and $k$-means) is a fundamental machine learning problem. While near-linear time approximation algorithms were known in the classical setting for a dataset with cardinality $n$, it remains open to find sublinear-time quantum algorithms. We give quantum algorithms that find coresets for $k$-clustering in $\mathbb{R}^d$ with $\tilde{O}(\sqrt{nk}d^{3/2})$ query complexity. Our coreset reduces the input size from $n$ to $\mathrm{poly}(kε^{-1}d)$, so that existing $α$-approximation algorithms for clustering can run on top of it and yield $(1 + ε)α$-approximation. This eventually yields a quadratic speedup for various $k$-clustering approximation algorithms. We complement our algorithm with a nearly matching lower bound, that any quantum algorithm must make $Ω(\sqrt{nk})$ queries in order to achieve even $O(1)$-approximation for $k$-clustering.
Submitted 5 June, 2023;
originally announced June 2023.
-
Algorithms for the Generalized Poset Sorting Problem
Authors:
Shaofeng H. -C. Jiang,
Wenqian Wang,
Yubo Zhang,
Yuhao Zhang
Abstract:
We consider a generalized poset sorting problem (GPS), in which we are given a query graph $G = (V, E)$ and an unknown poset $\mathcal{P}(V, \prec)$ that is defined on the same vertex set $V$, and the goal is to make as few queries as possible to edges in $G$ in order to fully recover $\mathcal{P}$, where each query $(u, v)$ returns the relation between $u, v$, i.e., $u \prec v$, $v \prec u$ or $u \not \sim v$. This generalizes both the poset sorting problem [Faigle et al., SICOMP 88] and the generalized sorting problem [Huang et al., FOCS 11].
We give algorithms with $\tilde{O}(n\cdot \mathrm{poly}(k))$ query complexity when $G$ is a complete bipartite graph or $G$ is stochastic under the Erdős-Rényi model, where $k$ is the \emph{width} of the poset, and these generalize [Daskalakis et al., SICOMP 11] which only studies a complete graph $G$. Both results are based on a unified framework that reduces the poset sorting to partitioning the vertices with respect to a given pivot element, which may be of independent interest.
Our study of GPS also leads to a new $\tilde{O}(n^{1 - 1 / (2W)})$ competitive ratio for the so-called weighted generalized sorting problem where $W$ is the number of distinct weights in the query graph. This problem was considered as an open question in [Charikar et al., JCSS 02], and our result makes important progress as it yields the first nontrivial sublinear ratio for general weighted query graphs (for any bounded $W$). We obtain this via an $\tilde{O}(nk + n^{1.5})$ query complexity algorithm for the case where every edge in $G$ is guaranteed to be comparable in the poset, which generalizes a $\tilde{O}(n^{1.5})$ bound for generalized sorting [Huang et al., FOCS 11].
Submitted 15 July, 2023; v1 submitted 4 April, 2023;
originally announced April 2023.
-
The Power of Uniform Sampling for $k$-Median
Authors:
Lingxiao Huang,
Shaofeng H. -C. Jiang,
Jianing Lou
Abstract:
We study the power of uniform sampling for $k$-Median in various metric spaces. We relate the query complexity for approximating $k$-Median, to a key parameter of the dataset, called the balancedness $β\in (0, 1]$ (with $1$ being perfectly balanced). We show that any algorithm must make $Ω(1 / β)$ queries to the point set in order to achieve $O(1)$-approximation for $k$-Median. This particularly implies existing constructions of coresets, a popular data reduction technique, cannot be query-efficient. On the other hand, we show a simple uniform sample of $\mathrm{poly}(k ε^{-1} β^{-1})$ points suffices for $(1 + ε)$-approximation for $k$-Median for various metric spaces, which nearly matches the lower bound. We conduct experiments to verify that in many real datasets, the balancedness parameter is usually well bounded, and that the uniform sampling performs consistently well even for the case with moderately large balancedness, which justifies that uniform sampling is indeed a viable approach for solving $k$-Median.
Submitted 22 February, 2023;
originally announced February 2023.
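The idea of estimating a $k$-Median cost from a uniform sample can be illustrated with a minimal sketch. It assumes Euclidean points, sampling without replacement, and a fixed candidate center set; the function names are hypothetical, and the sketch does not reproduce the paper's sample-size bound or its balancedness analysis.

```python
import math
import random


def k_median_cost(points, centers, weight=1.0):
    """Sum of (weighted) distances from each point to its nearest center."""
    return weight * sum(min(math.dist(p, c) for c in centers) for p in points)


def uniform_sample_estimate(points, centers, sample_size, seed=0):
    """Estimate the k-Median cost from a uniform sample reweighted by n / sample_size."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)  # uniform, without replacement
    return k_median_cost(sample, centers, weight=len(points) / sample_size)
```

Each sampled point stands in for `n / sample_size` input points, so the estimate is unbiased for any fixed center set; how well it concentrates is exactly what a parameter such as the balancedness $β$ is meant to control.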
-
Streaming Euclidean Max-Cut: Dimension vs Data Reduction
Authors:
Xiaoyu Chen,
Shaofeng H. -C. Jiang,
Robert Krauthgamer
Abstract:
Max-Cut is a fundamental problem that has been studied extensively in various settings. We design an algorithm for Euclidean Max-Cut, where the input is a set of points in $\mathbb{R}^d$, in the model of dynamic geometric streams, where the input $X\subseteq [Δ]^d$ is presented as a sequence of point insertions and deletions. Previously, Frahling and Sohler [STOC 2005] designed a $(1+ε)$-approximation algorithm for the low-dimensional regime, i.e., it uses space $\exp(d)$.
To tackle this problem in the high-dimensional regime, which is of growing interest, one must improve the dependence on the dimension $d$, ideally to space complexity $\mathrm{poly}(ε^{-1} d \logΔ)$. Lammersen, Sidiropoulos, and Sohler [WADS 2009] proved that Euclidean Max-Cut admits dimension reduction with target dimension $d' = \mathrm{poly}(ε^{-1})$. Combining this with the aforementioned algorithm that uses space $\exp(d')$, they obtain an algorithm whose overall space complexity is indeed polynomial in $d$, but unfortunately exponential in $ε^{-1}$.
We devise an alternative approach of \emph{data reduction}, based on importance sampling, and achieve space bound $\mathrm{poly}(ε^{-1} d \logΔ)$, which is exponentially better (in $ε$) than the dimension-reduction approach. To implement this scheme in the streaming model, we employ a randomly-shifted quadtree to construct a tree embedding. While this is a well-known method, a key feature of our algorithm is that the embedding's distortion $O(d\logΔ)$ affects only the space complexity, and the approximation ratio remains $1+ε$.
Submitted 29 March, 2023; v1 submitted 9 November, 2022;
originally announced November 2022.
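The randomly-shifted quadtree mentioned in this abstract is a standard construction, and a minimal sketch may help make it concrete. The assumptions here are loud ones: points lie in $\{1,\ldots,Δ\}^d$ with $Δ$ a power of two, the shift is an integer vector drawn uniformly from $\{0,\ldots,Δ-1\}^d$, and the function names are hypothetical. The sketch only computes cell memberships; it is not the streaming algorithm of the paper.

```python
import random


def shifted_quadtree_cells(points, delta, seed=0):
    """Return, per level, the cell id of every point in a randomly shifted quadtree.

    Level 0 is a single cell of side 2 * delta; the side halves at each level
    until it reaches 1.
    """
    rng = random.Random(seed)
    d = len(points[0])
    shift = [rng.randrange(delta) for _ in range(d)]
    levels = []
    side = 2 * delta
    while side >= 1:
        # A cell is identified by the integer coordinates of its grid position.
        levels.append([tuple((x + s) // side for x, s in zip(p, shift)) for p in points])
        side //= 2
    return levels


def separation_level(levels, i, j):
    """First level at which points i and j fall into different cells (None if never)."""
    for depth, cells in enumerate(levels):
        if cells[i] != cells[j]:
            return depth
    return None
```

In the induced tree embedding, two points separated at a shallow level are treated as far apart, which is where the $O(d\logΔ)$ distortion quoted in the abstract comes from.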
-
Near-optimal Coresets for Robust Clustering
Authors:
Lingxiao Huang,
Shaofeng H. -C. Jiang,
Jianing Lou,
Xuan Wu
Abstract:
We consider robust clustering problems in $\mathbb{R}^d$, specifically $k$-clustering problems (e.g., $k$-Median and $k$-Means) with $m$ outliers, where the cost for a given center set $C \subset \mathbb{R}^d$ aggregates the distances from $C$ to all but the furthest $m$ data points, instead of all points as in classical clustering. We focus on the $ε$-coreset for robust clustering, a small proxy of the dataset that preserves the clustering cost within $ε$-relative error for all center sets. Our main result is an $ε$-coreset of size $O(m + \mathrm{poly}(k ε^{-1}))$ that can be constructed in near-linear time. This significantly improves previous results, which either suffer an exponential dependence on $(m + k)$ [Feldman and Schulman, SODA'12], or have a weaker bi-criteria guarantee [Huang et al., FOCS'18]. Furthermore, we show this dependence in $m$ is nearly-optimal, and the fact that it is isolated from other factors may be crucial for dealing with a large number of outliers. We construct our coresets by adapting to the outlier setting a recent framework [Braverman et al., FOCS'22] which was designed for capacity-constrained clustering, overcoming a new challenge that the participating terms in the cost, particularly the excluded $m$ outlier points, are dependent on the center set $C$. We validate our coresets on various datasets, and we observe a superior size-accuracy tradeoff compared with popular baselines including uniform sampling and sensitivity sampling. We also achieve a significant speedup of existing approximation algorithms for robust clustering using our coresets.
Submitted 19 October, 2022;
originally announced October 2022.
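The robust objective defined in this abstract (sum the distances to the center set over all but the furthest $m$ points) is short enough to write out directly. The sketch below assumes Euclidean points; the function name and the `power` parameter are hypothetical conveniences, not notation from the paper.

```python
import math


def robust_cost(points, centers, m, power=1):
    """Clustering cost that ignores the m points furthest from the center set.

    power=1 gives k-Median with m outliers; power=2 gives k-Means with m outliers.
    """
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    kept = dists[: len(dists) - m]  # drop the m largest distances
    return sum(x ** power for x in kept)
```

Note that which points count as outliers changes with the center set: in the data `[[0.0], [1.0], [2.0], [100.0]]` with `m=1`, the point `100.0` is excluded for the center `[0.0]`, while `0.0` is excluded for the center `[100.0]`. This dependence is the challenge the abstract refers to.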
-
On The Relative Error of Random Fourier Features for Preserving Kernel Distance
Authors:
Kuan Cheng,
Shaofeng H. -C. Jiang,
Luojian Wei,
Zhide Wei
Abstract:
The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS'07), is a powerful technique to find approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, the ability to preserve the kernel distance with \emph{relative} error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing as long as the shift-invariant kernel is analytic, RFF with $\mathrm{poly}(ε^{-1} \log n)$ dimensions achieves $ε$-relative error for pairwise kernel distance of $n$ points, and the dimension bound is improved to $\mathrm{poly}(ε^{-1}\log k)$ for the specific application of kernel $k$-means. Finally, going beyond RFF, we make the first step towards data-oblivious dimension-reduction for general shift-invariant kernels, and we obtain a similar $\mathrm{poly}(ε^{-1} \log n)$ dimension bound for Laplacian kernels. We also validate the dimension-error tradeoff of our methods on simulated datasets, and they demonstrate superior performance compared with other popular methods including random-projection and Nyström methods.
Submitted 13 April, 2023; v1 submitted 1 October, 2022;
originally announced October 2022.
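For readers unfamiliar with random Fourier features, a minimal sketch for one specific case may be useful. It assumes the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/2)$, whose frequencies are standard normal, and uses paired cosine/sine features; the function names are hypothetical. It illustrates the RFF map and the kernel distance only, not the relative-error analysis of the paper.

```python
import math
import random


def rff_map(dim, num_features, seed=0):
    """Sample frequencies for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2)."""
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]
    scale = 1.0 / math.sqrt(num_features)

    def embed(x):
        # Paired cos/sin features: the inner product of two embeddings is an
        # unbiased estimate of k(x, y), and every embedding has unit norm.
        proj = [sum(w * xi for w, xi in zip(row, x)) for row in freqs]
        return [scale * math.cos(t) for t in proj] + [scale * math.sin(t) for t in proj]

    return embed


def kernel_distance_sq(x, y):
    """Exact squared kernel distance k(x,x) + k(y,y) - 2 k(x,y) for the Gaussian kernel."""
    return 2.0 - 2.0 * math.exp(-math.dist(x, y) ** 2 / 2.0)
```

The squared Euclidean distance between `embed(x)` and `embed(y)` estimates `kernel_distance_sq(x, y)` with additive error; when the true kernel distance is tiny, that additive error can be large in relative terms, which is the regime the abstract is concerned with.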
-
The Power of Uniform Sampling for Coresets
Authors:
Vladimir Braverman,
Vincent Cohen-Addad,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Chris Schwiegelshohn,
Mads Bech Toftrup,
Xuan Wu
Abstract:
Motivated by practical generalizations of the classic $k$-median and $k$-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of $n$, the number of input points.
Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique also yields smaller coresets for $1$-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^2$ and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^3$.
Submitted 17 September, 2022; v1 submitted 5 September, 2022;
originally announced September 2022.
-
Streaming Facility Location in High Dimension via Geometric Hashing
Authors:
Artur Czumaj,
Arnold Filtser,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Pavel Veselý,
Mingwei Yang
Abstract:
In Euclidean Uniform Facility Location (UFL), the input is a set of clients in $\mathbb{R}^d$ and the goal is to place facilities to serve them, so as to minimize the total cost of opening facilities plus connecting the clients. We study the setting of dynamic geometric streams, where the clients are presented as a sequence of insertions and deletions of points in the grid $\{1,\ldots,Δ\}^d$, and we focus on the \emph{high-dimensional regime}, where the algorithm must use space polynomial in $d\cdot\logΔ$.
We present a new algorithmic framework, based on importance sampling, for $O(1)$-approximation of UFL using only $\mathrm{poly}(d\cdot\logΔ)$ space. This framework is easy to implement in two passes, one for sampling points and the other for estimating their contribution. Over random-order streams, we can extend this to one pass by using the two halves of the stream separately. Our main result, for arbitrary-order streams, computes $O(d / \log d)$-approximation in one pass by combining the two passes differently. This improves upon previous algorithms that either need space $\exp(d)$ or only guarantee $O(d\cdot\log^2Δ)$-approximation, and therefore our algorithms for high dimension are the first to avoid the $O(\logΔ)$-factor in approximation that is inherent to the widely-used quadtree decomposition. Our improvement is achieved by employing a geometric hashing scheme that maps points in $\mathbb{R}^d$ into buckets of bounded diameter, with the key property that every point set of small-enough diameter is hashed into few buckets. By applying an alternative bound for this hashing, we also obtain an $O(1 / ε)$-approximation in one pass, using larger but still sublinear space $O(n^ε)$ where $n$ is the number of clients.
We complement our results by showing $1.085$-approximation requires space exponential in $\mathrm{poly}(d\cdot\logΔ)$.
Submitted 28 January, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Enhanced charge density wave with mobile superconducting vortices in La$_{1.885}$Sr$_{0.115}$CuO$_4$
Authors:
J. -J. Wen,
W. He,
H. Jang,
H. Nojiri,
S. Matsuzawa,
S. Song,
M. Chollet,
D. Zhu,
Y. -J. Liu,
M. Fujita,
J. M. Jiang,
C. R. Rotundu,
C. -C. Kao,
H. -C. Jiang,
J. -S. Lee,
Y. S. Lee
Abstract:
Superconductivity in the cuprates is found to be intertwined with charge and spin density waves. Determining the interactions between the different types of order is crucial for understanding these important materials. Here, we elucidate the role of the charge density wave (CDW) in the prototypical cuprate La$_{1.885}$Sr$_{0.115}$CuO$_4$, by studying the effects of large magnetic fields ($H$) up to 24 Tesla. At low temperatures ($T$), the observed CDW peaks reveal two distinct regions in the material: a majority phase with short-range CDW coexisting with superconductivity, and a minority phase with longer-range CDW coexisting with static spin density wave (SDW). With increasing magnetic field, the CDW first grows smoothly in a manner similar to the SDW. However, at high fields we discover a sudden increase in the CDW amplitude upon entering the vortex-liquid state. Our results signify strong coupling of the CDW to mobile superconducting vortices and link enhanced CDW amplitude with local superconducting pairing across the $H-T$ phase diagram.
Submitted 11 November, 2021;
originally announced November 2021.
-
Online Facility Location with Predictions
Authors:
Shaofeng H. -C. Jiang,
Erzhi Liu,
You Lyu,
Zhihao Gavin Tang,
Yubo Zhang
Abstract:
We provide nearly optimal algorithms for online facility location (OFL) with predictions. In OFL, $n$ demand points arrive in order and the algorithm must irrevocably assign each demand point to an open facility upon its arrival. The objective is to minimize the total connection costs from demand points to assigned facilities plus the facility opening cost. We further assume the algorithm is additionally given for each demand point $x_i$ a natural prediction $f_{x_i}^{\mathrm{pred}}$ which is supposed to be the facility $f_{x_i}^{\mathrm{opt}}$ that serves $x_i$ in the offline optimal solution.
Our main result is an $O(\min\{\log {\frac{nη_\infty}{\mathrm{OPT}}}, \log{n} \})$-competitive algorithm where $η_\infty$ is the maximum prediction error (i.e., the distance between $f_{x_i}^{\mathrm{pred}}$ and $f_{x_i}^{\mathrm{opt}}$). Our algorithm overcomes the fundamental $Ω(\frac{\log n}{\log \log n})$ lower bound of OFL (without predictions) when $η_\infty$ is small, and it still maintains $O(\log n)$ ratio even when $η_\infty$ is unbounded. Furthermore, our theoretical analysis is supported by empirical evaluations for the tradeoffs between $η_\infty$ and the competitive ratio on various real datasets of different types.
Submitted 5 August, 2022; v1 submitted 17 October, 2021;
originally announced October 2021.
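To make the online setting concrete, here is a sketch of the classical prediction-free baseline, not the algorithm of this paper: a Meyerson-style randomized rule that opens a facility at an arriving demand with probability proportional to its distance from the nearest open facility. The helper `max_prediction_error` computes the quantity $η_\infty$ as defined in the abstract. Both function names, and the uniform opening cost, are assumptions made for illustration.

```python
import math
import random


def meyerson_ofl(demands, opening_cost, seed=0):
    """Meyerson-style randomized online facility location (no predictions).

    Returns (facilities, total_cost). Each demand is handled irrevocably on arrival.
    """
    rng = random.Random(seed)
    facilities = []
    total = 0.0
    for x in demands:
        nearest = min((math.dist(x, f) for f in facilities), default=math.inf)
        # Open a facility at x with probability min(nearest / opening_cost, 1).
        if rng.random() < min(nearest / opening_cost, 1.0):
            facilities.append(x)
            total += opening_cost
        else:
            total += nearest
    return facilities, total


def max_prediction_error(predicted, optimal):
    """eta_infinity: largest distance between a predicted and an optimal facility."""
    return max(math.dist(p, o) for p, o in zip(predicted, optimal))
```

The first demand always opens a facility because its distance to the (empty) facility set is infinite, and a demand that coincides with an open facility never opens a new one.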
-
Coresets for Kernel Clustering
Authors:
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Jianing Lou,
Yubo Zhang
Abstract:
We devise coresets for kernel $k$-Means with a general kernel, and use them to obtain new, more efficient, algorithms. Kernel $k$-Means has superior clustering capability compared to classical $k$-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs.
Our main result is a coreset for kernel $k$-Means that works for a general kernel and has size $\mathrm{poly}(kε^{-1})$. Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in $n$. This result immediately implies new algorithms for kernel $k$-Means, such as a $(1+ε)$-approximation in time near-linear in $n$, and a streaming algorithm using space and update time $\mathrm{poly}(k ε^{-1} \log n)$.
We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel $k$-Means++ (the kernelized version of the widely used $k$-Means++ algorithm), and we further use this faster kernel $k$-Means++ for spectral clustering. In both applications, we achieve significant speedup and a better asymptotic growth while the error is comparable to baselines that do not use coresets.
Submitted 6 April, 2024; v1 submitted 6 October, 2021;
originally announced October 2021.
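The computational challenge mentioned in this abstract can be seen from how the kernel $k$-Means cost is evaluated: the cluster mean lives in feature space, so the cost must be expressed through kernel evaluations alone. The sketch below uses the standard expansion $\|φ(x)-μ_S\|^2 = k(x,x) - \tfrac{2}{|S|}\sum_{y\in S} k(x,y) + \tfrac{1}{|S|^2}\sum_{y,z\in S} k(y,z)$; the function names are hypothetical, and the code is a plain evaluation of the objective, not a coreset construction.

```python
def kernel_cluster_cost(cluster, kernel):
    """Sum over x in cluster of ||phi(x) - mean of phi over the cluster||^2,
    computed from kernel evaluations only."""
    n = len(cluster)
    pair_sum = sum(kernel(y, z) for y in cluster for z in cluster)
    cost = 0.0
    for x in cluster:
        cross = sum(kernel(x, y) for y in cluster)
        cost += kernel(x, x) - 2.0 * cross / n + pair_sum / (n * n)
    return cost


def kernel_k_means_cost(clusters, kernel):
    """Total kernel k-Means cost of a partition into clusters (empty clusters are skipped)."""
    return sum(kernel_cluster_cost(c, kernel) for c in clusters if c)
```

With the linear kernel `lambda a, b: sum(u * v for u, v in zip(a, b))` this reduces to the ordinary $k$-Means cost. The quadratic number of kernel evaluations per cluster is what makes a small proxy dataset attractive.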
-
Coresets for Clustering with Missing Values
Authors:
Vladimir Braverman,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Xuan Wu
Abstract:
We provide the first coreset for clustering points in $\mathbb{R}^d$ that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like $k$-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an $ε$-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that $(1+ε)$-approximates the clustering objective for every possible center set.
Our coresets for $k$-Means and $k$-Median clustering have size $(jk)^{O(\min(j,k))} (ε^{-1} d \log n)^2$, where $n$ is the number of data points, $d$ is the dimension and $j$ is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for $k$-Means with missing values [Eiben et al., SODA 2021] to near-linear time.
We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyd's-style heuristic for $k$-Means with missing values.
Submitted 11 November, 2021; v1 submitted 30 June, 2021;
originally announced June 2021.
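The objective described in this abstract, where each point is compared to a center only on its available coordinates, can be written as a short sketch. Representing a missing coordinate by `None` is an assumption made here for illustration, and the function names are hypothetical.

```python
def dist_sq_available(point, center):
    """Squared distance using only the coordinates of `point` that are present (not None)."""
    return sum((x - c) ** 2 for x, c in zip(point, center) if x is not None)


def k_means_cost_missing(points, centers):
    """k-Means objective where each point is compared on its own available coordinates."""
    return sum(min(dist_sq_available(p, c) for c in centers) for p in points)
```

Because the set of available coordinates varies across points, two points with disjoint available coordinates effectively measure distance in different subspaces, which is why standard coreset arguments do not apply directly.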
-
Streaming Algorithms for Geometric Steiner Forest
Authors:
Artur Czumaj,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Pavel Veselý
Abstract:
We consider an important generalization of the Steiner tree problem, the \emph{Steiner forest problem}, in the Euclidean plane: the input is a multiset $X \subseteq \mathbb{R}^2$, partitioned into $k$ color classes $C_1, C_2, \ldots, C_k \subseteq X$. The goal is to find a minimum-cost Euclidean graph $G$ such that every color class $C_i$ is connected in $G$. We study this Steiner forest problem in the streaming setting, where the stream consists of insertions and deletions of points to $X$. Each input point $x\in X$ arrives with its color $\textsf{color}(x) \in [k]$, and as usual for dynamic geometric streams, the input points are restricted to the discrete grid $\{0, \ldots, Δ\}^2$.
We design a single-pass streaming algorithm that uses $\mathrm{poly}(k \cdot \logΔ)$ space and time, and estimates the cost of an optimal Steiner forest solution within ratio arbitrarily close to the famous Euclidean Steiner ratio $α_2$ (currently $1.1547 \le α_2 \le 1.214$). This approximation guarantee matches the state-of-the-art bound for streaming Steiner tree, i.e., when $k=1$, and it is a major open question to improve the ratio to $1 + ε$ even for this special case. Our approach relies on a novel combination of streaming techniques, like sampling and linear sketching, with the classical Arora-style dynamic-programming framework for geometric optimization problems, which usually requires large memory and has so far not been applied in the streaming setting.
We complement our streaming algorithm for the Steiner forest problem with simple arguments showing that any finite approximation requires $Ω(k)$ bits of space.
Submitted 10 May, 2024; v1 submitted 9 November, 2020;
originally announced November 2020.
-
Coresets for Clustering in Excluded-minor Graphs and Beyond
Authors:
Vladimir Braverman,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Xuan Wu
Abstract:
Coresets are modern data-reduction tools that are widely used in data analysis to improve efficiency in terms of running time, space and communication complexity. Our main result is a fast algorithm to construct a small coreset for k-Median in (the shortest-path metric of) an excluded-minor graph. Specifically, we give the first coreset of size that depends only on $k$, $ε$ and the excluded-minor size, and our running time is quasi-linear (in the size of the input graph).
The main innovation in our new algorithm is that it is iterative; it first reduces the $n$ input points to roughly $O(\log n)$ reweighted points, then to $O(\log\log n)$, and so forth until the size is independent of $n$. Each step in this iterative size reduction is based on the importance sampling framework of Feldman and Langberg (STOC 2011), with a crucial adaptation that reduces the number of \emph{distinct points}, by employing a terminal embedding (where low distortion is guaranteed only for the distance from every terminal to all other points). Our terminal embedding is technically involved and relies on shortest-path separators, a standard tool in planar and excluded-minor graphs.
Furthermore, our new algorithm is applicable also in Euclidean metrics, by simply using a recent terminal embedding result of Narayanan and Nelson (STOC 2019), which extends the Johnson-Lindenstrauss Lemma. We thus obtain an efficient coreset construction in high-dimensional Euclidean spaces, thereby matching and simplifying state-of-the-art results (Sohler and Woodruff, FOCS 2018; Huang and Vishnoi, STOC 2020).
In addition, we also employ terminal embedding with additive distortion to obtain small coresets in graphs with bounded highway dimension, and use applications of our coresets to obtain improved approximation schemes, e.g., an improved PTAS for planar k-Median via a new centroid set.
Submitted 15 July, 2020; v1 submitted 16 April, 2020;
originally announced April 2020.
-
Coresets for Clustering in Graphs of Bounded Treewidth
Authors:
Daniel Baker,
Vladimir Braverman,
Lingxiao Huang,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Xuan Wu
Abstract:
We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems are essential to data analysis and are used, for example, in road networks and data visualization. A coreset is a compact summary of the data that approximately preserves the clustering objective for every possible center set, and it offers significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. Our main result is a near-linear time construction of a coreset for k-Median in a general graph $G$, with size $O_{ε, k}(\mathrm{tw}(G))$ where $\mathrm{tw}(G)$ is the treewidth of $G$, and we complement the construction with a nearly-tight size lower bound. The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(\mathrm{tw}(G))$ on the shattering dimension under any point weights. We validate our coreset on real-world road networks, and our scalable algorithm constructs tiny coresets with high accuracy, which translates to a massive speedup of existing approximation algorithms such as local search for graph k-Median.
Submitted 12 December, 2022; v1 submitted 10 July, 2019;
originally announced July 2019.
-
Coresets for Clustering with Fairness Constraints
Authors:
Lingxiao Huang,
Shaofeng H. -C. Jiang,
Nisheeth K. Vishnoi
Abstract:
In a recent work, [19] studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused either on extending this setting to the case where each data point has multiple, non-disjoint sensitive types such as race and gender [6], or on addressing the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve multiple, non-disjoint types, that is also scalable. Our approach is based on novel constructions of coresets: for the $k$-median objective, we construct an $\varepsilon$-coreset of size $O(Γk^2 \varepsilon^{-d})$ where $Γ$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective, we show how to construct an $\varepsilon$-coreset of size $O(Γk^3\varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter result removes the dependence on the size of the full dataset as in [39] and generalizes it to multiple, non-disjoint types. Plugging our coresets into existing algorithms for fair clustering such as [5] results in the fastest algorithms for several cases. Empirically, we assess our approach over the \textbf{Adult}, \textbf{Bank}, \textbf{Diabetes} and \textbf{Athlete} datasets, and show that the coreset sizes are much smaller than the full dataset. We also achieve a speed-up of recent fair clustering algorithms [5,6] by incorporating our coreset construction.
Submitted 17 December, 2019; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Coresets for Ordered Weighted Clustering
Authors:
Vladimir Braverman,
Shaofeng H. -C. Jiang,
Robert Krauthgamer,
Xuan Wu
Abstract:
We design coresets for Ordered k-Median, a generalization of classical clustering problems such as k-Median and k-Center, that offers a more flexible data analysis, like easily combining multiple objectives (e.g., to increase fairness or for Pareto optimization). Its objective function is defined via the Ordered Weighted Averaging (OWA) paradigm of Yager (1988), where data points are weighted according to a predefined weight vector, but in order of their contribution to the objective (distance from the centers).
A powerful data-reduction technique, called a coreset, is to summarize a point set $X$ in $\mathbb{R}^d$ into a small (weighted) point set $X'$, such that for every set of $k$ potential centers, the objective value of the coreset $X'$ approximates that of $X$ within factor $1\pm ε$. When there are multiple objectives (weights), the above standard coreset might have limited usefulness, whereas in a \emph{simultaneous} coreset, which was introduced recently by Bachem, Lucic and Lattanzi (2018), the above approximation holds for all weights (in addition to all centers). Our main result is a construction of a simultaneous coreset of size $O_{ε, d}(k^2 \log^2 |X|)$ for Ordered k-Median.
To validate the efficacy of our coreset construction we ran experiments on a real geographical data set. We find that our algorithm produces a small coreset, which translates to a massive speedup of clustering computations, while maintaining high accuracy for a range of weights.
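As an illustration of the ordered objective described in this abstract, the following minimal Python sketch evaluates an Ordered k-Median cost for a fixed center set. It assumes the common convention that distances are sorted in non-increasing order before being paired with the weight vector; the function name `ordered_k_median_cost` and the toy data are hypothetical and not taken from the paper.

```python
import math

def ordered_k_median_cost(points, centers, weights):
    """Ordered weighted cost for a fixed center set.

    Assumed convention: each point's distance to its nearest center is
    computed, the distances are sorted in non-increasing order, and the
    i-th largest distance is multiplied by weights[i].
    """
    if len(weights) != len(points):
        raise ValueError("need exactly one weight per data point")
    dists = sorted(
        (min(math.dist(p, c) for c in centers) for p in points),
        reverse=True,
    )
    return sum(w * d for w, d in zip(weights, dists))

# Toy data (hypothetical): distances to the single center are 0, 5 and 10.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
centers = [(0.0, 0.0)]

# All-ones weights recover the k-Median objective (sum of distances).
k_median_cost = ordered_k_median_cost(points, centers, [1, 1, 1])
# Weight only on the largest distance recovers the k-Center objective.
k_center_cost = ordered_k_median_cost(points, centers, [1, 0, 0])
```

Under this convention, the two special weight vectors show how the same function covers both k-Median and k-Center, which is the sense in which the ordered objective generalizes them.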
Submitted 11 March, 2019;
originally announced March 2019.
-
$\varepsilon$-Coresets for Clustering (with Outliers) in Doubling Metrics
Authors:
Lingxiao Huang,
Shaofeng H. -C. Jiang,
Jian Li,
Xuan Wu
Abstract:
We study the problem of constructing $\varepsilon$-coresets for the $(k, z)$-clustering problem in a doubling metric $M(X, d)$. An $\varepsilon$-coreset is a weighted subset $S\subseteq X$ with weight function $w : S \rightarrow \mathbb{R}_{\geq 0}$, such that for any $k$-subset $C \in [X]^k$, it holds that $\sum_{x \in S}{w(x) \cdot d^z(x, C)} \in (1 \pm \varepsilon) \cdot \sum_{x \in X}{d^z(x, C)}$.
We present an efficient algorithm that constructs an $\varepsilon$-coreset for the $(k, z)$-clustering problem in $M(X, d)$, where the size of the coreset only depends on the parameters $k, z, \varepsilon$ and the doubling dimension $\mathsf{ddim}(M)$. To the best of our knowledge, this is the first efficient $\varepsilon$-coreset construction of size independent of $|X|$ for general clustering problems in doubling metrics.
To this end, we establish the first relation between the doubling dimension of $M(X, d)$ and the shattering dimension (or VC-dimension) of the range space induced by the distance $d$. Such a relation was not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small $(1\pm ε)$-distortion of the distance function $d$, and consider the notion of $τ$-error probabilistic shattering dimension, we can prove an upper bound of $O( \mathsf{ddim}(M)\cdot \log(1/\varepsilon) +\log\log{\frac{1}{τ}} )$ for the probabilistic shattering dimension, even for weighted doubling metrics. We believe this new relation is of independent interest and may find other applications.
We also study robust coresets and centroid sets in doubling metrics. Our robust coreset construction leads to new results in clustering and property testing, and the centroid sets can be used to accelerate local search algorithms for clustering problems.
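The $\varepsilon$-coreset condition stated in this abstract can be checked directly for any single center set. The sketch below is a minimal illustration in Python, assuming points in Euclidean space (a special case, chosen only for concreteness) and hypothetical helper names `clustering_cost` and `satisfies_coreset_bound`. It tests one center set $C$, whereas the definition requires the bound to hold for every $k$-subset.

```python
import math

def clustering_cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum of w(x) * d(x, C)^z."""
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )

def satisfies_coreset_bound(X, S, w, centers, z, eps):
    """Check the (1 +/- eps) bound for ONE center set only.

    A true eps-coreset must satisfy this for every k-subset of centers;
    this helper merely evaluates the inequality for the given one.
    """
    full = clustering_cost(X, [1.0] * len(X), centers, z)
    core = clustering_cost(S, w, centers, z)
    return (1 - eps) * full <= core <= (1 + eps) * full

# Toy data (hypothetical): two locations, each occurring twice.
X = [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
S = [(0.0, 0.0), (2.0, 0.0)]
C = [(1.0, 0.0)]

good = satisfies_coreset_bound(X, S, [2.0, 2.0], C, z=2, eps=0.1)
bad = satisfies_coreset_bound(X, S, [1.0, 1.0], C, z=2, eps=0.1)
```

With weights equal to the multiplicities, the weighted subset reproduces the full cost exactly; halving the weights halves the cost and violates the bound for small `eps`.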
Submitted 18 August, 2018; v1 submitted 7 April, 2018;
originally announced April 2018.
-
A Unified PTAS for Prize Collecting TSP and Steiner Tree Problem in Doubling Metrics
Authors:
T-H. Hubert Chan,
Haotian Jiang,
Shaofeng H. -C. Jiang
Abstract:
We present a unified polynomial-time approximation scheme (PTAS) for the prize collecting traveling salesman problem (PCTSP) and the prize collecting Steiner tree problem (PCSTP) in doubling metrics. Given a metric space and a penalty function on a subset of points known as terminals, a solution is a subgraph on points in the metric space, whose cost is the weight of its edges plus the penalty due to terminals not covered by the subgraph. Under our unified framework, the solution subgraph needs to be Eulerian for PCTSP, while it needs to be connected for PCSTP. Before our work, not even a QPTAS was known for these problems in doubling metrics.
Our unified PTAS is based on the previous dynamic programming frameworks proposed in [Talwar STOC 2004] and [Bartal, Gottlieb, Krauthgamer STOC 2012]. However, since it is unknown which part of the optimal cost is due to edge lengths and which part is due to penalties of uncovered terminals, we need to develop new techniques to apply previous divide-and-conquer strategies and sparse instance decompositions.
Submitted 20 June, 2018; v1 submitted 21 October, 2017;
originally announced October 2017.
-
Online Submodular Maximization Problem with Vector Packing Constraint
Authors:
T-H. Hubert Chan,
Shaofeng H. -C. Jiang,
Zhihao Gavin Tang,
Xiaowei Wu
Abstract:
We consider the online vector packing problem in which we have a $d$ dimensional knapsack and items $u$ with weight vectors $\mathbf{w}_u \in \mathbb{R}_+^d$ arrive online in an arbitrary order. Upon the arrival of an item, the algorithm must decide immediately whether to discard or accept the item into the knapsack. When item $u$ is accepted, $\mathbf{w}_u(i)$ units of capacity on dimension $i$ will be taken up, for each $i\in[d]$. To satisfy the knapsack constraint, an accepted item can be later disposed of with no cost, but discarded or disposed of items cannot be recovered. The objective is to maximize the utility of the accepted items $S$ at the end of the algorithm, which is given by $f(S)$ for some non-negative monotone submodular function $f$.
For any small constant $ε> 0$, we consider the special case that the weight of an item on every dimension is at most a $(1-ε)$ fraction of the total capacity, and give a polynomial-time deterministic $O(\frac{k}{ε^2})$-competitive algorithm for the problem, where $k$ is the (column) sparsity of the weight vectors. We also show several (almost) tight hardness results even when the algorithm is computationally unbounded. We show that under the $ε$-slack assumption, no deterministic algorithm can obtain any $o(k)$ competitive ratio, and no randomized algorithm can obtain any $o(\frac{k}{\log k})$ competitive ratio. For the general case (when $ε= 0$), no randomized algorithm can obtain any $o(k)$ competitive ratio.
In contrast to the $(1+δ)$ competitive ratio achieved in Kesselheim et al. (STOC 2014) for the problem with random arrival order of items and under a large-capacity assumption, we show that in the arbitrary arrival order case, even when $\| \mathbf{w}_u \|_\infty$ is arbitrarily small for all items $u$, it is impossible to achieve any $o(\frac{\log k}{\log\log k})$ competitive ratio.
Submitted 21 June, 2017;
originally announced June 2017.
-
Online Submodular Maximization with Free Disposal: Randomization Beats 0.25 for Partition Matroids
Authors:
T-H. Hubert Chan,
Zhiyi Huang,
Shaofeng H. -C. Jiang,
Ning Kang,
Zhihao Gavin Tang
Abstract:
We study the online submodular maximization problem with free disposal under a matroid constraint. Elements from some ground set arrive one by one in rounds, and the algorithm maintains a feasible set that is independent in the underlying matroid. In each round when a new element arrives, the algorithm may accept the new element into its feasible set and possibly remove elements from it, provided that the resulting set is still independent. The goal is to maximize the value of the final feasible set under some monotone submodular function, to which the algorithm has oracle access.
For $k$-uniform matroids, we give a deterministic algorithm with competitive ratio at least $0.2959$, and the ratio approaches $\frac{1}{α_\infty} \approx 0.3178$ as $k$ approaches infinity, improving the previous best ratio of $0.25$ by Chakrabarti and Kale (IPCO 2014), Buchbinder et al. (SODA 2015) and Chekuri et al. (ICALP 2015). We also show that our algorithm is optimal among a class of deterministic monotone algorithms that accept a new arriving element only if the objective is strictly increased.
Further, we prove that no deterministic monotone algorithm can be strictly better than $0.25$-competitive even for partition matroids, the most modest generalization of $k$-uniform matroids, matching the competitive ratio by Chakrabarti and Kale (IPCO 2014) and Chekuri et al. (ICALP 2015). Interestingly, we show that randomized algorithms are strictly more powerful by giving a (non-monotone) randomized algorithm for partition matroids with ratio $\frac{1}{α_\infty} \approx 0.3178$.
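To make the free-disposal model concrete, here is a minimal Python sketch of a simple swap rule for a $k$-uniform matroid. It is only an illustrative heuristic in the spirit of algorithms that accept an arriving element when the objective strictly increases; it is not the algorithm analyzed in the paper and carries no claimed competitive ratio. The names `online_swap_greedy` and `coverage`, and the toy stream, are hypothetical.

```python
def online_swap_greedy(stream, k, f):
    """Maintain a set of at most k elements under free disposal.

    Illustrative rule (an assumption, not the paper's algorithm):
    - while fewer than k elements are held, accept an arrival if it
      strictly increases f;
    - otherwise, swap the arrival for the held element whose removal
      gives the largest strict increase in f, if any swap helps.
    Disposed elements are never recovered.
    """
    S = []
    for e in stream:
        if len(S) < k:
            if f(S + [e]) > f(S):
                S.append(e)
            continue
        best_val, best_idx = f(S), None
        for i in range(len(S)):
            candidate = S[:i] + S[i + 1:] + [e]
            val = f(candidate)
            if val > best_val:
                best_val, best_idx = val, i
        if best_idx is not None:
            S = S[:best_idx] + S[best_idx + 1:] + [e]
    return S

def coverage(sets):
    """Monotone submodular example: size of the union of the chosen sets."""
    return len(set().union(*sets))

# Toy stream (hypothetical): each element is a set of covered items.
stream = [{1}, {2}, {1, 2, 3}, {4, 5}]
chosen = online_swap_greedy(stream, k=2, f=coverage)
final_value = coverage(chosen)
```

On this stream the rule first fills both slots, then twice disposes of a held element in favor of a more valuable arrival, which is exactly the kind of move that free disposal permits.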
Submitted 25 October, 2016;
originally announced October 2016.
-
A PTAS for the Steiner Forest Problem in Doubling Metrics
Authors:
T-H. Hubert Chan,
Shuguang Hu,
Shaofeng H. -C. Jiang
Abstract:
We achieve a (randomized) polynomial-time approximation scheme (PTAS) for the Steiner Forest Problem in doubling metrics. Before our work, a PTAS was known only for the Euclidean plane in [FOCS 2008: Borradaile, Klein and Mathieu]. Our PTAS also shares similarities with the dynamic programming for sparse instances used in [STOC 2012: Bartal, Gottlieb and Krauthgamer] and [SODA 2016: Chan and Jiang]. However, extending previous approaches requires overcoming several non-trivial hurdles, and we make the following technical contributions.
(1) We prove a technical lemma showing that Steiner points have to be "near" the terminals in an optimal Steiner tree. This enables us to define a heuristic to estimate the local behavior of the optimal solution, even though the Steiner points are unknown in advance. This lemma also generalizes previous results in the Euclidean plane, and may be of independent interest for related problems involving Steiner points.
(2) We develop a novel algorithmic technique known as "adaptive cells" to overcome the difficulty of keeping track of multiple components in a solution. Our idea is based on but significantly different from the previously proposed "uniform cells" in the FOCS 2008 paper, whose techniques cannot be readily applied to doubling metrics.
Submitted 22 August, 2016;
originally announced August 2016.