Search | arXiv e-print repository

Dynamic Correlation Clustering in Sublinear Update Time

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, Nikos Parotsidis

Abstract: We study the classic problem of correlation clustering in dynamic node streams. In this setting, nodes are either added or randomly deleted over time, and each node pair is connected by a positive or negative edge. The objective is to continuously find a partition which minimizes the sum of positive edges crossing clusters and negative edges within clusters. We present an algorithm that maintains an $O(1)$-approximation with $O(\mathrm{polylog}\, n)$ amortized update time. Prior to our work, Behnezhad, Charikar, Ma, and L. Tan achieved a $5$-approximation with $O(1)$ expected update time in edge streams, which translates in node streams to an $O(D)$ update time, where $D$ is the maximum possible degree. Finally, we complement our theoretical analysis with experiments on real-world data.

Submitted 13 June, 2024; originally announced June 2024.

Comments: ICML'24 (spotlight)
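
The objective stated in the abstract above (positive edges crossing clusters plus negative edges within clusters) can be evaluated directly on a small instance. The following is a minimal sketch, not the authors' dynamic algorithm; the function name `correlation_clustering_cost` and the input encoding (a list of positive pairs, with every other pair treated as negative) are assumptions made for illustration.

```python
from itertools import combinations


def correlation_clustering_cost(nodes, positive_edges, cluster_of):
    """Count disagreements of a clustering on a complete signed graph.

    Assumption: every pair not listed in positive_edges is a negative edge.
    A pair disagrees if it is positive and split across clusters, or
    negative and placed inside the same cluster.
    """
    positive = {frozenset(edge) for edge in positive_edges}
    cost = 0
    for u, v in combinations(nodes, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_positive = frozenset((u, v)) in positive
        if is_positive != same_cluster:
            cost += 1
    return cost


nodes = ["a", "b", "c", "d"]
positive_edges = [("a", "b"), ("b", "c"), ("c", "d")]  # a path of positive edges
two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
print(correlation_clustering_cost(nodes, positive_edges, two_clusters))
```

On this path instance the split {a, b}, {c, d} pays only for the positive edge (b, c), whereas putting all four nodes together pays for the three negative pairs.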

arXiv:2406.04868 [pdf, ps, other]

Perturb-and-Project: Differentially Private Similarities and Marginals

Authors: Vincent Cohen-Addad, Tommaso d'Orsi, Alessandro Epasto, Vahab Mirrokni, Peilin Zhong

Abstract: We revisit the input perturbations framework for differential privacy where noise is added to the input $A\in \mathcal{S}$ and the result is then projected back to the space of admissible datasets $\mathcal{S}$. Through this framework, we first design novel efficient algorithms to privately release pair-wise cosine similarities. Second, we derive a novel algorithm to compute $k$-way marginal queries over $n$ features. Prior work could achieve comparable guarantees only for $k$ even. Furthermore, we extend our results to $t$-sparse datasets, where our efficient algorithms yield novel, stronger guarantees whenever $t\le n^{5/6}/\log n$. Finally, we provide a theoretical perspective on why \textit{fast} input perturbation algorithms work well in practice. The key technical ingredients behind our results are tight sum-of-squares certificates upper bounding the Gaussian complexity of sets of solutions.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 21 pages, ICML 2024

ACM Class: F.2; G.3

arXiv:2406.04860 [pdf, other]

Multi-View Stochastic Block Models

Authors: Vincent Cohen-Addad, Tommaso d'Orsi, Silvio Lattanzi, Rajai Nasser

Abstract: Graph clustering is a central topic in unsupervised learning with a multitude of practical applications. In recent years, multi-view graph clustering has gained a lot of attention for its applicability to real-world instances where one has access to multiple data sources. In this paper we formalize a new family of models, called \textit{multi-view stochastic block models} that captures this setting. For this model, we first study efficient algorithms that naively work on the union of multiple graphs. Then, we introduce a new efficient algorithm that provably outperforms previous approaches by analyzing the structure of each graph separately. Furthermore, we complement our results with an information-theoretic lower bound studying the limits of what can be done in this model. Finally, we corroborate our results with experimental evaluations.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 31 pages, ICML 2024

ACM Class: F.2; G.3

arXiv:2406.04857 [pdf, ps, other]

A Near-Linear Time Approximation Algorithm for Beyond-Worst-Case Graph Clustering

Authors: Vincent Cohen-Addad, Tommaso d'Orsi, Aida Mousavifar

Abstract: We consider the semi-random graph model of [Makarychev, Makarychev and Vijayaraghavan, STOC'12], where, given a random bipartite graph with $α$ edges and an unknown bipartition $(A, B)$ of the vertex set, an adversary can add arbitrary edges inside each community and remove arbitrary edges from the cut $(A, B)$ (i.e. all adversarial changes are \textit{monotone} with respect to the bipartition). For this model, a polynomial time algorithm is known to approximate the Balanced Cut problem up to value $O(α)$ [MMV'12] as long as the cut $(A, B)$ has size $Ω(α)$. However, it consists of slow subroutines requiring optimal solutions for logarithmically many semidefinite programs. We study the fine-grained complexity of the problem and present the first near-linear time algorithm that achieves similar performance to that of [MMV'12]. Our algorithm runs in time $O(|V(G)|^{1+o(1)} + |E(G)|^{1+o(1)})$ and finds a balanced cut of value $O(α)$. Our approach appears easily extendible to related problems, such as Sparsest Cut, and also yields a near-linear time $O(1)$-approximation to Dasgupta's objective function for hierarchical clustering [Dasgupta, STOC'16] for the semi-random hierarchical stochastic block model inputs of [Cohen-Addad, Kanade, Mallmann-Trenn, Mathieu, JACM'19].

Submitted 7 June, 2024; originally announced June 2024.

Comments: 24 pages, ICML 2024

ACM Class: F.2; G.3

arXiv:2405.01339 [pdf, other]

Sensitivity Sampling for $k$-Means: Worst Case and Stability Optimal Coreset Bounds

Authors: Nikhil Bansal, Vincent Cohen-Addad, Milind Prabhu, David Saulpic, Chris Schwiegelshohn

Abstract: Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $Ω$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size $\tilde{O}(k/\varepsilon^2\min(\sqrt{k},\varepsilon^{-2}))$ for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with $Ω(1)$ cost stability, Sensitivity Sampling gives coresets of size $\tilde{O}(k/\varepsilon^2)$, improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: It is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that any coreset for stable instances consisting of only input points must have size $Ω(k/\varepsilon^2)$. Our results for Sensitivity Sampling also extend to the $k$-median problem, and more general metric spaces.

Submitted 2 May, 2024; originally announced May 2024.

Comments: 57 pages
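
The $k$-means cost $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$ from the abstract above, and its weighted counterpart on a summary, can be sketched as follows. This only evaluates costs; it is not Sensitivity Sampling, and the hand-picked weighted summary is a hypothetical example that happens to match the full cost for one particular $S$. A genuine coreset must preserve the cost for every candidate solution.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers, weights=None):
    """Sum over points of weight * squared distance to the nearest center.

    With weights=None every point has weight 1, which gives the
    unweighted cost used in the abstract.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(squared_distance(p, s) for s in centers)
        for p, w in zip(points, weights)
    )


P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
S = [(0.0, 1.0), (10.0, 1.0)]

# Hypothetical weighted summary: two input points, each standing in for two.
summary_points = [(0.0, 0.0), (10.0, 2.0)]
summary_weights = [2.0, 2.0]

full_cost = kmeans_cost(P, S)
summary_cost = kmeans_cost(summary_points, S, summary_weights)
print(full_cost, summary_cost)
```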

arXiv:2404.17509 [pdf, ps, other]

doi 10.1145/3618260.3649749

Understanding the Cluster LP for Correlation Clustering

Authors: Nairen Cao, Vincent Cohen-Addad, Euiwoong Lee, Shi Li, Alantha Newman, Lukas Vogl

Abstract: In the classic Correlation Clustering problem introduced by Bansal, Blum, and Chawla (FOCS 2002), the input is a complete graph where edges are labeled either $+$ or $-$, and the goal is to find a partition of the vertices that minimizes the sum of the $+$edges across parts plus the sum of the $-$edges within parts. In recent years, Chawla, Makarychev, Schramm and Yaroslavtsev (STOC 2015) gave a 2.06-approximation by providing a near-optimal rounding of the standard LP, and Cohen-Addad, Lee, Li, and Newman (FOCS 2022, 2023) finally bypassed the integrality gap of 2 for this LP giving a $1.73$-approximation for the problem. In order to create a simple and unified framework for Correlation Clustering similar to those for {\em typical} approximate optimization tasks, we propose the {\em cluster LP} as a strong linear program that might tightly capture the approximability of Correlation Clustering. It unifies all the previous relaxations for the problem. We demonstrate the power of the cluster LP by presenting a simple rounding algorithm, and providing two analyses, one analytically proving a 1.49-approximation and the other solving a factor-revealing SDP to show a 1.437-approximation. Both proofs introduce principled methods by which to analyze the performance of the algorithm, resulting in a significantly improved approximation guarantee. Finally, we prove an integrality gap of $4/3$ for the cluster LP, showing our 1.437-upper bound cannot be drastically improved.
Our gap instance directly inspires an improved NP-hardness of approximation with a ratio $24/23 \approx 1.042$; no explicit hardness ratio was known before.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.06797 [pdf, other]

Fully Dynamic Correlation Clustering: Breaking 3-Approximation

Authors: Soheil Behnezhad, Moses Charikar, Vincent Cohen-Addad, Alma Ghafari, Weiyun Ma

Abstract: We study the classic correlation clustering in the dynamic setting. Given $n$ objects and a complete labeling of the object-pairs as either similar or dissimilar, the goal is to partition the objects into arbitrarily many clusters while minimizing disagreements with the labels. In the dynamic setting, an update consists of a flip of a label of an edge. In a breakthrough result, [BDHSS, FOCS'19] showed how to maintain a 3-approximation with polylogarithmic update time by providing a dynamic implementation of the Pivot algorithm of [ACN, STOC'05]. Since then, it has been a major open problem to determine whether the 3-approximation barrier can be broken in the fully dynamic setting. In this paper, we resolve this problem. Our algorithm, Modified Pivot, locally improves the output of Pivot by moving some vertices to other existing clusters or new singleton clusters. We present an analysis showing that this modification does indeed improve the approximation to below 3. We also show that its output can be maintained in polylogarithmic time per update.

Submitted 11 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.
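
For orientation, the classic static Pivot algorithm that the abstract above builds on can be sketched in a few lines. This is the textbook offline version (visit vertices in random order; each still-unclustered vertex becomes a pivot and absorbs its unclustered similar neighbors). It is neither the dynamic implementation nor the Modified Pivot algorithm of the paper, and the adjacency-dictionary input `positive_neighbors` is an assumed encoding.

```python
import random


def pivot(nodes, positive_neighbors, rng):
    """Static Pivot: returns a dict mapping each node to its pivot (cluster id).

    positive_neighbors[u] is the set of nodes labeled similar to u;
    all other pairs are treated as dissimilar.
    """
    order = list(nodes)
    rng.shuffle(order)  # random order of pivot candidates
    cluster_of = {}
    for candidate in order:
        if candidate in cluster_of:
            continue  # already absorbed by an earlier pivot
        cluster_of[candidate] = candidate
        for neighbor in positive_neighbors.get(candidate, ()):
            if neighbor not in cluster_of:
                cluster_of[neighbor] = candidate
    return cluster_of


# Two disjoint groups of mutually similar objects.
similar = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
clusters = pivot(range(5), similar, random.Random(0))
print(clusters)
```

When the similar pairs form disjoint cliques, as here, every random order recovers the cliques exactly; the 3-approximation guarantee concerns the expected cost on general inputs.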

arXiv:2404.05433 [pdf, ps, other]

Combinatorial Correlation Clustering

Authors: Vincent Cohen-Addad, David Rasmussen Lolck, Marcin Pilipczuk, Mikkel Thorup, Shuyi Yan, Hanwen Zhang

Abstract: Correlation Clustering is a classic clustering objective arising in numerous machine learning and data mining applications. Given a graph $G=(V,E)$, the goal is to partition the vertex set into clusters so as to minimize the number of edges between clusters plus the number of edges missing within clusters. The problem is APX-hard and the best known polynomial time approximation factor is 1.73 by Cohen-Addad, Lee, Li, and Newman [FOCS'23]. They use an LP with $|V|^{1/ε^{Θ(1)}}$ variables for some small $ε$. However, due to the practical relevance of correlation clustering, there has also been great interest in getting more efficient sequential and parallel algorithms. The classic combinatorial pivot algorithm of Ailon, Charikar and Newman [JACM'08] provides a 3-approximation in linear time. Like most other algorithms discussed here, this uses randomization. Recently, Behnezhad, Charikar, Ma and Tan [FOCS'22] presented a $3+ε$-approximate solution for solving the problem in a constant number of rounds in the Massively Parallel Computation (MPC) setting. Very recently, Cao, Huang, Su [SODA'24] provided a 2.4-approximation in a polylogarithmic number of rounds in the MPC model and in $\tilde{O}(|E|^{1.5})$ time in the classic sequential setting. They asked whether it is possible to get a better than 3-approximation in near-linear time. We resolve this problem with an efficient combinatorial algorithm providing a drastically better approximation factor.
It achieves a $\sim 2-2/13 < 1.847$-approximation in sub-linear ($\tilde O(|V|)$) sequential time or in sub-linear ($\tilde O(|V|)$) space in the streaming setting, and it uses only a constant number of rounds in the MPC model.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at STOC 2024

arXiv:2402.18263 [pdf, ps, other]

Max-Cut with $ε$-Accurate Predictions

Authors: Vincent Cohen-Addad, Tommaso d'Orsi, Anupam Gupta, Euiwoong Lee, Debmalya Panigrahi

Abstract: We study the approximability of the MaxCut problem in the presence of predictions. Specifically, we consider two models: in the noisy predictions model, for each vertex we are given its correct label in $\{-1,+1\}$ with some unknown probability $1/2 + ε$, and the other (incorrect) label otherwise. In the more-informative partial predictions model, for each vertex we are given its correct label with probability $ε$ and no label otherwise. We assume only pairwise independence between vertices in both models. We show how these predictions can be used to improve on the worst-case approximation ratios for this problem. Specifically, we give an algorithm that achieves an $α+ \widetilde{Ω}(ε^4)$-approximation for the noisy predictions model, where $α\approx 0.878$ is the MaxCut threshold. While this result also holds for the partial predictions model, we can also give a $β+ Ω(ε)$-approximation, where $β\approx 0.858$ is the approximation ratio for MaxBisection given by Raghavendra and Tan. This answers a question posed by Ola Svensson in his plenary session talk at SODA'23.

Submitted 28 February, 2024; originally announced February 2024.

Comments: 18 pages

ACM Class: F.0

arXiv:2402.17327 [pdf, other]

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon λΦ_k$, where $Φ_k$ represents the $k$-means cost for the input embeddings and $λ$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.06730 [pdf, other]

A Scalable Algorithm for Individually Fair K-means Clustering

Authors: MohammadHossein Bateni, Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi

Abstract: We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $δ(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $δ(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.

Submitted 9 February, 2024; originally announced February 2024.

Comments: 32 pages, 2 figures, to appear at the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024
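
The fairness radius $δ(x)$ defined in the abstract above, and the resulting fairness check, can be sketched as below. This is only a brute-force checker, not the paper's local-search algorithm. It assumes Euclidean points, counts $x$ itself among the points in its ball, and rounds $n/k$ up; the `slack` parameter is a hypothetical knob for relaxed checks such as the factor 6 in the bicriteria guarantee.

```python
import math


def fair_radius(points, x, k):
    """delta(x): radius of the smallest ball centered at x that contains
    at least ceil(n / k) of the points (x itself is counted)."""
    needed = math.ceil(len(points) / k)
    distances = sorted(math.dist(x, p) for p in points)
    return distances[needed - 1]


def is_individually_fair(points, centers, k, slack=1.0):
    """True if every point has some center within slack * delta(x)."""
    return all(
        min(math.dist(x, c) for c in centers) <= slack * fair_radius(points, x, k)
        for x in points
    )


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
print(fair_radius(points, (0.0,), k=2))
print(is_individually_fair(points, [(1.0,), (11.0,)], k=2))
print(is_individually_fair(points, [(1.0,), (2.0,)], k=2))
```

The brute-force radius computation costs $O(n \log n)$ per point, so this is suitable only for checking small instances.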

arXiv:2311.17840 [pdf, other]

A quasi-polynomial time algorithm for Multi-Dimensional Scaling via LP hierarchies

Authors: Ainesh Bakshi, Vincent Cohen-Addad, Samuel B. Hopkins, Rajesh Jayaram, Silvio Lattanzi

Abstract: Multi-dimensional Scaling (MDS) is a family of methods for embedding an $n$-point metric into low-dimensional Euclidean space. We study the Kamada-Kawai formulation of MDS: given a set of non-negative dissimilarities $\{d_{i,j}\}_{i , j \in [n]}$ over $n$ points, the goal is to find an embedding $\{x_1,\dots,x_n\} \in \mathbb{R}^k$ that minimizes \[\text{OPT} = \min_{x} \mathbb{E}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right] \] Kamada-Kawai provides a more relaxed measure of the quality of a low-dimensional metric embedding than the traditional bi-Lipschitz-ness measure studied in theoretical computer science; this is advantageous because strong hardness-of-approximation results are known for the latter, whereas Kamada-Kawai admits nontrivial approximation algorithms. Despite its popularity, our theoretical understanding of MDS is limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for Kamada-Kawai in the constant-$k$ regime, with cost $\text{OPT} +ε$ in $n^2 2^{\text{poly}(Δ/ε)}$ time, where $Δ$ is the aspect ratio of the input. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $Δ$: we achieve a solution with cost $\tilde{O}(\log Δ)\text{OPT}^{Ω(1)}+ε$ in time $n^{O(1)}2^{\text{poly}(\log(Δ)/ε)}$. Our approach is based on a novel analysis of a conditioning-based rounding scheme for the Sherali-Adams LP Hierarchy.
Crucially, our analysis exploits the geometry of low-dimensional Euclidean space, allowing us to avoid an exponential dependence on the aspect ratio. We believe our geometry-aware treatment of the Sherali-Adams Hierarchy is an important step towards developing general-purpose techniques for efficient metric optimization algorithms.

Submitted 11 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: Extended exposition
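
The Kamada-Kawai objective in the abstract above is straightforward to evaluate for a given embedding. The sketch below assumes the expectation is a uniform average over unordered pairs $i \neq j$ and that every such $d_{i,j}$ is strictly positive; both are assumptions, since the listing does not spell out how the diagonal or zero dissimilarities are treated. It evaluates a candidate embedding only and does not perform the LP-hierarchy rounding.

```python
import math
from itertools import combinations


def kamada_kawai_cost(embedding, d):
    """Average over pairs i < j of (1 - ||x_i - x_j|| / d[i][j]) ** 2.

    embedding: list of points (tuples) in R^k.
    d: symmetric matrix of positive dissimilarities (assumed d[i][j] > 0).
    """
    pairs = list(combinations(range(len(embedding)), 2))
    total = sum(
        (1.0 - math.dist(embedding[i], embedding[j]) / d[i][j]) ** 2
        for i, j in pairs
    )
    return total / len(pairs)


# Three points at mutual dissimilarity 1 (an equilateral triangle).
d = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
on_a_line = [(0.0,), (1.0,), (2.0,)]
in_the_plane = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

print(kamada_kawai_cost(on_a_line, d))
print(kamada_kawai_cost(in_the_plane, d))
```

The one-dimensional placement stretches one pair to distance 2 and pays for it, while the planar placement realizes all three dissimilarities up to rounding error.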

arXiv:2311.00892 [pdf, other]

A PTAS for $\ell_0$-Low Rank Approximation: Solving Dense CSPs over Reals

Authors: Vincent Cohen-Addad, Chenglin Fan, Suprovat Ghoshal, Euiwoong Lee, Arnaud de Mesmay, Alantha Newman, Tony Chang Wang

Abstract: We consider the Low Rank Approximation problem, where the input consists of a matrix $A \in \mathbb{R}^{n_R \times n_C}$ and an integer $k$, and the goal is to find a matrix $B$ of rank at most $k$ that minimizes $\| A - B \|_0$, which is the number of entries where $A$ and $B$ differ. For any constant $k$ and $\varepsilon > 0$, we present a polynomial time $(1 + \varepsilon)$-approximation scheme for this problem, which significantly improves the previous best $\mathrm{poly}(k)$-approximation. Our algorithm is obtained by viewing the problem as a Constraint Satisfaction Problem (CSP) where each row and column becomes a variable that can have a value from $\mathbb{R}^k$. In this view, we have a constraint between each row and column, which results in a {\em dense} CSP, a well-studied topic in approximation algorithms. While most previous algorithms focus on finite-size (or constant-size) domains and involve an exhaustive enumeration over the entire domain, we present a new framework that bypasses such an enumeration in $\mathbb{R}^k$. We also use tools from the rich literature of Low Rank Approximation in different objectives (e.g., $\ell_p$ with $p \in (0, \infty)$) or domains (e.g., finite fields/generalized Boolean). We believe that our techniques might be useful to study other real-valued CSPs and matrix optimization problems. On the hardness side, when $k$ is part of the input, we prove that Low Rank Approximation is NP-hard to approximate within a factor of $Ω(\log n)$.
This is the first superconstant NP-hardness of approximation for any $p \in [0, \infty]$ that does not rely on stronger conjectures (e.g., the Small Set Expansion Hypothesis).

Submitted 1 November, 2023; originally announced November 2023.

Comments: To appear in SODA 24
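
To make the $\ell_0$ objective from the abstract above concrete, the sketch below counts the entries where $A$ and a rank-1 candidate $B = uv^\top$ differ. It only evaluates the objective for a given $B$; the helper names `l0_distance` and `rank_one` are illustrative, and the exact equality test is reliable here only because the example uses integers.

```python
def l0_distance(A, B):
    """Number of entries where matrices A and B (lists of rows) differ."""
    return sum(
        a != b
        for row_a, row_b in zip(A, B)
        for a, b in zip(row_a, row_b)
    )


def rank_one(u, v):
    """Outer product u v^T as a list of rows; has rank at most 1."""
    return [[ui * vj for vj in v] for ui in u]


A = [
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 0],  # a single corrupted entry; the rank-1 pattern would have 9
]
B = rank_one([1, 2, 3], [1, 2, 3])
print(l0_distance(A, B))
```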

arXiv:2310.04076 [pdf, other]

Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

Abstract: In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension is $Ω(1)$ for both $k$-median and $k$-means, even when allowing a complexity FPT in the number of clusters $k$. This stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering, whose size bounds are close to the best-known randomized ones. We also construct a deterministic algorithm for computing a $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2/\varepsilon^{O(1)}} \mathrm{poly}(nd)$, close to the best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS '22] by a factor $k$.

Submitted 6 October, 2023; originally announced October 2023.

Comments: FOCS 2023. Abstract reduced for arxiv requirements

arXiv:2310.02882 [pdf, other]

Streaming Euclidean $k$-median and $k$-means with $o(\log n)$ Space

Authors: Vincent Cohen-Addad, David P. Woodruff, Samson Zhou

Abstract: We consider the classic Euclidean $k$-median and $k$-means objective on data streams, where the goal is to provide a $(1+\varepsilon)$-approximation to the optimal $k$-median or $k$-means solution, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, sensitivity sampling, and so on. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least $Ω(\log(nΔ))$ words of memory, where $n$ is the size of the input and $Δ$ is the aspect ratio. A natural question is if one can beat this logarithmic dependence on $n$ and $Δ$. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a $(1+\varepsilon)$-approximation to the more general $(k,z)$-clustering problem, using $\tilde{\mathcal{O}}\left(\frac{dk}{\varepsilon^2}\right)\cdot(2^{z\log z})\cdot\min\left(\frac{1}{\varepsilon^z},k\right)\cdot\text{poly}(\log\log(nΔ))$ words of memory. Our techniques can also be used to achieve two-pass algorithms for $k$-median and $k$-means clustering on dynamic streams using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^2}\right)\cdot\text{poly}(d,k,\log\log(nΔ))$ words of memory.

Submitted 4 October, 2023; originally announced October 2023.

Comments: To appear at FOCS 2023

arXiv:2309.17243 [pdf, other]

Handling Correlated Rounding Error via Preclustering: A 1.73-approximation for Correlation Clustering

Authors: Vincent Cohen-Addad, Euiwoong Lee, Shi Li, Alantha Newman

Abstract: We consider the classic Correlation Clustering problem: Given a complete graph where edges are labelled either $+$ or $-$, the goal is to find a partition of the vertices that minimizes the number of the $+$edges across parts plus the number of the $-$edges within parts. Recently, Cohen-Addad, Lee and Newman [CLN22] presented a 1.994-approximation algorithm for the problem using the Sherali-Adams hierarchy, hence breaking through the integrality gap of 2 for the classic linear program and improving upon the 2.06-approximation of Chawla, Makarychev, Schramm and Yaroslavtsev [CMSY15]. We significantly improve the state-of-the-art by providing a 1.73-approximation for the problem. Our approach introduces a preclustering of Correlation Clustering instances that allows us to essentially ignore the error arising from the {\em correlated rounding} used by [CLN22]. This additional power simplifies the previous algorithm and analysis. More importantly, it enables a new {\em set-based rounding} that complements the previous roundings. A combination of these two rounding algorithms yields the improved bound.

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.16384 [pdf, other]

Multi-Swap $k$-Means++

Authors: Lorenzo Beretta, Vincent Cohen-Addad, Silvio Lattanzi, Nikos Parotsidis

Abstract: The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, hence allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements, with significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.

Submitted 28 September, 2023; originally announced September 2023.

Comments: NeurIPS 2023
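
For readers unfamiliar with the $k$-means++ sampling distribution mentioned in the abstract above, the following is a minimal sketch of the standard $D^2$ seeding step of Arthur and Vassilvitskii. It does not implement the multi-swap local search of the paper; the function name `kmeanspp_seed` and the default random seed are assumptions made for illustration.

```python
import math
import random


def kmeanspp_seed(points, k, rng=None):
    """Pick up to k initial centers by D^2 sampling (k-means++ seeding)."""
    rng = rng or random.Random(0)  # fixed default seed: an assumption for reproducibility
    centers = [rng.choice(points)]  # first center is chosen uniformly at random
    while len(centers) < k:
        # Squared distance from each point to its closest chosen center.
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        if sum(d2) == 0:
            break  # every point already coincides with a center
        # Sample the next center with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


toy_points = [(0.0, 0.0), (10.0, 0.0)]
toy_centers = kmeanspp_seed(toy_points, 2, rng=random.Random(1))
```

With two distinct points and $k = 2$, the already-chosen point has weight zero, so the second draw must pick the other point.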

arXiv:2304.07268 [pdf, ps, other]

Planar and Minor-Free Metrics Embed into Metrics of Polylogarithmic Treewidth with Expected Multiplicative Distortion Arbitrarily Close to 1

Authors: Vincent Cohen-Addad, Hung Le, Marcin Pilipczuk, Michał Pilipczuk

Abstract: We prove that there is a randomized polynomial-time algorithm that given an edge-weighted graph $G$ excluding a fixed-minor $Q$ on $n$ vertices and an accuracy parameter $\varepsilon>0$, constructs an edge-weighted graph $H$ and an embedding $η\colon V(G)\to V(H)$ with the following properties: * For any constant size $Q$, the treewidth of $H$ is polynomial in $\varepsilon^{-1}$, $\log n$, and the logarithm of the stretch of the distance metric in $G$. * The expected multiplicative distortion is $(1+\varepsilon)$: for every pair of vertices $u,v$ of $G$, we have $\mathrm{dist}_H(η(u),η(v))\geq \mathrm{dist}_G(u,v)$ always and $\mathrm{Exp}[\mathrm{dist}_H(η(u),η(v))]\leq (1+\varepsilon)\mathrm{dist}_G(u,v)$. Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with $\mathcal{O}(1)$ expected distortion requires the host graph to have treewidth $Ω(\log n)$. It also provides a unified framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of problems including network design, clustering or routing problems, in minor-free metrics where the optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing problem, and capacitated clustering problems.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2302.00037 [pdf, other]

Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees

Authors: Jacob Imola, Alessandro Epasto, Mohammad Mahdian, Vincent Cohen-Addad, Vahab Mirrokni

Abstract: Hierarchical Clustering is a popular unsupervised machine learning method with decades of history and numerous applications. We initiate the study of differentially private approximation algorithms for hierarchical clustering under the rigorous framework introduced by (Dasgupta, 2016). We show strong lower bounds for the problem: that any $ε$-DP algorithm must exhibit $O(|V|^2/ ε)$-additive error for an input dataset $V$. Then, we exhibit a polynomial-time approximation algorithm with $O(|V|^{2.5}/ ε)$-additive error, and an exponential-time algorithm that meets the lower bound. To overcome the lower bound, we focus on the stochastic block model, a popular model of graphs, and, with a separation assumption on the blocks, propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly. Finally, we perform an empirical study of our algorithms and validate their performance.

Submitted 23 May, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

Comments: 28 pages, 1 figure

arXiv:2301.04822 [pdf, ps, other]

Private estimation algorithms for stochastic block models and mixture models

Authors: Hongjie Chen, Vincent Cohen-Addad, Tommaso d'Orsi, Alessandro Epasto, Jacob Imola, David Steurer, Stefan Tiegel

Abstract: We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(ε, δ)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(ε, δ)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.

Submitted 15 November, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

arXiv:2212.14220 [pdf, ps, other]

Graph Searching with Predictions

Authors: Siddhartha Banerjee, Vincent Cohen-Addad, Anupam Gupta, Zhouzi Li

Abstract: Consider an agent exploring an unknown graph in search of some goal state. As it walks around the graph, it learns the nodes and their neighbors. The agent only knows where the goal state is when it reaches it. How do we reach this goal while moving only a small distance? This problem seems hopeless, even on trees of bounded degree, unless we give the agent some help. This setting with "help" often arises in exploring large search spaces (e.g., huge game trees) where we assume access to some score/quality function for each node, which we use to guide us towards the goal. In our case, we assume the help comes in the form of distance predictions: each node $v$ provides a prediction $f(v)$ of its distance to the goal vertex. Naturally if these predictions are correct, we can reach the goal along a shortest path. What if the predictions are unreliable and some of them are erroneous? Can we get an algorithm whose performance relates to the error of the predictions? In this work, we consider the problem on trees and give deterministic algorithms whose total movement cost is only $O(OPT + Δ\cdot ERR)$, where $OPT$ is the distance from the start to the goal vertex, $Δ$ is the maximum degree, and $ERR$ is the total number of vertices whose predictions are erroneous. We show this guarantee is optimal. We then consider a "planning" version of the problem where the graph and predictions are known at the beginning, so the agent can use this global information to devise a search strategy of low cost. For this planning version, we go beyond trees and give an algorithm which achieves good performance on (weighted) graphs with bounded doubling dimension.

Submitted 29 December, 2022; originally announced December 2022.

arXiv:2212.06546 [pdf, other]

Streaming Euclidean MST to a Constant Factor

Authors: Vincent Cohen-Addad, Xi Chen, Rajesh Jayaram, Amit Levi, Erik Waingarten

Abstract: We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \mathbb{R}^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+ε)$ approximations are possible in sublinear space [Frahling, Indyk, Sohler, SoCG '05]. However, for high dimensional spaces the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC '22], improving on the prior $O(\log^2 n)$ bound due to [Indyk, STOC '04] and [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $ε\geq 1$, our algorithm achieves an $\tilde{O}(ε^{-2})$ approximation in $n^{O(ε)}$ space. We complement this by proving that any single pass algorithm which obtains a better than $1.10$-approximation must use $Ω(\sqrt{n})$ space, demonstrating that $(1+ε)$ approximations are not possible in high-dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+ε)$ approximations are possible in sublinear space with $O(1/ε)$ passes over the stream. More generally, for any $α\geq 2$, we give a $α$-pass streaming algorithm which achieves a $(1+O(\frac{\log α+ 1}{ αε}))$ approximation in $n^{O(ε)} d^{O(1)}$ space. Our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+ε)$-approximation to Euclidean MST in a constant number of rounds in the MPC model.

Submitted 13 December, 2022; originally announced December 2022.

arXiv:2211.08184 [pdf, other]

Improved Coresets for Euclidean $k$-Means

Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, Omar Ali Sheikh-Omar

Abstract: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a $(1\pm \varepsilon)$ factor. The current state of the art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $Ω(k \varepsilon^{-2})$. In this paper, we improve the upper bounds to $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$.

Submitted 16 November, 2022; v1 submitted 15 November, 2022; originally announced November 2022.

arXiv:2209.01901 [pdf, ps, other]

The Power of Uniform Sampling for Coresets

Authors: Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H. -C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, Xuan Wu

Abstract: Motivated by practical generalizations of the classic $k$-median and $k$-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of $n$, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique also yields smaller coresets for $1$-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^2$ and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^3$.

Submitted 17 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

arXiv:2208.14129 [pdf, other]

On the Fixed-Parameter Tractability of Capacitated Clustering

Authors: Vincent Cohen-Addad, Jason Li

Abstract: We study the complexity of the classic capacitated k-median and k-means problems parameterized by the number of centers, k. These problems are notoriously difficult since the best known approximation bound for high dimensional Euclidean space and general metric space is $Θ(\log k)$ and it remains a major open problem whether a constant factor exists. We show that there exists a $(3+ε)$-approximation algorithm for the capacitated k-median and a $(9+ε)$-approximation algorithm for the capacitated k-means problem in general metric spaces whose running times are $f(ε,k) n^{O(1)}$. For Euclidean inputs of arbitrary dimension, we give a $(1+ε)$-approximation algorithm for both problems with a similar running time. This is a significant improvement over the $(7+ε)$-approximation of Adamczyk et al. for k-median in general metric spaces and the $(69+ε)$-approximation of Xu et al. for Euclidean k-means.

Submitted 30 August, 2022; originally announced August 2022.

Comments: Full version of the ICALP'19 paper (w/ same title, same authors)

arXiv:2208.13920 [pdf, other]

Fitting Metrics and Ultrametrics with Minimum Disagreements

Authors: Vincent Cohen-Addad, Chenglin Fan, Euiwoong Lee, Arnaud de Mesmay

Abstract: Given $x \in (\mathbb{R}_{\geq 0})^{\binom{[n]}{2}}$ recording pairwise distances, the METRIC VIOLATION DISTANCE (MVD) problem asks to compute the $\ell_0$ distance between $x$ and the metric cone; i.e., modify the minimum number of entries of $x$ to make it a metric. Due to its large number of applications in various data analysis and optimization tasks, this problem has been actively studied recently. We present an $O(\log n)$-approximation algorithm for MVD, exponentially improving the previous best approximation ratio of $O(OPT^{1/3})$ of Fan et al. [SODA, 2018]. Furthermore, a major strength of our algorithm is its simplicity and running time. We also study the related problem of ULTRAMETRIC VIOLATION DISTANCE (UMVD), where the goal is to compute the $\ell_0$ distance to the cone of ultrametrics, and achieve a constant factor approximation algorithm. The UMVD can be regarded as an extension of the problem of fitting ultrametrics studied by Ailon and Charikar [SIAM J. Computing, 2011] and by Cohen-Addad et al. [FOCS, 2021] from $\ell_1$ norm to $\ell_0$ norm. We show that this problem can be favorably interpreted as an instance of Correlation Clustering with an additional hierarchical structure, which we solve using a new $O(1)$-approximation algorithm for correlation clustering that has the structural property that it outputs a refinement of the optimum clusters. An algorithm satisfying such a property can be considered of independent interest. We also provide an $O(\log n \log \log n)$ approximation algorithm for weighted instances. Finally, we investigate the complementary version of these problems where one aims at choosing a maximum number of entries of $x$ forming an (ultra-)metric. In stark contrast with the minimization versions, we prove that these maximization versions are hard to approximate within any constant factor assuming the Unique Games Conjecture.

Submitted 29 August, 2022; originally announced August 2022.

Comments: To appear at FOCS 2022 (Full version)

arXiv:2207.10889 [pdf, ps, other]

Correlation Clustering with Sherali-Adams

Authors: Vincent Cohen-Addad, Euiwoong Lee, Alantha Newman

Abstract: Given a complete graph $G = (V, E)$ where each edge is labeled $+$ or $-$, the Correlation Clustering problem asks to partition $V$ into clusters to minimize the number of $+$edges between different clusters plus the number of $-$edges within the same cluster. Correlation Clustering has been used to model a large number of clustering problems in practice, making it one of the most widely studied clustering formulations. The approximability of Correlation Clustering has been actively investigated [BBC04, CGW05, ACN08], culminating in a $2.06$-approximation algorithm [CMSY15], based on rounding the standard LP relaxation. Since the integrality gap for this formulation is 2, it has remained a major open question to determine if the approximation factor of 2 can be reached, or even breached. In this paper, we answer this question affirmatively by showing that there exists a $(1.994 + ε)$-approximation algorithm based on $O(1/ε^2)$ rounds of the Sherali-Adams hierarchy. In order to round a solution to the Sherali-Adams relaxation, we adapt the {\em correlated rounding} originally developed for CSPs [BRS11, GS11, RT12]. With this tool, we reach an approximation ratio of $2+ε$ for Correlation Clustering. To breach this ratio, we go beyond the traditional triangle-based analysis by employing a global charging scheme that amortizes the total cost of the rounding across different triangles.

Submitted 3 May, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

arXiv:2207.05150 [pdf, ps, other]

Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median

Authors: Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn

Abstract: The Uncapacitated Facility Location (UFL) problem is one of the most fundamental clustering problems: Given a set of clients $C$ and a set of facilities $F$ in a metric space $(C \cup F, dist)$ with facility costs $open : F \to \mathbb{R}^+$, the goal is to find a set of facilities $S \subseteq F$ to minimize the sum of the opening cost $open(S)$ and the connection cost $d(S) := \sum_{p \in C} \min_{c \in S} dist(p, c)$. An algorithm for UFL is called a Lagrangian Multiplier Preserving (LMP) $α$ approximation if it outputs a solution $S\subseteq F$ satisfying $open(S) + d(S) \leq open(S^*) + αd(S^*)$ for any $S^* \subseteq F$. The best-known LMP approximation ratio for UFL is at most $2$ by the JMS algorithm of Jain, Mahdian, and Saberi based on the Dual-Fitting technique. We present a (slightly) improved LMP approximation algorithm for UFL. This is achieved by combining the Dual-Fitting technique with Local Search, another popular technique to address clustering problems. From a conceptual viewpoint, our result gives theoretical evidence that local search can be enhanced so as to avoid bad local optima by choosing the initial feasible solution with LP-based techniques. Using the framework of bipoint solutions, our result directly implies a (slightly) improved approximation for the $k$-Median problem from 2.6742 to 2.67059.

Submitted 11 July, 2022; originally announced July 2022.

Comments: 55 pages

arXiv:2206.08646 [pdf, other]

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Authors: Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Andres Munoz, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii

Abstract: We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / ε^2)$, where $ε$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.

Submitted 17 June, 2022; originally announced June 2022.

Comments: To appear at KDD'22

arXiv:2205.12327 [pdf, other]

Beyond Impossibility: Balancing Sufficiency, Separation and Accuracy

Authors: Limor Gultchin, Vincent Cohen-Addad, Sophie Giffard-Roisin, Varun Kanade, Frederik Mallmann-Trenn

Abstract: Among the various aspects of algorithmic fairness studied in recent years, the tension between satisfying both \textit{sufficiency} and \textit{separation} -- e.g. the ratios of positive or negative predictive values, and false positive or false negative rates across groups -- has received much attention. Following a debate sparked by COMPAS, a criminal justice predictive system, the academic community has responded by laying out important theoretical understanding, showing that one cannot achieve both with an imperfect predictor when there is no equal distribution of labels across the groups. In this paper, we shed more light on what might be still possible beyond the impossibility -- the existence of a trade-off means we should aim to find a good balance within it. After refining the existing theoretical result, we propose an objective that aims to balance \textit{sufficiency} and \textit{separation} measures, while maintaining similar accuracy levels. We show the use of such an objective in two empirical case studies, one involving a multi-objective framework, and the other fine-tuning of a model pre-trained for accuracy. We show promising results, where better trade-offs are achieved compared to existing alternatives.

Submitted 24 May, 2022; originally announced May 2022.

arXiv:2204.04828 [pdf, ps, other]

Improved Approximations for Euclidean $k$-means and $k$-median, via Nested Quasi-Independent Sets

Authors: Vincent Cohen-Addad, Hossein Esfandiari, Vahab Mirrokni, Shyam Narayanan

Abstract: Motivated by data analysis and machine learning applications, we consider the popular high-dimensional Euclidean $k$-median and $k$-means problems. We propose a new primal-dual algorithm, inspired by the classic algorithm of Jain and Vazirani and the recent algorithm of Ahmadian, Norouzi-Fard, Svensson, and Ward. Our algorithm achieves an approximation ratio of $2.406$ and $5.912$ for Euclidean $k$-median and $k$-means, respectively, improving upon the 2.633 approximation ratio of Ahmadian et al. and the 6.1291 approximation ratio of Grandoni, Ostrovsky, Rabani, Schulman, and Venkat. Our techniques involve a much stronger exploitation of the Euclidean metric than previous work on Euclidean clustering. In addition, we introduce a new method of removing excess centers using a variant of independent sets over graphs that we dub a "nested quasi-independent set". In turn, this technique may be of interest for other optimization problems in Euclidean and $\ell_p$ metric spaces.

Submitted 11 April, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

Comments: 74 pages. To appear in Symposium on Theory of Computing (STOC), 2022

arXiv:2203.01857 [pdf, ps, other]

Improved Approximation Algorithms and Lower Bounds for Search-Diversification Problems

Authors: Amir Abboud, Vincent Cohen-Addad, Euiwoong Lee, Pasin Manurangsi

Abstract: We study several questions related to diversifying search results. We give improved approximation algorithms in each of the following problems, together with some lower bounds. - We give a polynomial-time approximation scheme (PTAS) for a diversified search ranking problem [Bansal et al., ICALP 2010] whose objective is to minimize the discounted cumulative gain. Our PTAS runs in time $n^{2^{O(\log(1/ε)/ε)}} \cdot m^{O(1)}$ where $n$ denotes the number of elements in the databases. Complementing this, we show that no PTAS can run in time $f(ε) \cdot (nm)^{2^{o(1/ε)}}$ assuming Gap-ETH; therefore our running time is nearly tight. Both of our bounds answer open questions of Bansal et al. - We next consider the Max-Sum Dispersion problem, whose objective is to select $k$ out of $n$ elements that maximize the dispersion, which is defined as the sum of the pairwise distances under a given metric. We give a quasipolynomial-time approximation scheme for the problem which runs in time $n^{O_ε(\log n)}$. This improves upon previously known polynomial-time algorithms with approximation ratio 0.5 [Hassin et al., Oper. Res. Lett. 1997; Borodin et al., ACM Trans. Algorithms 2017]. Furthermore, we observe that known reductions rule out approximation schemes that run in $n^{\tilde{o}_ε(\log n)}$ time assuming ETH. - We consider a generalization of Max-Sum Dispersion called Max-Sum Diversification. In addition to the sum of pairwise distances, the objective includes another function $f$. For monotone submodular $f$, we give a quasipolynomial-time algorithm with approximation ratio arbitrarily close to $(1 - 1/e)$. This improves upon the best polynomial-time algorithm which has approximation ratio $0.5$ by Borodin et al. Furthermore, the $(1 - 1/e)$ factor is tight as achieving better-than-$(1 - 1/e)$ approximation is NP-hard [Feige, J. ACM 1998].

Submitted 3 March, 2022; originally announced March 2022.

arXiv:2203.01440 [pdf, ps, other]

Near-Optimal Correlation Clustering with Privacy

Authors: Vincent Cohen-Addad, Chenglin Fan, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Abstract: Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labelling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our approximation guarantees are stronger than those shown in prior work and are optimal up to logarithmic factors.

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2202.12793 [pdf, other]

Towards Optimal Lower Bounds for k-median and k-means Coresets

Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn

Abstract: Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous $k$-median problem ($z = 1$) and $k$-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given rise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$ clustering must consist of at least $Ω(k \varepsilon^{-2} \log n)$ and $Ω(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$ clustering must consist of at least $Ω(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points.

Submitted 25 February, 2022; originally announced February 2022.
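
To make the $(k,z)$-clustering cost and the coreset guarantee from the abstract above concrete, here is a minimal sketch. It only evaluates the cost of a given set of centers on a (possibly weighted) point set and checks the $(1 \pm \varepsilon)$ condition for one candidate solution; it does not construct a coreset. The function names and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def clustering_cost(points, centers, z, weights=None):
    """(k,z)-clustering cost: weighted sum over points of the distance
    to the closest center, raised to the power z."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )


def preserves_cost(points, coreset, coreset_weights, centers, z, eps):
    """Check the (1 +/- eps) condition for ONE set of centers.

    A true coreset must satisfy this for every candidate solution.
    """
    full = clustering_cost(points, centers, z)
    small = clustering_cost(coreset, centers, z, coreset_weights)
    return abs(small - full) <= eps * full
```

With `z=1` the cost is the $k$-median objective and with `z=2` the $k$-means objective.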

arXiv:2112.03222 [pdf, ps, other]

On Complexity of 1-Center in Various Metrics

Authors: Amir Abboud, Mohammad Hossein Bateni, Vincent Cohen-Addad, Karthik C. S., Saeed Seddighin

Abstract: We consider the classic 1-center problem: Given a set $P$ of $n$ points in a metric space, find the point in $P$ that minimizes the maximum distance to the other points of $P$. We study the complexity of this problem in $d$-dimensional $\ell_p$-metrics and in edit and Ulam metrics over strings of length $d$. Our results for the 1-center problem may be classified based on $d$ as follows. $\bullet$ Small $d$: Assuming the hitting set conjecture (HSC), we show that when $d=ω(\log n)$, no subquadratic algorithm can solve the 1-center problem in any of the $\ell_p$-metrics, or in edit or Ulam metrics. $\bullet$ Large $d$: When $d=Ω(n)$, we extend our conditional lower bound to rule out subquartic algorithms for the 1-center problem in the edit metric (assuming Quantified SETH). On the other hand, we give a $(1+ε)$-approximation for 1-center in the Ulam metric with running time $\tilde{O_{\varepsilon}}(nd+n^2\sqrt{d})$. We also strengthen some of the above lower bounds by allowing approximations or by reducing the dimension $d$, but only against a weaker class of algorithms which list all requisite solutions. Moreover, we extend one of our hardness results to rule out subquartic algorithms for the well-studied 1-median problem in the edit metric, where given a set of $n$ strings each of length $n$, the goal is to find a string in the set that minimizes the sum of the edit distances to the rest of the strings in the set.

Submitted 9 July, 2023; v1 submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.10912 [pdf, ps, other]

Johnson Coverage Hypothesis: Inapproximability of k-means and k-median in L_p metrics

Authors: Vincent Cohen-Addad, Karthik C. S., Euiwoong Lee

Abstract: K-median and k-means are the two most popular objectives for clustering algorithms. Despite intensive effort, a good understanding of the approximability of these objectives, particularly in $\ell_p$-metrics, remains a major open problem. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for these objectives in $\ell_p$-metrics. We introduce a new hypothesis called the Johnson Coverage Hypothesis (JCH), which roughly asserts that the well-studied max k-coverage problem on set systems is hard to approximate to a factor greater than 1-1/e, even when the membership graph of the set system is a subgraph of the Johnson graph. We then show that together with generalizations of the embedding techniques introduced by Cohen-Addad and Karthik (FOCS '19), JCH implies hardness of approximation results for k-median and k-means in $\ell_p$-metrics for factors which are close to the ones obtained for general metrics. In particular, assuming JCH we show that it is hard to approximate the k-means objective: $\bullet$ Discrete case: To a factor of 3.94 in the $\ell_1$-metric and to a factor of 1.73 in the $\ell_2$-metric; this improves upon the previous factors of 1.56 and 1.17 respectively, obtained under UGC. $\bullet$ Continuous case: To a factor of 2.10 in the $\ell_1$-metric and to a factor of 1.36 in the $\ell_2$-metric; this improves upon the previous factor of 1.07 in the $\ell_2$-metric obtained under UGC. We also obtain similar improvements under JCH for the k-median objective. Additionally, we prove a weak version of JCH using the work of Dinur et al. (SICOMP '05) on Hypergraph Vertex Cover, and recover all the results stated above of Cohen-Addad and Karthik (FOCS '19) to (nearly) the same inapproximability factors but now under the standard NP$\neq$P assumption (instead of UGC).

Submitted 21 November, 2021; originally announced November 2021.

Comments: Abstract in metadata shortened to meet arxiv requirements

arXiv:2111.06163 [pdf, other]

A 2-Approximation for the Bounded Treewidth Sparsest Cut Problem in FPT Time

Authors: Vincent Cohen-Addad, Tobias Mömke, Victor Verdugo

Abstract: In the non-uniform sparsest cut problem, we are given a supply graph G and a demand graph D, both with the same set of nodes V. The goal is to find a cut of V that minimizes the ratio of the total capacity on the edges of G crossing the cut over the total demand of the crossing edges of D. In this work, we study the non-uniform sparsest cut problem for supply graphs with bounded treewidth k. For this case, Gupta, Talwar and Witmer [STOC 2013] obtained a 2-approximation with polynomial running time for fixed k, and the question of whether there exists a c-approximation algorithm for a constant c independent of k, that runs in FPT time, remained open. We answer this question in the affirmative. We design a 2-approximation algorithm for the non-uniform sparsest cut with bounded treewidth supply graphs that runs in FPT time, when parameterized by the treewidth. Our algorithm is based on rounding the optimal solution of a linear programming relaxation inspired by the Sherali-Adams hierarchy. In contrast to the classic Sherali-Adams approach, we construct a relaxation driven by a tree decomposition of the supply graph by including a carefully chosen set of lifting variables and constraints to encode information of subsets of nodes with super-constant size, and at the same time we have a sufficiently small linear program that can be solved in FPT time.

Submitted 11 November, 2021; originally announced November 2021.

Comments: 14 pages, 2 figures

arXiv:2111.04589 [pdf, other]

An Improved Local Search Algorithm for k-Median

Authors: Vincent Cohen-Addad, Anupam Gupta, Lunjia Hu, Hoon Oh, David Saulpic

Abstract: We present a new local-search algorithm for the $k$-median clustering problem. We show that local optima for this algorithm give a $(2.836+ε)$-approximation; our result improves upon the $(3+ε)$-approximate local-search algorithm of Arya et al. [STOC 01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximation guarantee for the problem. The new ingredient in our algorithm is the use of a potential function based on both the closest and second-closest facilities to each client. Specifically, the potential is the sum over all clients, of the distance of the client to its closest facility, plus (a small constant times) the truncated distance to its second-closest facility. We move from one solution to another only if the latter can be obtained by swapping a constant number of facilities, and has a smaller potential than the former. This refined potential allows us to avoid the bad local optima given by Arya et al. for the local-search algorithm based only on the cost of the solution.

Submitted 8 November, 2021; originally announced November 2021.

Comments: To appear at SODA 22

ACM Class: F.2.2

arXiv:2110.02807 [pdf, other]

doi 10.1109/FOCS52979.2021.00054

Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor

Authors: Vincent Cohen-Addad, Debarati Das, Evangelos Kipouridis, Nikos Parotsidis, Mikkel Thorup

Abstract: We consider the numerical taxonomy problem of fitting a positive distance function ${D:{S\choose 2}\rightarrow \mathbb R_{>0}}$ by a tree metric. We want a tree $T$ with positive edge weights and including $S$ among the vertices so that their distances in $T$ match those in $D$. A nice application is in evolutionary biology where the tree $T$ aims to approximate the branching process leading to the observed distances in $D$ [Cavalli-Sforza and Edwards 1967]. We consider the total error, that is the sum of distance errors over all pairs of points. We present a deterministic polynomial time algorithm minimizing the total error within a constant factor. We can do this both for general trees, and for the special case of ultrametrics with a root having the same distance to all vertices in $S$. The problems are APX-hard, so a constant factor is the best we can hope for in polynomial time. The best previous approximation factor was $O((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote "Determining whether an $O(1)$ approximation can be obtained is a fascinating question".

Submitted 11 March, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: 46 pages, Accepted to FOCS 2021 (Full version)

arXiv:2106.08448 [pdf, other]

Correlation Clustering in Constant Many Parallel Rounds

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Abstract: Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem on graphs using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental analysis of our techniques.

Submitted 15 June, 2021; originally announced June 2021.

Comments: ICML 2021 (long talk)

arXiv:2106.08195 [pdf, ps, other]

A Linear-Time $n^{0.4}$-Approximation for Longest Common Subsequence

Authors: Karl Bringmann, Vincent Cohen-Addad, Debarati Das

Abstract: We consider the classic problem of computing the Longest Common Subsequence (LCS) of two strings of length $n$. While a simple quadratic algorithm has been known for the problem for more than 40 years, no faster algorithm has been found despite an extensive effort. The lack of progress on the problem has recently been explained by Abboud, Backurs, and Vassilevska Williams [FOCS'15] and Bringmann and Künnemann [FOCS'15] who proved that there is no subquadratic algorithm unless the Strong Exponential Time Hypothesis fails. This has led the community to look for subquadratic approximation algorithms for the problem. Yet, unlike the edit distance problem for which a constant-factor approximation in almost-linear time is known, very little progress has been made on LCS, making it a notoriously difficult problem also in the realm of approximation. For the general setting, only a naive $O(n^{\varepsilon/2})$-approximation algorithm with running time $\tilde{O}(n^{2-\varepsilon})$ has been known, for any constant $0 < \varepsilon \le 1$. Recently, a breakthrough result by Hajiaghayi, Seddighin, Seddighin, and Sun [SODA'19] provided a linear-time algorithm that yields an $O(n^{0.497956})$-approximation in expectation; improving upon the naive $O(\sqrt{n})$-approximation for the first time. In this paper, we provide an algorithm that in time $O(n^{2-\varepsilon})$ computes an $\tilde{O}(n^{2\varepsilon/5})$-approximation with high probability, for any $0 < \varepsilon \le 1$. Our result (1) gives an $\tilde{O}(n^{0.4})$-approximation in linear time, improving upon the bound of Hajiaghayi, Seddighin, Seddighin, and Sun, (2) provides an algorithm whose approximation scales with any subquadratic running time $O(n^{2-\varepsilon})$, improving upon the naive bound of $O(n^{\varepsilon/2})$ for any $\varepsilon$, and (3) instead of only in expectation, succeeds with high probability.

Submitted 15 June, 2021; originally announced June 2021.

Comments: full version of ICALP'21 paper, abstract shortened to fit Arxiv requirements

MSC Class: 68W32; 68W25 ACM Class: F.2.2
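
The "simple quadratic algorithm" mentioned in the abstract above is the textbook dynamic program for LCS. The sketch below shows that exact baseline, not the paper's approximation algorithm; the function name is chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Classic dynamic program: O(len(a) * len(b)) time, O(len(b)) space.
    prev[j] holds the LCS length of the processed prefix of a and b[:j].
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both symbols.
                cur.append(prev[j - 1] + 1)
            else:
                # Drop the last symbol of a or of b, whichever is better.
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

An $α$-approximation in the sense of the abstract returns a common subsequence of length at least the exact optimum divided by $α$, so this routine can serve as ground truth when evaluating faster approximate methods on small inputs.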

arXiv:2105.15187 [pdf, other]

A Quasipolynomial $(2+\varepsilon)$-Approximation for Planar Sparsest Cut

Authors: Vincent Cohen-Addad, Anupam Gupta, Philip N. Klein, Jason Li

Abstract: The (non-uniform) sparsest cut problem is the following graph-partitioning problem: given a "supply" graph, and demands on pairs of vertices, delete some subset of supply edges to minimize the ratio of the supply edges cut to the total demand of the pairs separated by this deletion. Despite much effort, there are only a handful of nontrivial classes of supply graphs for which constant-factor approximations are known. We consider the problem for planar graphs, and give a $(2+\varepsilon)$-approximation algorithm that runs in quasipolynomial time. Our approach defines a new structural decomposition of an optimal solution using a "patching" primitive. We combine this decomposition with a Sherali-Adams-style linear programming relaxation of the problem, which we then round. This should be compared with the polynomial-time approximation algorithm of Rao (1999), which uses the metric linear programming relaxation and $\ell_1$-embeddings, and achieves an $O(\sqrt{\log n})$-approximation in polynomial time.

Submitted 31 May, 2021; originally announced May 2021.

Comments: To appear at STOC 2021

arXiv:2104.06133 [pdf, other]

doi 10.1145/3406325.3451022

A New Coreset Framework for Clustering

Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

Abstract: Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as \emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean spaces and doubling metrics to minor-free metrics and general metrics.

Submitted 29 July, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: Improved presentation. Adds a simpler suboptimal proof for interesting points, and an improved analysis for planar graphs. Corrects errors in the construction of centroid sets

arXiv:2012.11891 [pdf, ps, other]

Fast and Accurate $k$-means++ via Rejection Sampling

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

Abstract: $k$-means++ (Arthur and Vassilvitskii, 2007) is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees and strong empirical performance. Despite its wide adoption, $k$-means++ sometimes suffers from being slow on large data sets, so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present a near linear time algorithm for $k$-means++ seeding. Interestingly, our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.

Submitted 22 December, 2020; originally announced December 2020.
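
For context on the seeding step that the abstract above aims to speed up, the following is a minimal sketch of the classic $k$-means++ seeding ($D^2$ sampling). It is the standard baseline, which takes time proportional to $nk$ distance evaluations, and not the paper's rejection-sampling algorithm. The function name and the `rng` parameter are assumptions made for illustration.

```python
import math
import random


def kmeanspp_seeding(points, k, rng=None):
    """Classic k-means++ seeding: pick the first center uniformly at random,
    then pick each next center with probability proportional to the squared
    distance to the closest center chosen so far."""
    if rng is None:
        rng = random.Random()
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its closest chosen center.
    d2 = [math.dist(p, centers[0]) ** 2 for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform choice.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, math.dist(p, nxt) ** 2) for p, d in zip(points, d2)]
    return centers
```

Each round rescans all points to update `d2`, which is the cost that faster seeding methods try to avoid.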

arXiv:2010.00087 [pdf, ps, other]

On Approximability of Clustering Problems Without Candidate Centers

Authors: Vincent Cohen-Addad, Karthik C. S., Euiwoong Lee

Abstract: The k-means objective is arguably the most widely-used cost function for modeling clustering tasks in a metric space. In practice and historically, k-means is thought of in a continuous setting, namely where the centers can be located anywhere in the metric space. For example, the popular Lloyd's heuristic locates a center at the mean of each cluster. Despite persistent efforts on understanding the approximability of k-means, and other classic clustering problems such as k-median and k-minsum, our knowledge of the hardness of approximation factors of these problems remains quite poor. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for these objectives. We show that if the input lies in a general metric space, it is NP-hard to approximate: $\bullet$ Continuous k-median to a factor of $2-o(1)$; this improves upon the previous inapproximability factor of 1.36 shown by Guha and Khuller (J. Algorithms '99). $\bullet$ Continuous k-means to a factor of $4- o(1)$; this improves upon the previous inapproximability factor of 2.10 shown by Guha and Khuller (J. Algorithms '99). $\bullet$ k-minsum to a factor of $1.415$; this improves upon the APX-hardness shown by Guruswami and Indyk (SODA '03). Our results shed new and perhaps counter-intuitive light on the differences between clustering problems in the continuous setting versus the discrete setting (where the candidate centers are given as part of the input).

Submitted 2 October, 2020; v1 submitted 30 September, 2020; originally announced October 2020.

arXiv:2009.05039 [pdf, other]

On Light Spanners, Low-treewidth Embeddings and Efficient Traversing in Minor-free Graphs

Authors: Vincent Cohen-Addad, Arnold Filtser, Philip N. Klein, Hung Le

Abstract: Understanding the structure of minor-free metrics, namely shortest path metrics obtained over a weighted graph excluding a fixed minor, has been an important research direction since the fundamental work of Robertson and Seymour. A fundamental idea that helps both to understand the structural properties of these metrics and leads to strong algorithmic results is to construct a "small-complexity" graph that approximately preserves distances between pairs of points of the metric. We show the two following structural results for minor-free metrics: 1. Construction of a light subset spanner. Given a subset of vertices called terminals, and $ε$, in polynomial time we construct a subgraph that preserves all pairwise distances between terminals up to a multiplicative $1+ε$ factor, of total weight at most $O_ε(1)$ times the weight of the minimal Steiner tree spanning the terminals. 2. Construction of a stochastic metric embedding into low treewidth graphs with expected additive distortion $εD$. Namely, given a minor free graph $G=(V,E,w)$ of diameter $D$, and parameter $ε$, we construct a distribution $\mathcal{D}$ over dominating metric embeddings into treewidth-$O_ε(\log n)$ graphs such that the additive distortion is at most $εD$. One of our important technical contributions is a novel framework that allows us to reduce \emph{both problems} to problems on simpler graphs of bounded diameter. Our results have the following algorithmic consequences: (1) the first efficient approximation scheme for subset TSP in minor-free metrics; (2) the first approximation scheme for vehicle routing with bounded capacity in minor-free metrics; (3) the first efficient approximation scheme for vehicle routing with bounded capacity on bounded genus metrics.

Submitted 10 September, 2020; originally announced September 2020.

Comments: 65 pages, 6 figures. Abstract shortened due to limited characters

ACM Class: F.2.2

arXiv:2009.00188 [pdf, other]

On the computational tractability of a geographic clustering problem arising in redistricting

Authors: Vincent Cohen-Addad, Philip N. Klein, Dániel Marx

Abstract: Redistricting is the problem of dividing a state into a number $k$ of regions, called districts. Voters in each district elect a representative. The primary criteria are: each district is connected, district populations are equal (or nearly equal), and districts are "compact". There are multiple competing definitions of compactness, usually minimizing some quantity. One measure that has been recently promoted by Duchin and others is the number of cut edges. In redistricting, one is given atomic regions out of which each district must be built. The populations of the atomic regions are given. Consider the graph with one vertex per atomic region (with weight equal to the region's population) and an edge between atomic regions that share a boundary. A districting plan is a partition of vertices into $k$ parts, each connected, of nearly equal weight. The districts are considered compact to the extent that the plan minimizes the number of edges crossing between different parts. Consider two problems: find the most compact districting plan, and sample districting plans under a compactness constraint uniformly at random. Both problems are NP-hard so we restrict the input graph to have branchwidth at most $w$. (A planar graph's branchwidth is bounded by its diameter.) If both $k$ and $w$ are bounded by constants, the problems are solvable in polynomial time. Assume vertices have weight 1. One would like algorithms whose running times are of the form $O(f(k,w) n^c)$ for some constant $c$ independent of $k$ and $w$ (in which case the problems are said to be fixed-parameter tractable with respect to $k$ and $w$). We show that, under a complexity-theoretic assumption, no such algorithms exist. However, we do give algorithms with running time $O(c^wn^{k+1})$. Thus if the diameter of the graph is moderately small and the number of districts is very small, our algorithm is usable.

Submitted 31 August, 2020; originally announced September 2020.

arXiv:2008.06700 [pdf, other]

On Efficient Low Distortion Ultrametric Embedding

Authors: Vincent Cohen-Addad, Karthik C. S., Guillaume Lagarde

Abstract: A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, these methods on a data set of $n$ points in $Ω(\log n)$ dimensions exhibit a quite prohibitive running time of $Θ(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c\ge 1$, runs in time $n^{1+\frac{ρ}{c^2}}$ (for some universal constant $ρ>1$) to output an ultrametric $Δ$ such that for any two points $u,v$ in $P$, $Δ(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde{Δ}$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\underset{u,v \in P}{\max}\ \frac{\tilde{Δ}(u,v)}{\|u-v\|_2}$. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $\varepsilon>0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in $\ell_\infty$-metric that admit isometric embedding and those that incur a distortion of $\frac{3}{2}$. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.

Submitted 15 August, 2020; originally announced August 2020.

arXiv:2007.02377 [pdf, other]

New Hardness Results for Planar Graph Problems in P and an Algorithm for Sparsest Cut

Authors: Amir Abboud, Vincent Cohen-Addad, Philip N. Klein

Abstract: The Sparsest Cut is a fundamental optimization problem that has been extensively studied. For planar inputs the problem is in $P$ and can be solved in $\tilde{O}(n^3)$ time if all vertex weights are $1$. Despite a significant amount of effort, the best algorithms date back to the early 90's and can only achieve $O(\log n)$-approximation in $\tilde{O}(n)$ time or a constant factor approximation in $\tilde{O}(n^2)$ time [Rao, STOC92]. Our main result is an $Ω(n^{2-ε})$ lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the $(\min,+)$-Convolution conjecture, showing that approximations are inevitable in the near-linear time regime. To complement the lower bound, we provide a constant factor approximation in near-linear time, improving upon the 25-year old result of Rao in both time and accuracy. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Moreover, we prove near-quadratic lower bounds under SETH for variants of the closest pair problem in planar graphs, and use them to show that the popular Average-Linkage procedure for Hierarchical Clustering cannot be simulated in truly subquadratic time. We prove an $Ω(n/\log{n})$ lower bound on the number of communication rounds required to compute the weighted diameter of a network in the CONGEST model, even when the underlying graph is planar and all nodes are $D=4$ hops away from each other. This is the first poly($n$) + $ω(D)$ lower bound in the planar-distributed setting, and it complements the recent poly$(D, \log{n})$ upper bounds of Li and Parter [STOC 2019] for (exact) unweighted diameter and for ($1+ε$) approximate weighted diameter.

Submitted 5 July, 2020; originally announced July 2020.

arXiv:1909.06861 [pdf, other]

Online k-means Clustering

Authors: Vincent Cohen-Addad, Benjamin Guedj, Varun Kanade, Guy Rom

Abstract: We study the problem of online clustering where a clustering algorithm has to assign a new point that arrives to one of $k$ clusters. The specific formulation we use is the $k$-means objective: At each time step the algorithm has to maintain a set of $k$ candidate centers and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the $k$-means objective ($\mathcal{C}$) in hindsight. We show that provided the data lies in a bounded region, an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centers at each round) implies an offline efficient algorithm for the $k$-means problem. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + ε) \mathcal{C}$ and present a no-regret algorithm with runtime $O\left(T \cdot \mathrm{poly}(\log(T),k,d,1/ε)^{k(d+O(1))}\right)$. Our algorithm is based on maintaining an incremental coreset and an adaptive variant of the MWUA. We show that naïve online algorithms, such as \emph{Follow The Leader}, fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data.

Submitted 15 September, 2019; originally announced September 2019.

Comments: 11 pages, 1 figure

Journal ref: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 130:1126-1134, 2021

Showing 1–50 of 74 results for author: Cohen-Addad, V