Search | arXiv e-print repository

Improved Solutions for Multidimensional Approximate Agreement via Centroid Computation

Abstract: In this paper, we present distributed fault-tolerant algorithms that approximate the centroid of a set of $n$ data points in $\mathbb{R}^d$. Our work falls into the broader area of approximate multidimensional Byzantine agreement. The standard approach used in existing algorithms is to agree on a vector inside the convex hull of all correct vectors. This strategy dismisses many possibly correct da… ▽ More In this paper, we present distributed fault-tolerant algorithms that approximate the centroid of a set of $n$ data points in $\mathbb{R}^d$. Our work falls into the broader area of approximate multidimensional Byzantine agreement. The standard approach used in existing algorithms is to agree on a vector inside the convex hull of all correct vectors. This strategy dismisses many possibly correct data points. As a result, the algorithm does not necessarily agree on a representative value. To find better convergence strategies for the algorithms, we use the novel concept of defining an approximation of the centroid in the presence of Byzantine adversaries. We show that the standard agreement algorithms do not allow us to compute a better approximation than $2d$ of the centroid in the synchronous case. We investigate the trade-off between the quality of the approximation, the resilience of the algorithm, and the validity of the solution in order to design better approximation algorithms. For the synchronous case, we show that it is possible to achieve an optimal approximation of the centroid with up to $t<n/(d+1)$ Byzantine data points. This approach however does not give any guarantee on the validity of the solution. Therefore, we develop a second approach that reaches a $2\sqrt{d}$-approximation of the centroid, while satisfying the standard validity condition for agreement protocols. We are even able to restrict the validity condition to agreement inside the box of correct data points, while achieving optimal resilience of $t< n/3$. For the asynchronous case, we can adapt all three algorithms to reach the same approximation results (up to a constant factor). Our results suggest that it is reasonable to study the trade-off between validity conditions and the quality of the solution. △ Less

Submitted 13 September, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

arXiv:2306.00432 [pdf, ps, other]

Time and Space Optimal Massively Parallel Algorithm for the 2-Ruling Set Problem

Authors: Mélanie Cambus, Fabian Kuhn, Shreyas Pai, Jara Uitto

Abstract: In this work, we present a constant-round algorithm for the $2$-ruling set problem in the Congested Clique model. As a direct consequence, we obtain a constant round algorithm in the MPC model with linear space-per-machine and optimal total space. Our results improve on the $O(\log \log \log n)$-round algorithm by [HPS, DISC'14] and the $O(\log \log Δ)$-round algorithm by [GGKMR, PODC'18]. Our tec… ▽ More In this work, we present a constant-round algorithm for the $2$-ruling set problem in the Congested Clique model. As a direct consequence, we obtain a constant round algorithm in the MPC model with linear space-per-machine and optimal total space. Our results improve on the $O(\log \log \log n)$-round algorithm by [HPS, DISC'14] and the $O(\log \log Δ)$-round algorithm by [GGKMR, PODC'18]. Our techniques can also be applied to the semi-streaming model to obtain an $O(1)$-pass algorithm. Our main technical contribution is a novel sampling procedure that returns a small subgraph such that almost all nodes in the input graph are adjacent to the sampled subgraph. An MIS on the sampled subgraph provides a $2$-ruling set for a large fraction of the input graph. As a technical challenge, we must handle the remaining part of the graph, which might still be relatively large. We overcome this challenge by showing useful structural properties of the remaining graph and show that running our process twice yields a $2$-ruling set of the original input graph with high probability. △ Less

Submitted 10 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2205.07593 [pdf, ps, other]

A $(3+\varepsilon)$-Approximate Correlation Clustering Algorithm in Dynamic Streams

Authors: Mélanie Cambus, Fabian Kuhn, Etna Lindy, Shreyas Pai, Jara Uitto

Abstract: Grou** together similar elements in datasets is a common task in data mining and machine learning. In this paper, we study streaming and parallel algorithms for correlation clustering, where each pair of elements is labeled either similar or dissimilar. The task is to partition the elements and the objective is to minimize disagreements, that is, the number of dissimilar elements grouped togethe… ▽ More Grou** together similar elements in datasets is a common task in data mining and machine learning. In this paper, we study streaming and parallel algorithms for correlation clustering, where each pair of elements is labeled either similar or dissimilar. The task is to partition the elements and the objective is to minimize disagreements, that is, the number of dissimilar elements grouped together and similar elements that get separated. Our main contribution is a semi-streaming algorithm that achieves a $(3 + \varepsilon)$-approximation to the minimum number of disagreements using a single pass over the stream. In addition, the algorithm also works for dynamic streams. Our approach builds on the analysis of the PIVOT algorithm by Ailon, Charikar, and Newman [JACM'08] that obtains a $3$-approximation in the centralized setting. Our design allows us to sparsify the input graph by ignoring a large portion of the nodes and edges without a large extra cost as compared to the analysis of PIVOT. This sparsification makes our technique applicable in several models of massive graph processing, such as semi-streaming and Massively Parallel Computing (MPC), where sparse graphs can typically be handled much more efficiently. Our work improves on the approximation ratio of the recent single-pass $5$-approximation algorithm and on the number of passes of the recent $O(1/\varepsilon)$-pass $(3 + \varepsilon)$-approximation algorithm [Behnezhad, Charikar, Ma, Tan FOCS'22, SODA'23]. Our algorithm is also more robust and can be applied in dynamic streams. Furthermore, it is the first single pass $(3 + \varepsilon)$-approximation algorithm that uses polynomial post-processing time. △ Less

Submitted 1 November, 2023; v1 submitted 16 May, 2022; originally announced May 2022.

Comments: to appear in SODA 2024

arXiv:2102.11660 [pdf, other]

Massively Parallel Correlation Clustering in Bounded Arboricity Graphs

Authors: Mélanie Cambus, Davin Choo, Havu Miikonen, Jara Uitto

Abstract: Identifying clusters of similar elements in a set is a common task in data analysis. With the immense growth of data and physical limitations on single processor speed, it is necessary to find efficient parallel algorithms for clustering tasks. In this paper, we study the problem of correlation clustering in bounded arboricity graphs with respect to the Massively Parallel Computation (MPC) model.… ▽ More Identifying clusters of similar elements in a set is a common task in data analysis. With the immense growth of data and physical limitations on single processor speed, it is necessary to find efficient parallel algorithms for clustering tasks. In this paper, we study the problem of correlation clustering in bounded arboricity graphs with respect to the Massively Parallel Computation (MPC) model. More specifically, we are given a complete graph where the edges are either positive or negative, indicating whether pairs of vertices are similar or dissimilar. The task is to partition the vertices into clusters with as few disagreements as possible. That is, we want to minimize the number of positive inter-cluster edges and negative intra-cluster edges. Consider an input graph $G$ on $n$ vertices such that the positive edges induce a $λ$-arboric graph. Our main result is a 3-approximation ($\textit{in expectation}$) algorithm to correlation clustering that runs in $\mathcal{O}(\log λ\cdot \textrm{poly}(\log \log n))$ MPC rounds in the $\textit{strongly sublinear memory regime}$. This is obtained by combining structural properties of correlation clustering on bounded arboricity graphs with the insights of Fischer and Noever (SODA '18) on randomized greedy MIS and the $\texttt{PIVOT}$ algorithm of Ailon, Charikar, and Newman (STOC '05). Combined with known graph matching algorithms, our structural property also implies an exact algorithm and algorithms with $\textit{worst case}$ $(1+ε)$-approximation guarantees in the special case of forests, where $λ=1$. △ Less

Submitted 6 August, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Showing 1–4 of 4 results for author: Cambus, M