-
Hierarchical Agglomerative Graph Clustering in Poly-Logarithmic Depth
Authors:
Laxman Dhulipala,
David Eisenstat,
Jakub Łącki,
Vahab Mirrokni,
Jessica Shi
Abstract:
Obtaining scalable algorithms for hierarchical agglomerative clustering (HAC) is of significant interest due to the massive size of real-world datasets. At the same time, efficiently parallelizing HAC is difficult due to the seemingly sequential nature of the algorithm. In this paper, we address this issue and present ParHAC, the first efficient parallel HAC algorithm with sublinear depth for the widely-used average-linkage function. In particular, we provide a $(1+ε)$-approximation algorithm for this problem on $m$ edge graphs using $\tilde{O}(m)$ work and poly-logarithmic depth. Moreover, we show that obtaining similar bounds for exact average-linkage HAC is not possible under standard complexity-theoretic assumptions.
We complement our theoretical results with a comprehensive study of the ParHAC algorithm in terms of its scalability, performance, and quality, and compare with several state-of-the-art sequential and parallel baselines. On a broad set of large publicly-available real-world datasets, we find that ParHAC obtains a 50.1x speedup on average over the best sequential baseline, while achieving quality similar to the exact HAC algorithm. We also show that ParHAC can cluster one of the largest publicly available graph datasets with 124 billion edges in a little over three hours using a commodity multicore machine.
Submitted 23 June, 2022;
originally announced June 2022.
-
Scalable Community Detection via Parallel Correlation Clustering
Authors:
Jessica Shi,
Laxman Dhulipala,
David Eisenstat,
Jakub Łącki,
Vahab Mirrokni
Abstract:
Graph clustering and community detection are central problems in modern data mining. The increasing need for analyzing billion-scale data calls for faster and more scalable algorithms for these problems. There are certain trade-offs between the quality and speed of such clustering algorithms. In this paper, we design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework based on the LambdaCC objective (introduced by Veldt et al.), which encompasses modularity and correlation clustering. Our framework consists of highly-optimized implementations that scale to large data sets of billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines while maintaining or improving quality.
Submitted 27 July, 2021;
originally announced August 2021.
-
Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
Authors:
Laxman Dhulipala,
David Eisenstat,
Jakub Łącki,
Vahab Mirrokni,
Jessica Shi
Abstract:
We study the widely used hierarchical agglomerative clustering (HAC) algorithm on edge-weighted graphs. We define an algorithmic framework for hierarchical agglomerative graph clustering that provides the first efficient $\tilde{O}(m)$ time exact algorithms for classic linkage measures, such as complete- and WPGMA-linkage, as well as other measures. Furthermore, for average-linkage, arguably the most popular variant of HAC, we provide an algorithm that runs in $\tilde{O}(n\sqrt{m})$ time. For this variant, this is the first exact algorithm that runs in subquadratic time, as long as $m=n^{2-ε}$ for some constant $ε> 0$. We complement this result with a simple $ε$-close approximation algorithm for average-linkage in our framework that runs in $\tilde{O}(m)$ time. As an application of our algorithms, we consider clustering points in a metric space by first using $k$-NN to generate a graph from the point set, and then running our algorithms on the resulting weighted graph. We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7--76.5x.
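For orientation, average-linkage HAC on a weighted graph can be sketched in a few lines. The sketch below is a naive version with a linear scan per merge, not the paper's $\tilde{O}(n\sqrt{m})$ or $\tilde{O}(m)$ algorithms. It assumes edge weights are similarities and that the linkage between two clusters is the total weight of the edges between them divided by the product of their sizes. The function name `average_linkage_hac` and the input layout are hypothetical.

```python
def average_linkage_hac(n, edges):
    """Naive average-linkage HAC on a weighted graph (illustrative only).

    n     -- number of vertices, labeled 0..n-1
    edges -- dict mapping (u, v) to a similarity weight (assumed layout)
    Returns a list of merges (cluster_a, cluster_b, linkage_value).
    """
    clusters = {i: frozenset([i]) for i in range(n)}
    # cut[(a, b)] = total weight of edges between clusters a and b, with a < b
    cut = {}
    for (u, v), w in edges.items():
        key = (min(u, v), max(u, v))
        cut[key] = cut.get(key, 0.0) + w

    def linkage(item):
        (a, b), w = item
        return w / (len(clusters[a]) * len(clusters[b]))

    merges = []
    next_id = n
    while cut:
        # Pick the adjacent pair of clusters with the highest average linkage.
        best = max(cut.items(), key=linkage)
        (a, b), _ = best
        merges.append((clusters[a], clusters[b], linkage(best)))
        merged = clusters.pop(a) | clusters.pop(b)
        # Redirect edges of a and b to the new cluster, summing parallel edges.
        new_cut = {}
        for (x, y), w in cut.items():
            if (x, y) == (a, b):
                continue  # this edge is now internal to the merged cluster
            x2 = next_id if x in (a, b) else x
            y2 = next_id if y in (a, b) else y
            key = (min(x2, y2), max(x2, y2))
            new_cut[key] = new_cut.get(key, 0.0) + w
        cut = new_cut
        clusters[next_id] = merged
        next_id += 1
    return merges
```

Each merge here costs time proportional to the number of remaining inter-cluster edges, which is the cost that a faster framework has to avoid.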
Submitted 10 June, 2021;
originally announced June 2021.
-
Design and Analysis of Bipartite Experiments under a Linear Exposure-Response Model
Authors:
Christopher Harshaw,
Fredrik Sävje,
David Eisenstat,
Vahab Mirrokni,
Jean Pouget-Abadie
Abstract:
A bipartite experiment consists of one set of units being assigned treatments and another set of units for which we measure outcomes. The two sets of units are connected by a bipartite graph, governing how the treated units can affect the outcome units. In this paper, we consider estimation of the average total treatment effect in the bipartite experimental framework under a linear exposure-response model. We introduce the Exposure Reweighted Linear (ERL) estimator, and show that the estimator is unbiased, consistent and asymptotically normal, provided that the bipartite graph is sufficiently sparse. To facilitate inference, we introduce an unbiased and consistent estimator of the variance of the ERL point estimator. In addition, we introduce a cluster-based design, Exposure-Design, that uses heuristics to increase the precision of the ERL estimator by realizing a desirable exposure distribution.
Submitted 1 December, 2021; v1 submitted 10 March, 2021;
originally announced March 2021.
-
Time-Space Trade-offs in Population Protocols
Authors:
Dan Alistarh,
James Aspnes,
David Eisenstat,
Rati Gelashvili,
Ronald L. Rivest
Abstract:
Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform computational tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires stabilization to a single agent in a special "leader" state), and majority (in which agents must stabilize to a decision as to which of two possible initial states had higher initial count). Known results point towards an inherent trade-off between the time complexity of such algorithms, and the space complexity, i.e. size of the memory available to each agent.
In this paper, we explore this trade-off and provide new upper and lower bounds for majority and leader election. First, we prove a unified lower bound, which relates the space available per node to the time complexity achievable by a protocol: for instance, our result implies that any protocol solving either of these tasks for $n$ agents using $O( \log \log n )$ states must take $Ω( n / \mathrm{polylog}\, n )$ expected time. This is the first result to characterize time complexity for protocols which employ a super-constant number of states per node, and proves that fast, poly-logarithmic running times require protocols to have relatively large space costs.
On the positive side, we give algorithms showing that fast, poly-logarithmic stabilization time can be achieved using $O( \log^2 n )$ space per node, in the case of both tasks. Overall, our results highlight a time complexity separation between $O(\log \log n)$ and $Θ( \log^2 n )$ state space size for both majority and leader election in population protocols, and introduce new techniques, which should be applicable more broadly.
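To make the slow end of this trade-off concrete, the sketch below simulates the textbook two-state leader election protocol, not the algorithms of this paper. Every agent starts as a leader, and when two leaders interact, one of them becomes a follower. The function name `simulate_leader_election` and the uniform random scheduler built on `random.Random` are assumptions made for illustration.

```python
import random


def simulate_leader_election(n, seed=0):
    """Simulate the classic two-state leader election protocol.

    Returns (number_of_interactions, final_leader_flags).
    """
    rng = random.Random(seed)
    is_leader = [True] * n  # every agent starts in the "leader" state
    leaders = n
    interactions = 0
    while leaders > 1:
        # The scheduler picks an ordered pair of distinct agents uniformly.
        i, j = rng.sample(range(n), 2)
        interactions += 1
        if is_leader[i] and is_leader[j]:
            is_leader[j] = False  # the responder drops out
            leaders -= 1
    return interactions, is_leader
```

With $k$ leaders remaining, an interaction removes one with probability $k(k-1)/(n(n-1))$, so the expected number of interactions is $\sum_{k=2}^{n} n(n-1)/(k(k-1)) = (n-1)^2$. Dividing by $n$ gives roughly linear parallel time for this constant-state protocol.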
Submitted 17 April, 2017; v1 submitted 25 February, 2016;
originally announced February 2016.
-
Facility Location in Evolving Metrics
Authors:
David Eisenstat,
Claire Mathieu,
Nicolas Schabanel
Abstract:
Understanding the dynamics of evolving social or infrastructure networks is a challenge in applied areas such as epidemiology, viral marketing, or urban planning. During the past decade, data has been collected on such networks but has yet to be fully analyzed. We propose to use information on the dynamics of the data to find stable partitions of the network into groups. For that purpose, we introduce a time-dependent, dynamic version of the facility location problem that includes a switching cost when a client's assignment changes from one facility to another. This might provide a better representation of an evolving network, emphasizing the abrupt change of relationships between subjects rather than the continuous evolution of the underlying network. We show that in realistic examples this model indeed yields better-fitting solutions than optimizing every snapshot independently. We present an $O(\log nT)$-approximation algorithm and a matching hardness result, where $n$ is the number of clients and $T$ the number of time steps. We also give another algorithm with approximation ratio $O(\log nT)$ for the variant where one pays at each time step (leasing) for each open facility.
Submitted 26 March, 2014;
originally announced March 2014.
-
An efficient polynomial-time approximation scheme for Steiner forest in planar graphs
Authors:
David Eisenstat,
Philip Klein,
Claire Mathieu
Abstract:
We give an $O(n \log^3 n)$ approximation scheme for Steiner forest in planar graphs, improving on the previous approximation scheme for this problem, which runs in $O(n^{f(ε)})$ time.
Submitted 25 October, 2011; v1 submitted 6 October, 2011;
originally announced October 2011.
-
Random road networks: the quadtree model
Authors:
David Eisenstat
Abstract:
What does a typical road network look like? Existing generative models tend to focus on one aspect to the exclusion of others. We introduce the general-purpose \emph{quadtree model} and analyze its shortest paths and maximum flow.
Submitted 27 January, 2011; v1 submitted 29 August, 2010;
originally announced August 2010.
-
Two-enqueuer queue in Common2
Authors:
David Eisenstat
Abstract:
The question of whether all shared objects with consensus number 2 belong to Common2, the set of objects that can be implemented in a wait-free manner by any type of consensus number 2, was first posed by Herlihy. In the absence of general results, several researchers have obtained implementations for restricted-concurrency versions of FIFO queues. We present the first Common2 algorithm for a queue with two enqueuers and any number of dequeuers.
Submitted 7 April, 2009; v1 submitted 4 May, 2008;
originally announced May 2008.
-
The computational power of population protocols
Authors:
Dana Angluin,
James Aspnes,
David Eisenstat,
Eric Ruppert
Abstract:
We consider the model of population protocols introduced by Angluin et al., in which anonymous finite-state agents stably compute a predicate of the multiset of their inputs via two-way interactions in the all-pairs family of communication networks. We prove that all predicates stably computable in this model (and certain generalizations of it) are semilinear, answering a central open question about the power of the model. Removing the assumption of two-way interaction, we also consider several variants of the model in which agents communicate by anonymous message-passing where the recipient of each message is chosen by an adversary and the sender is not identified to the recipient. These one-way models are distinguished by whether messages are delivered immediately or after a delay, whether a sender can record that it has sent a message, and whether a recipient can queue incoming messages, refusing to accept new messages until it has had a chance to send out messages of its own. We characterize the classes of predicates stably computable in each of these one-way models using natural subclasses of the semilinear predicates.
Submitted 21 August, 2006;
originally announced August 2006.