-
Stars: Tera-Scale Graph Building for Clustering and Graph Learning
Authors:
CJ Carey,
Jonathan Halcrow,
Rajesh Jayaram,
Vahab Mirrokni,
Warren Schudy,
Peilin Zhong
Abstract:
A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present $\textit{Stars}$: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars for multiple data sets allowing for graph building at the $\textit{Tera-Scale}$, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10~1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2~10-fold improvement in running time without quality loss.
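The two-hop spanner property described in this abstract can be illustrated with a small check. The sketch below is not the Stars algorithm itself; the function name `is_two_hop_spanner`, the edge-list representation, and the toy star-shaped graph are assumptions made purely for illustration.

```python
from collections import defaultdict

def is_two_hop_spanner(edges, similar_pairs):
    """Return True if every similar pair is joined by a path of length <= 2."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in similar_pairs:
        if u == v or v in adj[u]:
            continue  # zero or one hop
        if adj[u] & adj[v]:
            continue  # two hops through a shared neighbor
        return False
    return True

# Toy example (hypothetical data): a star centered at node 0 covers all
# 6 pairs among {0, 1, 2, 3} while storing only 3 edges.
star_edges = [(0, 1), (0, 2), (0, 3)]
all_pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
covered = is_two_hop_spanner(star_edges, all_pairs)
```

In this toy case the star needs 3 edges where the complete graph would need 6, which is the kind of sparsity gain the abstract refers to.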
Submitted 9 January, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
Practical Large-Scale Linear Programming using Primal-Dual Hybrid Gradient
Authors:
David Applegate,
Mateo Díaz,
Oliver Hinder,
Haihao Lu,
Miles Lubin,
Brendan O'Donoghue,
Warren Schudy
Abstract:
We present PDLP, a practical first-order method for linear programming (LP) that can solve to the high levels of accuracy that are expected in traditional LP applications. In addition, it can scale to very large problems because its core operation is matrix-vector multiplication. PDLP is derived by applying the primal-dual hybrid gradient (PDHG) method, popularized by Chambolle and Pock (2011), to a saddle-point formulation of LP. PDLP enhances PDHG for LP by combining several new techniques with older tricks from the literature; the enhancements include diagonal preconditioning, presolving, adaptive step sizes, and adaptive restarting. PDLP improves the state of the art for first-order methods applied to LP. We compare PDLP with SCS, an ADMM-based solver, on a set of 383 LP instances derived from MIPLIB 2017. With a target of $10^{-8}$ relative accuracy and 1 hour time limit, PDLP achieves a 6.3x reduction in the geometric mean of solve times and a 4.6x reduction in the number of instances unsolved (from 227 to 49). Furthermore, we highlight standard benchmark instances and a large-scale application (PageRank) where our open-source prototype of PDLP, written in Julia, outperforms a commercial LP solver.
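To make the underlying iteration concrete, here is a minimal sketch of plain PDHG applied to the saddle-point form of an LP. It is not PDLP: it includes none of the preconditioning, presolving, adaptive step sizes, or restarts mentioned above. The standard form (minimize c·x subject to Ax = b, x ≥ 0), the function name `pdhg_lp`, the fixed step sizes, and the toy instance are all assumptions chosen for illustration.

```python
def pdhg_lp(c, A, b, tau, sigma, iters):
    """Plain PDHG for: minimize c.x subject to A x = b, x >= 0.

    Saddle-point form: min over x >= 0, max over y of c.x + y.(b - A x).
    """
    m, n = len(A), len(c)
    x = [0.0] * n
    y = [0.0] * m
    for _ in range(iters):
        # Primal step: move against c - A^T y, then project onto x >= 0.
        aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
        x_new = [max(0.0, x[j] - tau * (c[j] - aty[j])) for j in range(n)]
        # Dual step: ascent on b - A x, evaluated at the extrapolated point.
        x_bar = [2.0 * x_new[j] - x[j] for j in range(n)]
        ax = [sum(A[i][j] * x_bar[j] for j in range(n)) for i in range(m)]
        y = [y[i] + sigma * (b[i] - ax[i]) for i in range(m)]
        x = x_new
    return x, y

# Toy instance (hypothetical): minimize x1 + 2*x2 s.t. x1 + x2 = 1, x >= 0.
# Step sizes satisfy tau * sigma * ||A||_2^2 = 0.5 < 1.
x, y = pdhg_lp(c=[1.0, 2.0], A=[[1.0, 1.0]], b=[1.0],
               tau=0.5, sigma=0.5, iters=500)
```

On this instance the iterates approach the optimum x = (1, 0) with dual value y = 1. Each iteration costs one product with A and one with its transpose, which is why the abstract describes matrix-vector multiplication as the core operation.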
Submitted 7 January, 2022; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Parallel Graph Algorithms in Constant Adaptive Rounds: Theory meets Practice
Authors:
Soheil Behnezhad,
Laxman Dhulipala,
Hossein Esfandiari,
Jakub Łącki,
Vahab Mirrokni,
Warren Schudy
Abstract:
We study fundamental graph problems such as graph connectivity, minimum spanning forest (MSF), and approximate maximum (weight) matching in a distributed setting. In particular, we focus on the Adaptive Massively Parallel Computation (AMPC) model, which is a theoretical model that captures MapReduce-like computation augmented with a distributed hash table.
We show the first AMPC algorithms for all of the studied problems that run in a constant number of rounds and use only $O(n^ε)$ space per machine, where $0 < ε < 1$. Our results improve both upon the previous results in the AMPC model, as well as the best-known results in the MPC model, which is the theoretical model underpinning many popular distributed computation frameworks, such as MapReduce, Hadoop, Beam, Pregel and Giraph.
Finally, we provide an empirical comparison of the algorithms in the MPC and AMPC models in a fault-tolerant distributed computation environment. We empirically evaluate our algorithms on a set of large real-world graphs and show that our AMPC algorithms can achieve improvements in both running time and round-complexity over optimized MPC baselines.
Submitted 24 September, 2020;
originally announced September 2020.
-
Massively Parallel Computation via Remote Memory Access
Authors:
Soheil Behnezhad,
Laxman Dhulipala,
Hossein Esfandiari,
Jakub Łącki,
Warren Schudy,
Vahab Mirrokni
Abstract:
We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by the previous empirical studies of distributed graph algorithms using MapReduce and a distributed hash table service.
This extension allows us to give new graph algorithms with much lower round complexities compared to the best known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in $O(1)$ rounds and connectivity/minimum spanning tree in $O(\log\log_{m/n} n)$ rounds both using $O(n^δ)$ space per machine for constant $δ < 1$. In the same memory regime for MPC, the best known algorithms for these problems require polylog $n$ rounds. Our results imply that the 2-Cycle conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.
Submitted 18 May, 2019;
originally announced May 2019.
-
Bernstein-like Concentration and Moment Inequalities for Polynomials of Independent Random Variables: Multilinear Case
Authors:
Warren Schudy,
Maxim Sviridenko
Abstract:
We show that the probability that a multilinear polynomial $f$ of independent random variables exceeds its mean by $λ$ is at most $e^{-λ^2 / (R^q Var(f))}$ for sufficiently small $λ$, where $R$ is an absolute constant. This matches (up to constants in the exponent) what one would expect from the central limit theorem. Our methods handle a variety of types of random variables including Gaussian, Boolean, exponential, and Poisson. Previous work by Kim-Vu and Schudy-Sviridenko gave bounds of the same form that involved less natural parameters in place of the variance.
Submitted 8 June, 2012; v1 submitted 23 September, 2011;
originally announced September 2011.
-
Concentration and Moment Inequalities for Polynomials of Independent Random Variables
Authors:
Warren Schudy,
Maxim Sviridenko
Abstract:
In this work we design a general method for proving moment inequalities for polynomials of independent random variables. Our method works for a wide range of random variables including Gaussian, Boolean, exponential, Poisson and many others. We apply our method to derive general concentration inequalities for polynomials of independent random variables. We show that our method implies concentration inequalities for some previously open problems, e.g. the permanent of a random symmetric matrix. We show that our concentration inequality is stronger than the well-known concentration inequality due to Kim and Vu. The main advantage of our method in comparison with the existing ones is the wide range of random variables we can handle and bounds for previously intractable regimes of high degree polynomials and small expectations. On the negative side we show that even for Boolean random variables each term in our concentration inequality is tight.
Submitted 8 June, 2012; v1 submitted 26 April, 2011;
originally announced April 2011.
-
Faster Algorithms for Feedback Arc Set Tournament, Kemeny Rank Aggregation and Betweenness Tournament
Authors:
Marek Karpinski,
Warren Schudy
Abstract:
We study fixed parameter algorithms for three problems: Kemeny rank aggregation, feedback arc set tournament, and betweenness tournament. For Kemeny rank aggregation we give an algorithm with runtime O*(2^O(sqrt{OPT})), where n is the number of candidates, OPT is the cost of the optimal ranking, and O* hides polynomial factors. This is a dramatic improvement on the previously best known runtime of O*(2^O(OPT)). For feedback arc set tournament we give an algorithm with runtime O*(2^O(sqrt{OPT})), an improvement on the previously best known O*(OPT^O(sqrt{OPT})) (Alon, Lokshtanov and Saurabh 2009). For betweenness tournament we give an algorithm with runtime O*(2^O(sqrt{OPT/n})), where n is the number of vertices and OPT is the optimal cost. This improves on the previously known O*(OPT^O(OPT^{1/3})) (Saurabh 2009), especially when OPT is small. Unusually we can solve instances with OPT as large as n (log n)^2 in polynomial time!
Submitted 22 June, 2010;
originally announced June 2010.
-
Online Correlation Clustering
Authors:
Claire Mathieu,
Ocan Sankur,
Warren Schudy
Abstract:
We study the online clustering problem where data items arrive in an online fashion. The algorithm maintains a clustering of data items into similarity classes. Upon arrival of v, the relation between v and previously arrived items is revealed, so that for each u we are told whether v is similar to u. The algorithm can create a new cluster for v and merge existing clusters.
When the objective is to minimize disagreements between the clustering and the input, we prove that a natural greedy algorithm is O(n)-competitive, and this is optimal.
When the objective is to maximize agreements between the clustering and the input, we prove that the greedy algorithm is 0.5-competitive and that no online algorithm can be better than 0.834-competitive. We also prove that it is possible to do better than 1/2, by exhibiting a randomized algorithm with competitive ratio 0.5+c for a small positive fixed constant c.
Submitted 3 February, 2010; v1 submitted 6 January, 2010;
originally announced January 2010.
-
Approximation Schemes for the Betweenness Problem in Tournaments and Related Ranking Problems
Authors:
Marek Karpinski,
Warren Schudy
Abstract:
We design the first polynomial time approximation schemes (PTASs) for the Minimum Betweenness problem in tournaments and some related higher arity ranking problems. This settles the approximation status of the Betweenness problem in tournaments along with other ranking problems which were open for some time now. The results depend on a new technique of dealing with fragile ranking constraints and could be of independent interest.
Submitted 8 July, 2010; v1 submitted 11 November, 2009;
originally announced November 2009.
-
Linear Time Approximation Schemes for the Gale-Berlekamp Game and Related Minimization Problems
Authors:
Marek Karpinski,
Warren Schudy
Abstract:
We design a linear time approximation scheme for the Gale-Berlekamp Switching Game and generalize it to a wider class of dense fragile minimization problems including the Nearest Codeword Problem (NCP) and Unique Games Problem. Further applications include, among other things, finding a constrained form of matrix rigidity and maximum likelihood decoding of an error correcting code. As another application of our method we give the first linear time approximation schemes for correlation clustering with a fixed number of clusters and its hierarchical generalization. Our results depend on a new technique for dealing with small objective function values of optimization problems and could be of independent interest.
Submitted 19 November, 2008;
originally announced November 2008.