Search | arXiv e-print repository

Alchemist: An Apache Spark <=> MPI Interface

Authors: Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Jey Kottalam, Lisa Gerhardt, Prabhat, Michael Ringenburg, Kristyn Maschhoff

Abstract: The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map directly onto this model. One way to mitigate these costs is to off-load computations onto MPI codes. In recent work, we introduced Alchemist, a system for the… ▽ More The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map directly onto this model. One way to mitigate these costs is to off-load computations onto MPI codes. In recent work, we introduced Alchemist, a system for the analysis of large-scale data sets. Alchemist calls MPI-based libraries from within Spark applications, and it has minimal coding, communication, and memory overheads. In particular, Alchemist allows users to retain the productivity benefits of working within the Spark software ecosystem without sacrificing performance efficiency in linear algebra, machine learning, and other related computations. In this paper, we discuss the motivation behind the development of Alchemist, and we provide a detailed overview its design and usage. We also demonstrate the efficiency of our approach on medium-to-large data sets, using some standard linear algebra operations, namely matrix multiplication and the truncated singular value decomposition of a dense matrix, and we compare the performance of Spark with that of Spark+Alchemist. These computations are run on the NERSC supercomputer Cori Phase 1, a Cray XC40. △ Less

Submitted 3 June, 2018; originally announced June 2018.

Comments: Accepted for publication in Concurrency and Computation: Practice and Experience, Special Issue on the Cray User Group 2018. arXiv admin note: text overlap with arXiv:1805.11800

arXiv:1805.11800 [pdf, other]

doi 10.1145/3219819.3219927

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Authors: Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff

Abstract: Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface… ▽ More Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB. △ Less

Submitted 30 May, 2018; originally announced May 2018.

Comments: Accepted for publication in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 2018

arXiv:1803.08021 [pdf, other]

Error Estimation for Randomized Least-Squares Algorithms via the Bootstrap

Authors: Miles E. Lopes, Shusen Wang, Michael W. Mahoney

Abstract: Over the course of the past decade, a variety of randomized algorithms have been proposed for computing approximate least-squares (LS) solutions in large-scale settings. A longstanding practical issue is that, for any given input, the user rarely knows the actual error of an approximate solution (relative to the exact solution). Likewise, it is difficult for the user to know precisely how much com… ▽ More Over the course of the past decade, a variety of randomized algorithms have been proposed for computing approximate least-squares (LS) solutions in large-scale settings. A longstanding practical issue is that, for any given input, the user rarely knows the actual error of an approximate solution (relative to the exact solution). Likewise, it is difficult for the user to know precisely how much computation is needed to achieve the desired error tolerance. Consequently, the user often appeals to worst-case error bounds that tend to offer only qualitative guidance. As a more practical alternative, we propose a bootstrap method to compute a posteriori error estimates for randomized LS algorithms. These estimates permit the user to numerically assess the error of a given solution, and to predict how much work is needed to improve a "preliminary" solution. In addition, we provide theoretical consistency results for the method, which are the first such results in this context (to the best of our knowledge). From a practical standpoint, the method also has considerable flexibility, insofar as it can be applied to several popular sketching algorithms, as well as a variety of error metrics. Moreover, the extra step of error estimation does not add much cost to an underlying sketching algorithm. Finally, we demonstrate the effectiveness of the method with empirical results. △ Less

Submitted 6 September, 2018; v1 submitted 21 March, 2018; originally announced March 2018.

arXiv:1802.09113 [pdf, other]

GPU Accelerated Sub-Sampled Newton's Method

Authors: Sudhir B. Kylasa, Farbod Roosta-Khorasani, Michael W. Mahoney, Ananth Grama

Abstract: First order methods, which solely rely on gradient information, are commonly used in diverse machine learning (ML) and data analysis (DA) applications. This is attributed to the simplicity of their implementations, as well as low per-iteration computational/storage costs. However, they suffer from significant disadvantages; most notably, their performance degrades with increasing problem ill-condi… ▽ More First order methods, which solely rely on gradient information, are commonly used in diverse machine learning (ML) and data analysis (DA) applications. This is attributed to the simplicity of their implementations, as well as low per-iteration computational/storage costs. However, they suffer from significant disadvantages; most notably, their performance degrades with increasing problem ill-conditioning. Furthermore, they often involve a large number of hyper-parameters, and are notoriously sensitive to parameters such as the step-size. By incorporating additional information from the Hessian, second-order methods, have been shown to be resilient to many such adversarial effects. However, these advantages of using curvature information come at the cost of higher per-iteration costs, which in \enquote{big data} regimes, can be computationally prohibitive. In this paper, we show that, contrary to conventional belief, second-order methods, when implemented appropriately, can be more efficient than first-order alternatives in many large-scale ML/ DA applications. In particular, in convex settings, we consider variants of classical Newton\textsf{'}s method in which the Hessian and/or the gradient are randomly sub-sampled. We show that by effectively leveraging the power of GPUs, such randomized Newton-type algorithms can be significantly accelerated, and can easily outperform state of the art implementations of existing techniques in popular ML/ DA software packages such as TensorFlow. Additionally these randomized methods incur a small memory overhead compared to first-order methods. In particular, we show that for million-dimensional problems, our GPU accelerated sub-sampled Newton\textsf{'}s method achieves a higher test accuracy in milliseconds as compared with tens of seconds for first order alternatives. △ Less

Submitted 2 March, 2018; v1 submitted 25 February, 2018; originally announced February 2018.

arXiv:1802.08241 [pdf, other]

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

Authors: Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, Michael W. Mahoney

Abstract: Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze exactly how the landscape of the loss… ▽ More Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hessian spectrum, without approximation, by back-propagating the second derivative. Extensive experiments on multiple networks show that saddle-points are not the cause for generalization gap of large batch size training, and the results consistently show that large batch converges to points with noticeably higher Hessian spectrum. Furthermore, we show that robust training allows one to favor flat areas, as points with large Hessian spectrum show poor robustness to adversarial perturbation. We further study this relationship, and provide empirical and theoretical proof that the inner loop for robust training is a saddle-free optimization problem \textit{almost everywhere}. We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10, and CIFAR-100 datasets. We have open sourced our method which can be accessed at [1]. △ Less

Submitted 2 December, 2018; v1 submitted 22 February, 2018; originally announced February 2018.

Comments: Presented in NeurIPS'18 conference

Journal ref: NeurIPS 2018

arXiv:1802.06925 [pdf, other]

Inexact Non-Convex Newton-Type Methods

Authors: Zhewei Yao, Peng Xu, Farbod Roosta-Khorasani, Michael W. Mahoney

Abstract: For solving large-scale non-convex problems, we propose inexact variants of trust region and adaptive cubic regularization methods, which, to increase efficiency, incorporate various approximations. In particular, in addition to approximate sub-problem solves, both the Hessian and the gradient are suitably approximated. Using rather mild conditions on such approximations, we show that our proposed… ▽ More For solving large-scale non-convex problems, we propose inexact variants of trust region and adaptive cubic regularization methods, which, to increase efficiency, incorporate various approximations. In particular, in addition to approximate sub-problem solves, both the Hessian and the gradient are suitably approximated. Using rather mild conditions on such approximations, we show that our proposed inexact methods achieve similar optimal worst-case iteration complexities as the exact counterparts. Our proposed algorithms, and their respective theoretical analysis, do not require knowledge of any unknowable problem-related quantities, and hence are easily implementable in practice. In the context of finite-sum problems, we then explore randomized sub-sampling methods as ways to construct the gradient and Hessian approximations and examine the empirical performance of our algorithms on some real datasets. △ Less

Submitted 19 February, 2018; originally announced February 2018.

Comments: 36 pages, 2 figures

arXiv:1802.06307 [pdf, other]

Out-of-sample extension of graph adjacency spectral embedding

Authors: Keith Levin, Farbod Roosta-Khorasani, Michael W. Mahoney, Carey E. Priebe

Abstract: Many popular dimensionality reduction procedures have out-of-sample extensions, which allow a practitioner to apply a learned embedding to observations not seen in the initial training sample. In this work, we consider the problem of obtaining an out-of-sample extension for the adjacency spectral embedding, a procedure for embedding the vertices of a graph into Euclidean space. We present two diff… ▽ More Many popular dimensionality reduction procedures have out-of-sample extensions, which allow a practitioner to apply a learned embedding to observations not seen in the initial training sample. In this work, we consider the problem of obtaining an out-of-sample extension for the adjacency spectral embedding, a procedure for embedding the vertices of a graph into Euclidean space. We present two different approaches to this problem, one based on a least-squares objective and the other based on a maximum-likelihood formulation. We show that if the graph of interest is drawn according to a certain latent position model called a random dot product graph, then both of these out-of-sample extensions estimate the true latent position of the out-of-sample vertex with the same error rate. Further, we prove a central limit theorem for the least-squares-based extension, showing that the estimate is asymptotically normal about the truth in the large-graph limit. △ Less

Submitted 17 February, 2018; originally announced February 2018.

arXiv:1712.08880 [pdf, ps, other]

Lectures on Randomized Numerical Linear Algebra

Authors: Petros Drineas, Michael W. Mahoney

Abstract: This chapter is based on lectures on Randomized Numerical Linear Algebra from the 2016 Park City Mathematics Institute summer school on The Mathematics of Data. This chapter is based on lectures on Randomized Numerical Linear Algebra from the 2016 Park City Mathematics Institute summer school on The Mathematics of Data. △ Less

Submitted 24 December, 2017; originally announced December 2017.

Comments: To appear in the edited volume of lectures from the 2016 PCMI summer school

arXiv:1712.06047 [pdf, other]

Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

Authors: Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney

Abstract: Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML). However, the scalability of these optimization methods is inhibited by the cost of communicating and synchronizing processors in a parallel setting. Iterative ML methods are particularly sensitive to communication cost since they often require com… ▽ More Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML). However, the scalability of these optimization methods is inhibited by the cost of communicating and synchronizing processors in a parallel setting. Iterative ML methods are particularly sensitive to communication cost since they often require communication every iteration. In this work, we extend well-known techniques from Communication-Avoiding Krylov subspace methods to first-order, block coordinate descent methods for Support Vector Machines and Proximal Least-Squares problems. Our Synchronization-Avoiding (SA) variants reduce the latency cost by a tunable factor of $s$ at the expense of a factor of $s$ increase in flops and bandwidth costs. We show that the SA-variants are numerically stable and can attain large speedups of up to $5.1\times$ on a Cray XC30 supercomputer. △ Less

Submitted 16 December, 2017; originally announced December 2017.

MSC Class: 68W10; 90C25 ACM Class: G.1.6

arXiv:1712.05855 [pdf, other]

A Berkeley View of Systems Challenges for AI

Authors: Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph E. Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, Pieter Abbeel

Abstract: With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production. These changes have been made possible by unprecedented levels of data and computation, by… ▽ More With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production. These changes have been made possible by unprecedented levels of data and computation, by methodological advances in machine learning, by innovations in systems software and architectures, and by the broad accessibility of these technologies. The next generation of AI systems promises to accelerate these developments and increasingly impact our lives via frequent interactions and making (often mission-critical) decisions on our behalf, often in highly personalized contexts. Realizing this promise, however, raises daunting challenges. In particular, we need AI systems that make timely and safe decisions in unpredictable environments, that are robust against sophisticated adversaries, and that can process ever increasing amounts of data across organizations and individuals without compromising confidentiality. These challenges will be exacerbated by the end of the Moore's Law, which will constrain the amount of data these technologies can store and process. In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI's potential to improve lives and society. △ Less

Submitted 15 December, 2017; originally announced December 2017.

Comments: Berkeley Technical Report

Report number: EECS-2017-159

arXiv:1710.09553 [pdf, other]

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior

Authors: Charles H. Martin, Michael W. Mahoney

Abstract: We describe an approach to understand the peculiar and counterintuitive generalization properties of deep neural networks. The approach involves going beyond worst-case theoretical capacity control frameworks that have been popular in machine learning in recent years to revisit old ideas in the statistical mechanics of neural networks. Within this approach, we present a prototypical Very Simple De… ▽ More We describe an approach to understand the peculiar and counterintuitive generalization properties of deep neural networks. The approach involves going beyond worst-case theoretical capacity control frameworks that have been popular in machine learning in recent years to revisit old ideas in the statistical mechanics of neural networks. Within this approach, we present a prototypical Very Simple Deep Learning (VSDL) model, whose behavior is controlled by two control parameters, one describing an effective amount of data, or load, on the network (that decreases when noise is added to the input), and one with an effective temperature interpretation (that increases when algorithms are early stopped). Using this model, we describe how a very simple application of ideas from the statistical mechanics theory of generalization provides a strong qualitative description of recently-observed empirical results regarding the inability of deep neural networks not to overfit training data, discontinuous learning and sharp transitions in the generalization properties of learning algorithms, etc. △ Less

Submitted 17 February, 2019; v1 submitted 26 October, 2017; originally announced October 2017.

Comments: 31 pages; added brief discussion of recent papers that use/extend these ideas

arXiv:1710.06520 [pdf, other]

LASAGNE: Locality And Structure Aware Graph Node Embedding

Authors: Evgeniy Faerman, Felix Borutta, Kimon Fountoulakis, Michael W. Mahoney

Abstract: In this work we propose Lasagne, a methodology to learn locality and structure aware graph node embeddings in an unsupervised way. In particular, we show that the performance of existing random-walk based approaches depends strongly on the structural properties of the graph, e.g., the size of the graph, whether the graph has a flat or upward-slo** Network Community Profile (NCP), whether the gra… ▽ More In this work we propose Lasagne, a methodology to learn locality and structure aware graph node embeddings in an unsupervised way. In particular, we show that the performance of existing random-walk based approaches depends strongly on the structural properties of the graph, e.g., the size of the graph, whether the graph has a flat or upward-slo** Network Community Profile (NCP), whether the graph is expander-like, whether the classes of interest are more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that are strongly expander-like, existing methods lead to random walks that expand rapidly, touching many dissimilar nodes, thereby leading to lower-quality vector representations that are less useful for downstream tasks. Rather than relying on global random walks or neighbors within fixed hop distances, Lasagne exploits strongly local Approximate Personalized PageRank stationary distributions to more precisely engineer local information into node embeddings. This leads, in particular, to more meaningful and more useful vector representations of nodes in poorly-structured graphs. We show that Lasagne leads to significant improvement in downstream multi-label classification for larger graphs with flat NCPs, that it is comparable for smaller graphs with upward-slo** NCPs, and that is comparable to existing methods for link prediction tasks. △ Less

Submitted 17 October, 2017; originally announced October 2017.

arXiv:1709.03528 [pdf, other]

GIANT: Globally Improved Approximate Newton Method for Distributed Optimization

Authors: Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, Michael W. Mahoney

Abstract: For distributed computing environment, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver, then, averages all the ANT directions received from workers to form a {\it Globally Im… ▽ More For distributed computing environment, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver, then, averages all the ANT directions received from workers to form a {\it Globally Improved ANT} (GIANT) direction. GIANT is highly communication efficient and naturally exploits the trade-offs between local computations and global communications in that more local computations result in fewer overall rounds of communications. Theoretically, we show that GIANT enjoys an improved convergence rate as compared with first-order methods and existing distributed Newton-type methods. Further, and in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, a highly advantageous practical feature of GIANT is that it only involves one tuning parameter. We conduct large-scale experiments on a computer cluster and, empirically, demonstrate the superior performance of GIANT. △ Less

Submitted 11 September, 2018; v1 submitted 11 September, 2017; originally announced September 2017.

Comments: Fixed some typos. Improved writing

arXiv:1708.07827 [pdf, other]

Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study

Authors: Peng Xu, Farbod Roosta-Khorasani, Michael W. Mahoney

Abstract: While first-order optimization methods such as stochastic gradient descent (SGD) are popular in machine learning (ML), they come with well-known deficiencies, including relatively-slow convergence, sensitivity to the settings of hyper-parameters such as learning rate, stagnation at high training errors, and difficulty in esca** flat regions and saddle points. These issues are particularly acute… ▽ More While first-order optimization methods such as stochastic gradient descent (SGD) are popular in machine learning (ML), they come with well-known deficiencies, including relatively-slow convergence, sensitivity to the settings of hyper-parameters such as learning rate, stagnation at high training errors, and difficulty in esca** flat regions and saddle points. These issues are particularly acute in highly non-convex settings such as those arising in neural networks. Motivated by this, there has been recent interest in second-order methods that aim to alleviate these shortcomings by capturing curvature information. In this paper, we report detailed empirical evaluations of a class of Newton-type methods, namely sub-sampled variants of trust region (TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex ML problems. In doing so, we demonstrate that these methods not only can be computationally competitive with hand-tuned SGD with momentum, obtaining comparable or better generalization performance, but also they are highly robust to hyper-parameter settings. Further, in contrast to SGD with momentum, we show that the manner in which these Newton-type methods employ curvature information allows them to seamlessly escape flat regions and saddle points. △ Less

Submitted 15 February, 2018; v1 submitted 24 August, 2017; originally announced August 2017.

Comments: 21 pages, 11 figures. Restructure the paper and add experiments

arXiv:1708.07164 [pdf, ps, other]

Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information

Authors: Peng Xu, Fred Roosta, Michael W. Mahoney

Abstract: We consider variants of trust-region and cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under mild conditions on the inexact Hessian, and using approximate solution of the corresponding sub-problems, we provide iteration complexity to achieve $ ε$-approximate second-order optimality which have shown to be tight. Our Hessian approximation cond… ▽ More We consider variants of trust-region and cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under mild conditions on the inexact Hessian, and using approximate solution of the corresponding sub-problems, we provide iteration complexity to achieve $ ε$-approximate second-order optimality which have shown to be tight. Our Hessian approximation conditions constitute a major relaxation over the existing ones in the literature. Consequently, we are able to show that such mild conditions allow for the construction of the approximate Hessian through various random sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and cubic regularization methods. △ Less

Submitted 14 May, 2019; v1 submitted 23 August, 2017; originally announced August 2017.

Comments: 32 pages

Journal ref: Mathematical Programming 2019

arXiv:1708.01945 [pdf, other]

A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication

Authors: Miles E. Lopes, Shusen Wang, Michael W. Mahoney

Abstract: In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and a… ▽ More In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical relationship between cost and accuracy is typically unknown, and consequently, it may be difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much computation is needed to achieve a given level of accuracy. In the current paper, we study randomized matrix multiplication (sketching) as a prototype setting for addressing these general problems. As a solution, we develop a bootstrap method for \emph{directly estimating} the accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds on the accuracy in terms of the reduced dimension). From a computational standpoint, the proposed method does not substantially increase the cost of standard sketching methods, and this is made possible by an "extrapolation" technique. In addition, we provide both theoretical and empirical results to demonstrate the effectiveness of the proposed method. △ Less

Submitted 3 April, 2019; v1 submitted 6 August, 2017; originally announced August 2017.

Journal ref: Journal of Machine Learning Research, 20(39): 1-40, 2019

arXiv:1706.05826 [pdf, other]

Capacity Releasing Diffusion for Speed and Locality

Authors: Di Wang, Kimon Fountoulakis, Monika Henzinger, Michael W. Mahoney, Satish Rao

Abstract: Diffusions and related random walk procedures are of central importance in many areas of machine learning, data analysis, and applied mathematics. Because they spread mass agnostically at each step in an iterative manner, they can sometimes spread mass "too aggressively," thereby failing to find the "right" clusters. We introduce a novel Capacity Releasing Diffusion (CRD) Process, which is both fa… ▽ More Diffusions and related random walk procedures are of central importance in many areas of machine learning, data analysis, and applied mathematics. Because they spread mass agnostically at each step in an iterative manner, they can sometimes spread mass "too aggressively," thereby failing to find the "right" clusters. We introduce a novel Capacity Releasing Diffusion (CRD) Process, which is both faster and stays more local than the classical spectral diffusion process. As an application, we use our CRD Process to develop an improved local algorithm for graph clustering. Our local graph clustering method can find local clusters in a model of clustering where one begins the CRD Process in a cluster whose vertices are connected better internally than externally by an $O(\log^2 n)$ factor, where $n$ is the number of nodes in the cluster. Thus, our CRD Process is the first local graph clustering algorithm that is not subject to the well-known quadratic Cheeger barrier. Our result requires a certain smoothness condition, which we expect to be an artifact of our analysis. Our empirical evaluation demonstrates improved results, in particular for realistic social graphs where there are moderately good---but not very good---clusters. △ Less

Submitted 10 June, 2018; v1 submitted 19 June, 2017; originally announced June 2017.

Comments: Appeared in ICML 2017. Current version added reference and discussion of work on generalized Cheeger's inequalities

arXiv:1706.02803 [pdf, other]

Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds

Authors: Shusen Wang, Alex Gittens, Michael W. Mahoney

Abstract: Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nyström method, has been applied in previous works to a… ▽ More Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nyström method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}ε (1 + o(1))$ features constructed using a so-called rank-restricted Nyström approximation results in cluster assignments that satisfy a $1 + ε$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nyström method. As part of the analysis, this work establishes a novel $1 + ε$ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nyström approximation. Empirical evaluations on the $8.1$ million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel $k$-means clustering with Nyström approximation. This work argues that spectral clustering using Nyström approximation---a popular and computationally efficient, but theoretically unsound approach to non-linear clustering---should be replaced with the efficient and theoretically sound combination of kernel $k$-means clustering with Nyström approximation. The superior performance of the latter approach is empirically verified. △ Less

Submitted 10 February, 2019; v1 submitted 8 June, 2017; originally announced June 2017.

Journal ref: Journal of Machine Learning Research 20 (2019) 1-49

arXiv:1705.07585 [pdf, other]

Union of Intersections (UoI) for Interpretable Data Driven Discovery and Prediction

Authors: Kristofer E. Bouchard, Alejandro F. Bujan, Farbod Roosta-Khorasani, Shashanka Ubaru, Prabhat, Antoine M. Snijders, Jian-Hua Mao, Edward F. Chang, Michael W. Mahoney, Sharmodeep Bhattacharyya

Abstract: The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce Union of Intersections (UoI), a flexible, modular, and scalable framework for enhanced model selection and estimation. Meth… ▽ More The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce Union of Intersections (UoI), a flexible, modular, and scalable framework for enhanced model selection and estimation. Methods based on UoI perform model selection and model estimation through intersection and union operations, respectively. We show that UoI-based methods achieve low-variance and nearly unbiased estimation of a small number of interpretable features, while maintaining high-quality prediction accuracy. We perform extensive numerical investigation to evaluate a UoI algorithm ($UoI_{Lasso}$) on synthetic and real data. In doing so, we demonstrate the extraction of interpretable functional networks from human electrophysiology recordings as well as accurate prediction of phenotypes from genotype-phenotype data with reduced features. We also show (with the $UoI_{L1Logistic}$ and $UoI_{CUR}$ variants of the basic framework) improved prediction parsimony for classification and matrix factorization on several benchmark biomedical data sets. These results suggest that methods based on the UoI framework could improve interpretation and prediction in data-driven discovery across scientific fields. △ Less

Submitted 2 November, 2017; v1 submitted 22 May, 2017; originally announced May 2017.

Comments: 42 pages; a conference version is in NIPS 2017

arXiv:1704.04519 [pdf, ps, other]

doi 10.2140/involve.2019.12.941

Orbit Spaces of Linear Circle Actions

Authors: Suzanne Craig, Naiche Downey, Lucas Goad, Michael J. Mahoney, Jordan Watts

Abstract: In this paper, it is shown that non-isomorphic effective linear circle actions yield non-diffeomorphic differential structures on the corresponding orbit spaces. In this paper, it is shown that non-isomorphic effective linear circle actions yield non-diffeomorphic differential structures on the corresponding orbit spaces. △ Less

Submitted 1 April, 2019; v1 submitted 14 April, 2017; originally announced April 2017.

Comments: 17 pages. This paper is the result of a Summer 2016 REU (undergraduate research) project at the University of Colorado Boulder. [V2 update: the earlier version contained material not dealt with during the REU, which will be published separately.] To appear in Involve

Journal ref: Involve 12 (2019) 941-959

arXiv:1703.07520 [pdf, other]

Social Discrete Choice Models

Authors: Danqing Zhang, Kimon Fountoulakis, Junyu Cao, Michael Mahoney, Alexei Pozdnoukhov

Abstract: Human decision making underlies data generating process in multiple application areas, and models explaining and predicting choices made by individuals are in high demand. Discrete choice models are widely studied in economics and computational social sciences. As digital social networking facilitates information flow and spread of influence between individuals, new advances in modeling are needed… ▽ More Human decision making underlies data generating process in multiple application areas, and models explaining and predicting choices made by individuals are in high demand. Discrete choice models are widely studied in economics and computational social sciences. As digital social networking facilitates information flow and spread of influence between individuals, new advances in modeling are needed to incorporate social information into these models in addition to characteristic features affecting individual choices. In this paper, we propose two novel models with scalable training algorithms: local logistics graph regularization (LLGR) and latent class graph regularization (LCGR) models. We add social regularization to represent similarity between friends, and we introduce latent classes to account for possible preference discrepancies between different social groups. Training of the LLGR model is performed using alternating direction method of multipliers (ADMM), and training of the LCGR model is performed using a specialized Monte Carlo expectation maximization (MCEM) algorithm. Scalability to large graphs is achieved by parallelizing computation in both the expectation and the maximization steps. The LCGR model is the first latent class classification model that incorporates social relationships among individuals represented by a given graph. To evaluate our two models, we consider three classes of data to illustrate a typical large-scale use case in internet and social media applications. We experiment on synthetic datasets to empirically explain when the proposed model is better than vanilla classification models that do not exploit graph structure. We also experiment on real-world data, including both small scale and large scale real-world datasets, to demonstrate on which types of datasets our model can be expected to outperform state-of-the-art models. △ Less

Submitted 2 November, 2017; v1 submitted 22 March, 2017; originally announced March 2017.

arXiv:1702.04837 [pdf, other]

Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging

Authors: Shusen Wang, Alex Gittens, Michael W. Mahoney

Abstract: We address the statistical and optimization impacts of the classical sketch and Hessian sketch used to approximately solve the Matrix Ridge Regression (MRR) problem. Prior research has quantified the effects of classical sketch on the strictly simpler least squares regression (LSR) problem. We establish that classical sketch has a similar effect upon the optimization properties of MRR as it does o… ▽ More We address the statistical and optimization impacts of the classical sketch and Hessian sketch used to approximately solve the Matrix Ridge Regression (MRR) problem. Prior research has quantified the effects of classical sketch on the strictly simpler least squares regression (LSR) problem. We establish that classical sketch has a similar effect upon the optimization properties of MRR as it does on those of LSR: namely, it recovers nearly optimal solutions. By contrast, Hessian sketch does not have this guarantee, instead, the approximation error is governed by a subtle interplay between the "mass" in the responses and the optimal objective value. For both types of approximation, the regularization in the sketched MRR problem results in significantly different statistical properties from those of the sketched LSR problem. In particular, there is a bias-variance trade-off in sketched MRR that is not present in sketched LSR. We provide upper and lower bounds on the bias and variance of sketched MRR, these bounds show that classical sketch significantly increases the variance, while Hessian sketch significantly increases the bias. Empirically, sketched MRR solutions can have risks that are higher by an order-of-magnitude than those of the optimal MRR solutions. We establish theoretically and empirically that model averaging greatly decreases the gap between the risks of the true and sketched solutions to the MRR problem. Thus, in parallel or distributed settings, sketching combined with model averaging is a powerful technique that quickly obtains near-optimal solutions to the MRR problem while greatly mitigating the increased statistical risk incurred by sketching. △ Less

Submitted 5 May, 2018; v1 submitted 15 February, 2017; originally announced February 2017.

Comments: To appear in Journal of Machine Learning Research, 2018. A short version has appeared in International Conference on Machine Learning (ICML), 2017

Journal ref: Journal of Machine Learning Research, 19, pp1-50, 2018

arXiv:1612.04003 [pdf, other]

Avoiding communication in primal and dual block coordinate descent methods

Authors: Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney

Abstract: Primal and dual block coordinate descent methods are iterative methods for solving regularized and unregularized optimization problems. Distributed-memory parallel implementations of these methods have become popular in analyzing large machine learning datasets. However, existing implementations communicate at every iteration which, on modern data center and supercomputing architectures, often dom… ▽ More Primal and dual block coordinate descent methods are iterative methods for solving regularized and unregularized optimization problems. Distributed-memory parallel implementations of these methods have become popular in analyzing large machine learning datasets. However, existing implementations communicate at every iteration which, on modern data center and supercomputing architectures, often dominates the cost of floating-point computation. Recent results on communication-avoiding Krylov subspace methods suggest that large speedups are possible by re-organizing iterative algorithms to avoid communication. We show how applying similar algorithmic transformations can lead to primal and dual block coordinate descent methods that only communicate every $s$ iterations--where $s$ is a tuning parameter--instead of every iteration for the \textit{regularized least-squares problem}. We show that the communication-avoiding variants reduce the number of synchronizations by a factor of $s$ on distributed-memory parallel machines without altering the convergence rate and attains strong scaling speedups of up to $6.1\times$ on a Cray XC30 supercomputer. △ Less

Submitted 1 May, 2017; v1 submitted 12 December, 2016; originally announced December 2016.

MSC Class: 68W10; 65F10 ACM Class: G.1.0; G.1.3; G.1.6

arXiv:1609.03932 [pdf, other]

doi 10.3847/0004-637X/833/1/26

Map** the Similarities of Spectra: Global and Locally-biased Approaches to SDSS Galaxy Data

Authors: David Lawlor, Tamás Budavári, Michael W. Mahoney

Abstract: We apply a novel spectral graph technique, that of locally-biased semi-supervised eigenvectors, to study the diversity of galaxies. This technique permits us to characterize empirically the natural variations in observed spectra data, and we illustrate how this approach can be used in an exploratory manner to highlight both large-scale global as well as small-scale local structure in Sloan Digital… ▽ More We apply a novel spectral graph technique, that of locally-biased semi-supervised eigenvectors, to study the diversity of galaxies. This technique permits us to characterize empirically the natural variations in observed spectra data, and we illustrate how this approach can be used in an exploratory manner to highlight both large-scale global as well as small-scale local structure in Sloan Digital Sky Survey (SDSS) data. We use this method in a way that simultaneously takes into account the measurements of spectral lines as well as the continuum shape. Unlike Principal Component Analysis, this method does not assume that the Euclidean distance between galaxy spectra is a good global measure of similarity between all spectra, but instead it only assumes that local difference information between similar spectra is reliable. Moreover, unlike other nonlinear dimensionality methods, this method can be used to characterize very finely both small-scale local as well as large-scale global properties of realistic noisy data. The power of the method is demonstrated on the SDSS Main Galaxy Sample by illustrating that the derived embeddings of spectra carry an unprecedented amount of information. By using a straightforward global or unsupervised variant, we observe that the main features correlate strongly with star formation rate and that they clearly separate active galactic nuclei. Computed parameters of the method can be used to describe line strengths and their interdependencies. By using a locally-biased or semi-supervised variant, we are able to focus on typical variations around specific objects of astronomical interest. We present several examples illustrating that this approach can enable new discoveries in the data as well as a detailed understanding of very fine local structure that would otherwise be overwhelmed by large-scale noise and global trends in the data. △ Less

Submitted 13 September, 2016; originally announced September 2016.

Comments: 34 pages. A modified version of this paper has been accepted to The Astrophysical Journal

arXiv:1608.04845 [pdf, ps, other]

Lecture Notes on Spectral Graph Methods

Authors: Michael W. Mahoney

Abstract: These are lecture notes that are based on the lectures from a class I taught on the topic of Spectral Graph Methods at UC Berkeley during the Spring 2015 semester. These are lecture notes that are based on the lectures from a class I taught on the topic of Spectral Graph Methods at UC Berkeley during the Spring 2015 semester. △ Less

Submitted 16 August, 2016; originally announced August 2016.

Comments: 257 pages

arXiv:1608.04481 [pdf, ps, other]

Lecture Notes on Randomized Linear Algebra

Authors: Michael W. Mahoney

Abstract: These are lecture notes that are based on the lectures from a class I taught on the topic of Randomized Linear Algebra (RLA) at UC Berkeley during the Fall 2013 semester. These are lecture notes that are based on the lectures from a class I taught on the topic of Randomized Linear Algebra (RLA) at UC Berkeley during the Fall 2013 semester. △ Less

Submitted 16 August, 2016; originally announced August 2016.

Comments: 188 pages

arXiv:1607.04940 [pdf, other]

An optimization approach to locally-biased graph algorithms

Authors: Kimon Fountoulakis, David Gleich, Michael Mahoney

Abstract: Locally-biased graph algorithms are algorithms that attempt to find local or small-scale structure in a large data graph. In some cases, this can be accomplished by adding some sort of locality constraint and calling a traditional graph algorithm; but more interesting are locally-biased graph algorithms that compute answers by running a procedure that does not even look at most of the input graph.… ▽ More Locally-biased graph algorithms are algorithms that attempt to find local or small-scale structure in a large data graph. In some cases, this can be accomplished by adding some sort of locality constraint and calling a traditional graph algorithm; but more interesting are locally-biased graph algorithms that compute answers by running a procedure that does not even look at most of the input graph. This corresponds more closely to what practitioners from various data science domains do, but it does not correspond well with the way that algorithmic and statistical theory is typically formulated. Recent work from several research communities has focused on develo** locally-biased graph algorithms that come with strong complementary algorithmic and statistical theory and that are useful in practice in downstream data science applications. We provide a review and overview of this work, highlighting commonalities between seemingly-different approaches, and highlighting promising directions for future work. △ Less

Submitted 4 December, 2016; v1 submitted 17 July, 2016; originally announced July 2016.

Comments: 19 pages, 13 figures

arXiv:1607.04378 [pdf, other]

DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection

Authors: Li** **g, Bo Liu, Jaeyoung Choi, Adam Janin, Julia Bernd, Michael W. Mahoney, Gerald Friedland

Abstract: This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes… ▽ More This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on Grassmannian manifolds, which we found represents the structure of audio effectively. Our experiments used the YLI-MED dataset (an open TRECVID-style video corpus based on YFCC100M), which includes ten events. The results show that the proposed DCAR representation consistently outperforms state-of-the-art audio representations. DCAR's advantage over i-vector, mv-vector, and GMM representations is significant for both easier and harder discrimination tasks. We discuss how these performance differences across easy and hard cases follow from how each type of model leverages (or doesn't leverage) the intrinsic structure of the data. Furthermore, DCAR shows a particularly notable accuracy advantage on events where humans have more difficulty classifying the videos, i.e., events with lower mean annotator confidence. △ Less

Submitted 15 July, 2016; originally announced July 2016.

Comments: An abbreviated version of this paper will be published in ACM Multimedia 2016

ACM Class: H.5.1

arXiv:1607.01335 [pdf, other]

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Authors: Alex Gittens, Aditya Devarakonda, Evan Racah, Michael Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

Abstract: We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausability), PCA (for its ubiquity… ▽ More We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausability), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall-and-skinny which enable the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance. △ Less

Submitted 20 September, 2016; v1 submitted 5 July, 2016; originally announced July 2016.

ACM Class: G.1.3; C.2.4

arXiv:1607.00559 [pdf, ps, other]

Sub-sampled Newton Methods with Non-uniform Sampling

Authors: Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, Michael W. Mahoney

Abstract: We consider the problem of finding the minimizer of a convex function $F: \mathbb R^d \rightarrow \mathbb R$ of the form $F(w) := \sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $\nabla^2 f_i(w)$ is readily available. We consider the regime where $n \gg d$. As second-order methods prove to be effective in finding the minimizer to a high-precision, in this work, we propose randomized… ▽ More We consider the problem of finding the minimizer of a convex function $F: \mathbb R^d \rightarrow \mathbb R$ of the form $F(w) := \sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $\nabla^2 f_i(w)$ is readily available. We consider the regime where $n \gg d$. As second-order methods prove to be effective in finding the minimizer to a high-precision, in this work, we propose randomized Newton-type algorithms that exploit \textit{non-uniform} sub-sampling of $\{\nabla^2 f_i(w)\}_{i=1}^{n}$, as well as inexact updates, as means to reduce the computational complexity. Two non-uniform sampling distributions based on {\it block norm squares} and {\it block partial leverage scores} are considered in order to capture important terms among $\{\nabla^2 f_i(w)\}_{i=1}^{n}$. We show that at each iteration non-uniformly sampling at most $\mathcal O(d \log d)$ terms from $\{\nabla^2 f_i(w)\}_{i=1}^{n}$ is sufficient to achieve a linear-quadratic convergence rate in $w$ when a suitable initial point is provided. In addition, we show that our algorithms achieve a lower computational complexity and exhibit more robustness and better dependence on problem specific quantities, such as the condition number, compared to similar existing methods, especially the ones based on uniform sampling. Finally, we empirically demonstrate that our methods are at least twice as fast as Newton's methods with ridge logistic regression on several real datasets. △ Less

Submitted 5 July, 2016; v1 submitted 2 July, 2016; originally announced July 2016.

Comments: minor fix on v1

arXiv:1605.08490 [pdf, other]

A Simple and Strongly-Local Flow-Based Method for Cut Improvement

Authors: Nate Veldt, David F. Gleich, Michael W. Mahoney

Abstract: Many graph-based learning problems can be cast as finding a good set of vertices nearby a seed set, and a powerful methodology for these problems is based on maximum flows. We introduce and analyze a new method for locally-biased graph-based learning called SimpleLocal, which finds good conductance cuts near a set of seed vertices. An important feature of our algorithm is that it is strongly-local… ▽ More Many graph-based learning problems can be cast as finding a good set of vertices nearby a seed set, and a powerful methodology for these problems is based on maximum flows. We introduce and analyze a new method for locally-biased graph-based learning called SimpleLocal, which finds good conductance cuts near a set of seed vertices. An important feature of our algorithm is that it is strongly-local, meaning it does not need to explore the entire graph to find cuts that are locally optimal. This method solves the same objective as existing strongly-local flow-based methods, but it enables a simple implementation. We also show how it achieves localization through an implicit L1-norm penalty term. As a flow-based method, our algorithm exhibits several ad- vantages in terms of cut optimality and accurate identification of target regions in a graph. We demonstrate the power of SimpleLocal by solving problems on a 467 million edge graph based on an MRI scan. △ Less

Submitted 26 May, 2016; originally announced May 2016.

arXiv:1605.08108 [pdf, other]

FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods

Authors: Xiang Cheng, Farbod Roosta-Khorasani, Stefan Palombo, Peter L. Bartlett, Michael W. Mahoney

Abstract: We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex op… ▽ More We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications. △ Less

Submitted 11 November, 2017; v1 submitted 25 May, 2016; originally announced May 2016.

arXiv:1604.07515 [pdf, other]

Parallel Local Graph Clustering

Authors: Julian Shun, Farbod Roosta-Khorasani, Kimon Fountoulakis, Michael W. Mahoney

Abstract: Graph clustering has many important applications in computing, but due to growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest due to the fact that they can find good clusters in… ▽ More Graph clustering has many important applications in computing, but due to growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest due to the fact that they can find good clusters in a graph with work proportional to the size of the cluster rather than that of the entire graph. This feature has proven to be crucial in making such graph clustering and many of its downstream applications efficient in practice. While local clustering algorithms are already faster than traditional algorithms that touch the entire graph, they are sequential and there is an opportunity to make them even more efficient via parallelization. In this paper, we show how to parallelize many of these algorithms in the shared-memory multicore setting, and we analyze the parallel complexity of these algorithms. We present comprehensive experiments on large-scale graphs showing that our parallel algorithms achieve good parallel speedups on a modern multicore machine, thus significantly speeding up the analysis of local graph clusters in the very large-scale setting. △ Less

Submitted 8 June, 2019; v1 submitted 26 April, 2016; originally announced April 2016.

Comments: Fixed typo in Figure 5

arXiv:1602.01886 [pdf, other]

Variational Perspective on Local Graph Clustering

Authors: Kimon Fountoulakis, Farbod Roosta-Khorasan, Julian Shun, Xiang Cheng, Michael W. Mahoney

Abstract: Modern graph clustering applications require the analysis of large graphs and this can be computationally expensive. In this regard, local spectral graph clustering methods aim to identify well-connected clusters around a given "seed set" of reference nodes without accessing the entire graph. The celebrated Approximate Personalized PageRank (APPR) algorithm in the seminal paper by Andersen et al.… ▽ More Modern graph clustering applications require the analysis of large graphs and this can be computationally expensive. In this regard, local spectral graph clustering methods aim to identify well-connected clusters around a given "seed set" of reference nodes without accessing the entire graph. The celebrated Approximate Personalized PageRank (APPR) algorithm in the seminal paper by Andersen et al. is one such method. APPR was introduced and motivated purely from an algorithmic perspective. In other words, there is no a priori notion of objective function/optimality conditions that characterizes the steps taken by APPR. Here, we derive a novel variational formulation which makes explicit the actual optimization problem solved by APPR. In doing so, we draw connections between the local spectral algorithm of and an iterative shrinkage-thresholding algorithm (ISTA). In particular, we show that, appropriately initialized ISTA applied to our variational formulation can recover the sought-after local cluster in a time that only depends on the number of non-zeros of the optimal solution instead of the entire graph. In the process, we show that an optimization algorithm which apparently requires accessing the entire graph, can be made to behave in a completely local manner by accessing only a small number of nodes. This viewpoint builds a bridge across two seemingly disjoint fields of graph processing and numerical optimization, and it allows one to leverage well-studied, numerically robust, and efficient optimization algorithms for processing today's large graphs. △ Less

Submitted 6 December, 2017; v1 submitted 4 February, 2016; originally announced February 2016.

Comments: The title changed from "Exploiting Optimization for Local Graph Clustering". The abstract and introduction are written in a variational theme. Motivation and background for local graph clustering is provided. We bound the volume of the support of the optimal solution of the l1-regularized PageRank problem. This result is used to bound running time for iterative shrinkage-thresholding method

arXiv:1601.04738 [pdf, ps, other]

Sub-Sampled Newton Methods II: Local Convergence Rates

Authors: Farbod Roosta-Khorasani, Michael W. Mahoney

Abstract: Many data-fitting applications require the solution of an optimization problem involving a sum of large number of functions of high dimensional parameter. Here, we consider the problem of minimizing a sum of $n$ functions over a convex constraint set $\mathcal{X} \subseteq \mathbb{R}^{p}$ where both $n$ and $p$ are large. In such problems, sub-sampling as a way to reduce $n$ can offer great amount… ▽ More Many data-fitting applications require the solution of an optimization problem involving a sum of large number of functions of high dimensional parameter. Here, we consider the problem of minimizing a sum of $n$ functions over a convex constraint set $\mathcal{X} \subseteq \mathbb{R}^{p}$ where both $n$ and $p$ are large. In such problems, sub-sampling as a way to reduce $n$ can offer great amount of computational efficiency. Within the context of second order methods, we first give quantitative local convergence results for variants of Newton's method where the Hessian is uniformly sub-sampled. Using random matrix concentration inequalities, one can sub-sample in a way that the curvature information is preserved. Using such sub-sampling strategy, we establish locally Q-linear and Q-superlinear convergence rates. We also give additional convergence results for when the sub-sampled Hessian is regularized by modifying its spectrum or Levenberg-type regularization. Finally, in addition to Hessian sub-sampling, we consider sub-sampling the gradient as way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra (RandNLA) to obtain the proper sampling strategy and we establish locally R-linear convergence rates. In such a setting, we also show that a very aggressive sample size increase results in a R-superlinearly convergent algorithm. While the sample size depends on the condition number of the problem, our convergence rates are problem-independent, i.e., they do not depend on the quantities related to the problem. Hence, our analysis here can be used to complement the results of our basic framework from the companion paper, [38], by exploring algorithmic trade-offs that are important in practice. △ Less

Submitted 25 February, 2016; v1 submitted 18 January, 2016; originally announced January 2016.

arXiv:1601.04737 [pdf, other]

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Authors: Farbod Roosta-Khorasani, Michael W. Mahoney

Abstract: Large scale optimization problems are ubiquitous in machine learning and data analysis and there is a plethora of algorithms for solving such problems. Many of these algorithms employ sub-sampling, as a way to either speed up the computations and/or to implicitly implement a form of statistical regularization. In this paper, we consider second-order iterative optimization algorithms and we provide… ▽ More Large scale optimization problems are ubiquitous in machine learning and data analysis and there is a plethora of algorithms for solving such problems. Many of these algorithms employ sub-sampling, as a way to either speed up the computations and/or to implicitly implement a form of statistical regularization. In this paper, we consider second-order iterative optimization algorithms and we provide bounds on the convergence of the variants of Newton's method that incorporate uniform sub-sampling as a means to estimate the gradient and/or Hessian. Our bounds are non-asymptotic and quantitative. Our algorithms are global and are guaranteed to converge from any initial iterate. Using random matrix concentration inequalities, one can sub-sample the Hessian to preserve the curvature information. Our first algorithm incorporates Hessian sub-sampling while using the full gradient. We also give additional convergence results for when the sub-sampled Hessian is regularized by modifying its spectrum or ridge-type regularization. Next, in addition to Hessian sub-sampling, we also consider sub-sampling the gradient as a way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra to obtain the proper sampling strategy. In all these algorithms, computing the update boils down to solving a large scale linear system, which can be computationally expensive. As a remedy, for all of our algorithms, we also give global convergence results for the case of inexact updates where such linear system is solved only approximately. This paper has a more advanced companion paper, [42], in which we demonstrate that, by doing a finer-grained analysis, we can get problem-independent bounds for local convergence of these algorithms and explore trade-offs to improve upon the basic results of the present paper. △ Less

Submitted 25 February, 2016; v1 submitted 18 January, 2016; originally announced January 2016.

arXiv:1511.06468 [pdf, ps, other]

Faster Parallel Solver for Positive Linear Programs via Dynamically-Bucketed Selective Coordinate Descent

Authors: Di Wang, Michael Mahoney, Nishanth Mohan, Satish Rao

Abstract: We provide improved parallel approximation algorithms for the important class of packing and covering linear programs. In particular, we present new parallel $ε$-approximate packing and covering solvers which run in $\tilde{O}(1/ε^2)$ expected time, i.e., in expectation they take $\tilde{O}(1/ε^2)$ iterations and they do $\tilde{O}(N/ε^2)$ total work, where $N$ is the size of the constraint matrix… ▽ More We provide improved parallel approximation algorithms for the important class of packing and covering linear programs. In particular, we present new parallel $ε$-approximate packing and covering solvers which run in $\tilde{O}(1/ε^2)$ expected time, i.e., in expectation they take $\tilde{O}(1/ε^2)$ iterations and they do $\tilde{O}(N/ε^2)$ total work, where $N$ is the size of the constraint matrix and $ε$ is the error parameter, and where the $\tilde{O}$ hides logarithmic factors. To achieve our improvement, we introduce an algorithmic technique of broader interest: dynamically-bucketed selective coordinate descent (DB-SCD). At each step of the iterative optimization algorithm, the DB-SCD method dynamically buckets the coordinates of the gradient into those of roughly equal magnitude, and it updates all the coordinates in one of the buckets. This dynamically-bucketed updating permits us to take steps along several coordinates with similar-sized gradients, thereby permitting more appropriate step sizes at each step of the algorithm. In particular, this technique allows us to use in a straightforward manner the recent analysis from the breakthrough results of Allen-Zhu and Orecchia [2] to achieve our still-further improved bounds. More generally, this method addresses "interference" among coordinates, by which we mean the impact of the update of one coordinate on the gradients of other coordinates. Such interference is a core issue in parallelizing optimization routines that rely on smoothness properties. Since our DB-SCD method reduces interference via updating a selective subset of variables at each iteration, we expect it may also have more general applicability in optimization. △ Less

Submitted 19 November, 2015; originally announced November 2015.

arXiv:1510.05185 [pdf, other]

A Local Perspective on Community Structure in Multilayer Networks

Authors: Lucas G. S. Jeub, Michael W. Mahoney, Peter J. Mucha, Mason A. Porter

Abstract: The analysis of multilayer networks is among the most active areas of network science, and there are now several methods to detect dense "communities" of nodes in multilayer networks. One way to define a community is as a set of nodes that trap a diffusion-like dynamical process (usually a random walk) for a long time. In this view, communities are sets of nodes that create bottlenecks to the spre… ▽ More The analysis of multilayer networks is among the most active areas of network science, and there are now several methods to detect dense "communities" of nodes in multilayer networks. One way to define a community is as a set of nodes that trap a diffusion-like dynamical process (usually a random walk) for a long time. In this view, communities are sets of nodes that create bottlenecks to the spreading of a dynamical process on a network. We analyze the local behavior of different random walks on multiplex networks (which are multilayer networks in which different layers correspond to different types of edges) and show that they have very different bottlenecks that hence correspond to rather different notions of what it means for a set of nodes to be a good community. This has direct implications for the behavior of community-detection methods that are based on these random walks. △ Less

Submitted 22 May, 2016; v1 submitted 17 October, 2015; originally announced October 2015.

Comments: 20 pages, 5 figures (some with multiple parts)

arXiv:1509.05111

Optimal Subsampling Approaches for Large Sample Linear Regression

Authors: Rong Zhu, ** Ma, Michael W. Mahoney, Bin Yu

Abstract: A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample from the original full sample and uses it as a surrogate for subsequent computation and estimation. In this paper, we study subsampling methods under two scenari… ▽ More A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample from the original full sample and uses it as a surrogate for subsequent computation and estimation. In this paper, we study subsampling methods under two scenarios: approximating the full sample ordinary least-square (OLS) estimator and estimating the coefficients in linear regression. We present two algorithms, weighted estimation algorithm and unweighted estimation algorithm, and analyze asymptotic behaviors of their resulting subsample estimators under general conditions. For the weighted estimation algorithm, we propose a criterion for selecting the optimal sampling probability by making use of the asymptotic results. On the basis of the criterion, we provide two novel subsampling methods, the optimal subsampling and the predictor- length subsampling methods. The predictor-length subsampling method is based on the L2 norm of predictors rather than leverage scores. Its computational cost is scalable. For unweighted estimation algorithm, we show that its resulting subsample estimator is not consistent to the full sample OLS estimator. However, it has better performance than the weighted estimation algorithm for estimating the coefficients. Simulation studies and a real data example are used to demonstrate the effectiveness of our proposed subsampling methods. △ Less

Submitted 22 November, 2015; v1 submitted 16 September, 2015; originally announced September 2015.

Comments: This paper has been withdrawn by the author due to the incompleteness of this draft

arXiv:1508.02439 [pdf, ps, other]

Unified Acceleration Method for Packing and Covering Problems via Diameter Reduction

Authors: Di Wang, Satish Rao, Michael W. Mahoney

Abstract: The linear coupling method was introduced recently by Allen-Zhu and Orecchia for solving convex optimization problems with first order methods, and it provides a conceptually simple way to integrate a gradient descent step and mirror descent step in each iteration. The high-level approach of the linear coupling method is very flexible, and it has shown initial promise by providing improved algorit… ▽ More The linear coupling method was introduced recently by Allen-Zhu and Orecchia for solving convex optimization problems with first order methods, and it provides a conceptually simple way to integrate a gradient descent step and mirror descent step in each iteration. The high-level approach of the linear coupling method is very flexible, and it has shown initial promise by providing improved algorithms for packing and covering linear programs. Somewhat surprisingly, however, while the dependence of the convergence rate on the error parameter $ε$ for packing problems was improved to $O(1/ε)$, which corresponds to what accelerated gradient methods are designed to achieve, the dependence for covering problems was only improved to $O(1/ε^{1.5})$, and even that required a different more complicated algorithm. Given the close connections between packing and covering problems and since previous algorithms for these very related problems have led to the same $ε$ dependence, this discrepancy is surprising, and it leaves open the question of the exact role that the linear coupling is playing in coordinating the complementary gradient and mirror descent step of the algorithm. In this paper, we clarify these issues for linear coupling algorithms for packing and covering linear programs, illustrating that the linear coupling method can lead to improved $O(1/ε)$ dependence for both packing and covering problems in a unified manner, i.e., with the same algorithm and almost identical analysis. Our main technical result is a novel diameter reduction method for covering problems that is of independent interest and that may be useful in applying the accelerated linear coupling method to other combinatorial problems. △ Less

Submitted 6 October, 2015; v1 submitted 10 August, 2015; originally announced August 2015.

Comments: Fixed typo in packing LP formulation (page 1), and wrong citation in the discussion of earlier works on page 2

arXiv:1505.06659 [pdf, ps, other]

Statistical and Algorithmic Perspectives on Randomized Sketching for Ordinary Least-Squares -- ICML

Authors: Garvesh Raskutti, Michael Mahoney

Abstract: We consider statistical and algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior results show that, from an \emph{algorithmic perspective}, when using sketching matrices constructed from random projections and leverage-score sampling, if the number of samples $r$ much smaller than the original sample size $n$, then the worst-case (WC)… ▽ More We consider statistical and algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior results show that, from an \emph{algorithmic perspective}, when using sketching matrices constructed from random projections and leverage-score sampling, if the number of samples $r$ much smaller than the original sample size $n$, then the worst-case (WC) error is the same as solving the original problem, up to a very small relative error. From a \emph{statistical perspective}, one typically considers the mean-squared error performance of randomized sketching algorithms, when data are generated according to a statistical linear model. In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling algorithms. Among other results, we show that the RE can be upper bounded when $r$ is much smaller than $n$, while the PE typically requires the number of samples $r$ to be substantially larger. Lower bounds developed in subsequent work show that our upper bounds on PE can not be improved. △ Less

Submitted 25 May, 2015; originally announced May 2015.

Comments: 9 pages, Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR

arXiv:1505.00398 [pdf, other]

doi 10.1137/18M1212586

Block Basis Factorization for Scalable Kernel Matrix Evaluation

Authors: Ruoxi Wang, Yingzhou Li, Michael W. Mahoney, Eric Darve

Abstract: Kernel methods are widespread in machine learning; however, they are limited by the quadratic complexity of the construction, application, and storage of kernel matrices. Low-rank matrix approximation algorithms are widely used to address this problem and reduce the arithmetic and storage cost. However, we observed that for some datasets with wide intra-class variability, the optimal kernel parame… ▽ More Kernel methods are widespread in machine learning; however, they are limited by the quadratic complexity of the construction, application, and storage of kernel matrices. Low-rank matrix approximation algorithms are widely used to address this problem and reduce the arithmetic and storage cost. However, we observed that for some datasets with wide intra-class variability, the optimal kernel parameter for smaller classes yields a matrix that is less well approximated by low-rank methods. In this paper, we propose an efficient structured low-rank approximation method -- the Block Basis Factorization (BBF) -- and its fast construction algorithm to approximate radial basis function (RBF) kernel matrices. Our approach has linear memory cost and floating-point operations for many machine learning kernels. BBF works for a wide range of kernel bandwidth parameters and extends the domain of applicability of low-rank approximation methods significantly. Our empirical results demonstrate the stability and superiority over the state-of-art kernel approximation algorithms. △ Less

Submitted 4 May, 2021; v1 submitted 3 May, 2015; originally announced May 2015.

Comments: 16 pages, 5 figures

Journal ref: SIAM Journal on Matrix Analysis and Applications, 2019, Vol. 40, No. 4 : pp. 1497-1526

arXiv:1502.03571 [pdf, ps, other]

Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

Authors: Jiyan Yang, Yin-Lam Chow, Christopher Ré, Michael W. Mahoney

Abstract: In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems---e.g., $\ell_2$ and $\ell_1$ regression problems. We propose a hybrid algorithm named pwSGD… ▽ More In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems---e.g., $\ell_2$ and $\ell_1$ regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity. Particularly, when solving $\ell_1$ regression with size $n$ by $d$, pwSGD returns an approximate solution with $ε$ relative error in the objective value in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d)/ε^2)$ time. This complexity is uniformly better than that of RLA methods in terms of both $ε$ and $d$ when the problem is unconstrained. For $\ell_2$ regression, pwSGD returns an approximate solution with $ε$ relative error in the objective value and the solution vector measured in prediction norm in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d) \log(1/ε) /ε)$ time. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that still new ideas will be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets. △ Less

Submitted 10 July, 2017; v1 submitted 12 February, 2015; originally announced February 2015.

Comments: A conference version of this paper appears under the same title in Proceedings of ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, 2016

arXiv:1502.03032 [pdf, ps, other]

Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments

Authors: Jiyan Yang, Xiangrui Meng, Michael W. Mahoney

Abstract: In this era of large-scale data, distributed systems built on top of clusters of commodity hardware provide cheap and reliable storage and scalable processing of massive data. Here, we review recent work on develo** and implementing randomized matrix algorithms in large-scale parallel and distributed environments. Randomized algorithms for matrix problems have received a great deal of attention… ▽ More In this era of large-scale data, distributed systems built on top of clusters of commodity hardware provide cheap and reliable storage and scalable processing of massive data. Here, we review recent work on develo** and implementing randomized matrix algorithms in large-scale parallel and distributed environments. Randomized algorithms for matrix problems have received a great deal of attention in recent years, thus far typically either in theory or in machine learning applications or with implementations on a single machine. Our main focus is on the underlying theory and practical implementation of random projection and random sampling algorithms for very large very overdetermined (i.e., overconstrained) $\ell_1$ and $\ell_2$ regression problems. Randomization can be used in one of two related ways: either to construct sub-sampled problems that can be solved, exactly or approximately, with traditional numerical methods; or to construct preconditioned versions of the original full problem that are easier to solve with traditional iterative algorithms. Theoretical results demonstrate that in near input-sparsity time and with only a few passes through the data one can obtain very strong relative-error approximate solutions, with high probability. Empirical results highlight the importance of various trade-offs (e.g., between the time to construct an embedding and the conditioning quality of the embedding, between the relative importance of computation versus communication, etc.) and demonstrate that $\ell_1$ and $\ell_2$ regression problems can be solved to low, medium, or high precision in existing distributed systems on up to terabyte-sized data. △ Less

Submitted 27 July, 2015; v1 submitted 10 February, 2015; originally announced February 2015.

arXiv:1412.8293 [pdf, ps, other]

Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels

Authors: Haim Avron, Vikas Sindhwani, Jiyan Yang, Michael Mahoney

Abstract: We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead… ▽ More We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem. △ Less

Submitted 9 August, 2015; v1 submitted 29 December, 2014; originally announced December 2014.

Comments: A short version of this paper has been presented in ICML 2014

arXiv:1411.1546 [pdf, other]

Tree decompositions and social graphs

Authors: Aaron B. Adcock, Blair D. Sullivan, Michael W. Mahoney

Abstract: Recent work has established that large informatics graphs such as social and information networks have non-trivial tree-like structure when viewed at moderate size scales. Here, we present results from the first detailed empirical evaluation of the use of tree decomposition (TD) heuristics for structure identification and extraction in social graphs. Although TDs have historically been used in str… ▽ More Recent work has established that large informatics graphs such as social and information networks have non-trivial tree-like structure when viewed at moderate size scales. Here, we present results from the first detailed empirical evaluation of the use of tree decomposition (TD) heuristics for structure identification and extraction in social graphs. Although TDs have historically been used in structural graph theory and scientific computing, we show that---even with existing TD heuristics developed for those very different areas---TD methods can identify interesting structure in a wide range of realistic informatics graphs. Our main contributions are the following: we show that TD methods can identify structures that correlate strongly with the core-periphery structure of realistic networks, even when using simple greedy heuristics; we show that the peripheral bags of these TDs correlate well with low-conductance communities (when they exist) found using local spectral computations; and we show that several types of large-scale "ground-truth" communities, defined by demographic metadata on the nodes of the network, are well-localized in the large-scale and/or peripheral structures of the TDs. Our other main contributions are the following: we provide detailed empirical results for TD heuristics on toy and synthetic networks to establish a baseline to understand better the behavior of the heuristics on more complex real-world networks; and we prove a theorem providing formal justification for the intuition that the only two impediments to low-distortion hyperbolic embedding are high tree-width and long geodesic cycles. Our results suggest future directions for improved TD heuristics that are more appropriate for realistic social graphs. △ Less

Submitted 3 May, 2016; v1 submitted 6 November, 2014; originally announced November 2014.

Comments: v2 has 44 pages, 21 figures, 7 tables, 107 references. To appear in Internet Mathematics

arXiv:1411.0306 [pdf, other]

Fast Randomized Kernel Methods With Statistical Guarantees

Authors: Ahmed El Alaoui, Michael W. Mahoney

Abstract: One approach to improving the running time of kernel-based machine learning methods is to build a small sketch of the input and use it in lieu of the full kernel matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \emph{statisti… ▽ More One approach to improving the running time of kernel-based machine learning methods is to build a small sketch of the input and use it in lieu of the full kernel matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \emph{statistical leverage scores} to the setting of kernel ridge regression, our main statistical result is to identify an importance sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the \emph{effective dimensionality} of the problem. This quantity is often much smaller than previous bounds that depend on the \emph{maximal degrees of freedom}. Our main algorithmic result is to present a fast algorithm to compute approximations to these scores. This algorithm runs in time that is linear in the number of samples---more precisely, the running time is $O(np^2)$, where the parameter $p$ depends only on the trace of the kernel matrix and the regularization parameter---and it can be applied to the matrix of feature vectors, without having to form the full kernel matrix. This is obtained via a variant of length-squared sampling that we adapt to the kernel setting in a way that is of independent interest. Lastly, we provide empirical results illustrating our theory, and we discuss how this new notion of the statistical leverage of a data point captures in a fine way the difficulty of the original statistical learning problem. △ Less

Submitted 8 November, 2015; v1 submitted 2 November, 2014; originally announced November 2014.

Comments: Improved presentation. Technical details fixed. A conference version of this paper appears in NIPS15 under the modified title "Fast Randomized Kernel Ridge Regression with Statistical Guarantees"

arXiv:1409.7423 [pdf, other]

doi 10.1016/j.jcp.2015.05.034

High-order boundary integral equation solution of high frequency wave scattering from obstacles in an unbounded linearly stratified medium

Authors: Alex. H. Barnett, Bradley J. Nelson, J. Matthew Mahoney

Abstract: We apply boundary integral equations for the first time to the two-dimensional scattering of time-harmonic waves from a smooth obstacle embedded in a continuously-graded unbounded medium. In the case we solve the square of the wavenumber (refractive index) varies linearly in one coordinate, i.e. $(Δ+ E + x_2)u(x_1,x_2) = 0$ where $E$ is a constant; this models quantum particles of fixed energy in… ▽ More We apply boundary integral equations for the first time to the two-dimensional scattering of time-harmonic waves from a smooth obstacle embedded in a continuously-graded unbounded medium. In the case we solve the square of the wavenumber (refractive index) varies linearly in one coordinate, i.e. $(Δ+ E + x_2)u(x_1,x_2) = 0$ where $E$ is a constant; this models quantum particles of fixed energy in a uniform gravitational field, and has broader applications to stratified media in acoustics, optics and seismology. We evaluate the fundamental solution efficiently with exponential accuracy via numerical saddle-point integration, using the truncated trapezoid rule with typically 100 nodes, with an effort that is independent of the frequency parameter $E$. By combining with high-order Nystrom quadrature, we are able to solve the scattering from obstacles 50 wavelengths across to 11 digits of accuracy in under a minute on a desktop or laptop. △ Less

Submitted 25 September, 2014; originally announced September 2014.

Comments: 22 pages, 9 figures, submitted to J. Comput. Phys

MSC Class: 65N38; 65N80; 34M60; 65D20

arXiv:1406.5986 [pdf, other]

A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

Authors: Garvesh Raskutti, Michael Mahoney

Abstract: We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S\in\mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, ske… ▽ More We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S\in\mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X,Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p < r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical model $Y = X β+ ε$, where $ε$ is a noise process. We provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p < r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE can not be improved. △ Less

Submitted 25 August, 2015; v1 submitted 23 June, 2014; originally announced June 2014.

Comments: 27 pages, 5 figures

arXiv:1403.3795 [pdf, other]

doi 10.1103/PhysRevE.91.012821

Think Locally, Act Locally: The Detection of Small, Medium-Sized, and Large Communities in Large Networks

Authors: Lucas G. S. Jeub, Prakash Balachandran, Mason A. Porter, Peter J. Mucha, Michael W. Mahoney

Abstract: It is common in the study of networks to investigate meso-scale features to try to gain an understanding of network structure and function. For example, numerous algorithms have been developed to try to identify "communities," which are typically construed as sets of nodes with denser connections internally than with the remainder of a network. In this paper, we adopt a complementary perspective t… ▽ More It is common in the study of networks to investigate meso-scale features to try to gain an understanding of network structure and function. For example, numerous algorithms have been developed to try to identify "communities," which are typically construed as sets of nodes with denser connections internally than with the remainder of a network. In this paper, we adopt a complementary perspective that "communities" are associated with bottlenecks of locally-biased dynamical processes that begin at seed sets of nodes, and we employ several different community-identification procedures (using diffusion-based and geodesic-based dynamics) to investigate community quality as a function of community size. Using several empirical and synthetic networks, we identify several distinct scenarios for ``size-resolved community structure'' that can arise in real (and realistic) networks. Depending on which scenario holds, one may or may not be able to successfully identify ``good'' communities in a given network, the manner in which different small communities fit together to form meso-scale network structures can be very different, and processes such as viral propagation and information diffusion can exhibit very different dynamics.In addition, our results suggest that, for many large realistic networks, the output of locally-biased methods that focus on communities that are centered around a given seed node might have better conceptual grounding and greater practical utility than the output of global community-detection methods. They also illustrate subtler structural properties that are important to consider in the development of better benchmark networks to test methods for community detection. [Note: Because of space limitations in the arXiv's abstract field, this is an abridged version of the paper's abstract.] △ Less

Submitted 8 October, 2014; v1 submitted 15 March, 2014; originally announced March 2014.

Comments: 32 pages, 19 figures (many with multiple parts); the abstract is abridged because of space limitations in the arXiv's abstract field

Showing 151–200 of 237 results for author: Mahoney, M