-
Graphyti: A Semi-External Memory Graph Library for FlashGraph
Authors:
Disa Mhembere,
Da Zheng,
Carey E. Priebe,
Joshua T. Vogelstein,
Randal Burns
Abstract:
Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out, distributing computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM) that uses a single multicore node and operates on graphs larger than memory. In SEM, $\mathcal{O}(m)$ data resides on…
▽ More
Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out, distributing computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM) that uses a single multicore node and operates on graphs larger than memory. In SEM, $\mathcal{O}(m)$ data resides on disk and $\mathcal{O}(n)$ data in memory, for a graph with $n$ vertices and $m$ edges. For developers, this adds complexity because they must explicitly encode I/O within applications. We present principles that are critical for application developers to adopt in order to achieve state-of-the-art performance, while minimizing I/O and memory for algorithms in SEM. We present them in Graphyti, an extensible parallel SEM graph library built on FlashGraph and available in Python via pip. In SEM, Graphyti achieves 80% of the performance of in-memory execution and retains the performance of FlashGraph, which outperforms distributed engines, such as PowerGraph and Galois.
△ Less
Submitted 7 July, 2019;
originally announced July 2019.
-
clusterNOR: A NUMA-Optimized Clustering Framework
Authors:
Disa Mhembere,
Da Zheng,
Carey E. Priebe,
Joshua T. Vogelstein,
Randal Burns
Abstract:
Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffer from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximizes independent, asynchronous computation. We elimina…
▽ More
Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffer from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximizes independent, asynchronous computation. We eliminate many barriers, reduce remote memory accesses, and maximize cache reuse. We implement the 'Clustering NUMA Optimized Routines' (clusterNOR) extensible parallel framework that provides algorithmic building blocks. The system is generic, we demonstrate nine modern clustering algorithms that have simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external memory, and (iii) distributed memory execution, enabling computation for varying memory and hardware budgets. For algorithms that rely on Euclidean distance, clusterNOR defines an updated Elkan's triangle inequality pruning algorithm that uses asymptotically less memory so that it works on billion-point data sets. clusterNOR extends and expands the scope of the 'knor' library for k-means clustering by generalizing underlying principles, providing a uniform programming interface and expanding the scope to hierarchical and linear algebraic classes of algorithms. The compound effect of our optimizations is an order of magnitude improvement in speed over other state-of-the-art solutions, such as Spark's MLlib and Apple's Turi.
△ Less
Submitted 17 January, 2021; v1 submitted 24 February, 2019;
originally announced February 2019.
-
Forest Packing: Fast, Parallel Decision Forests
Authors:
James Browne,
Tyler M. Tomita,
Disa Mhembere,
Randal Burns,
Joshua T. Vogelstein
Abstract:
Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the travers…
▽ More
Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the traversals of tree data structures incur many random memory accesses and are very slow. We present memory packing techniques that reorganize learned forests to minimize cache misses during classification. The resulting layout is hierarchical. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in cache lines. We extend this layout with out-of-order execution and cache-line prefetching to increase memory throughput. Together, these optimizations increase the performance of classification in ensembles by a factor of four over an optimized C++ implementation and a actor of 50 over a popular R language implementation.
△ Less
Submitted 19 June, 2018;
originally announced June 2018.
-
knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library
Authors:
Disa Mhembere,
Da Zheng,
Carey E. Priebe,
Joshua T. Vogelstein,
Randal Burns
Abstract:
k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The \textit{k-means NUMA Optimized Routine}…
▽ More
k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The \textit{k-means NUMA Optimized Routine} (\textsf{knor}) library has (i) in-memory (\textsf{knori}), (ii) distributed memory (\textsf{knord}), and (iii) semi-external memory (\textsf{knors}) modules that radically improve the performance of k-means for varying memory and hardware budgets. \textsf{knori} boosts performance for single machine datasets by an order of magnitude or more. \textsf{knors} improves the scalability of k-means on a memory budget using SSDs. \textsf{knors} scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. \textsf{knord} retains \textsf{knori}'s performance characteristics, while scaling in-memory through distributed computation in the cloud. \textsf{knor} modifies Elkan's triangle inequality pruning algorithm such that we utilize it on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate \textsf{knor} outperforms distributed commercial products like H$_2$O, Turi (formerly Dato, GraphLab) and Spark's MLlib by more than an order of magnitude for datasets of $10^7$ to $10^9$ points.
△ Less
Submitted 24 June, 2017; v1 submitted 28 June, 2016;
originally announced June 2016.
-
FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs
Authors:
Da Zheng,
Disa Mhembere,
Joshua T. Vogelstein,
Carey E. Priebe,
Randal Burns
Abstract:
R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory c…
▽ More
R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory capacity by utilizing solid-state drives (SSDs) automatically. It provides a small number of generalized operations (GenOps) upon which we reimplement a large number of matrix functions in the R base package. As such, FlashR parallelizes and scales existing R code with little/no modification. To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. We evaluate FlashR on a variety of machine learning and statistics algorithms on inputs of up to four billion data points. FlashR out-of-core tracks closely the performance of FlashR in-memory. The R code for machine learning algorithms executed in FlashR outperforms the in-memory execution of H2O and Spark MLlib by a factor of 2-10 and outperforms Revolution R Open by more than an order of magnitude.
△ Less
Submitted 18 May, 2017; v1 submitted 21 April, 2016;
originally announced April 2016.
-
Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs
Authors:
Da Zheng,
Disa Mhembere,
Vince Lyzinski,
Joshua Vogelstein,
Carey E. Priebe,
Randal Burns
Abstract:
Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memor…
▽ More
Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of kee** data on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks--PageRank, eigensolving, and non-negative matrix factorization--and show that our SEM implementations significantly advance the state of the art.
△ Less
Submitted 14 October, 2016; v1 submitted 9 February, 2016;
originally announced February 2016.
-
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
Authors:
Da Zheng,
Disa Mhembere,
Randal Burns,
Joshua Vogelstein,
Carey E. Priebe,
Alexander S. Szalay
Abstract:
Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance los…
▽ More
Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss. We do so by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism. Our semi-external memory graph engine called FlashGraph stores vertex state in memory and edge lists on SSDs. It hides latency by overlap** computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge lists requested by applications from SSDs; to increase I/O throughput and reduce CPU overhead for I/O, it conservatively merges I/O requests. These designs maximize performance for applications with different I/O characteristics. FlashGraph exposes a general and flexible vertex-centric programming interface that can express a wide variety of graph algorithms and their optimizations. We demonstrate that FlashGraph in semi-external memory performs many algorithms with performance up to 80% of its in-memory implementation and significantly outperforms PowerGraph, a popular distributed in-memory graph engine.
△ Less
Submitted 25 January, 2015; v1 submitted 3 August, 2014;
originally announced August 2014.
-
MIGRAINE: MRI Graph Reliability Analysis and Inference for Connectomics
Authors:
William Gray Roncal,
Zachary H. Koterba,
Disa Mhembere,
Dean M. Kleissas,
Joshua T. Vogelstein,
Randal Burns,
Anita R. Bowles,
Dimitrios K. Donavos,
Sephira Ryman,
Rex E. Jung,
Lei Wu,
Vince Calhoun,
R. Jacob Vogelstein
Abstract:
Currently, connectomes (e.g., functional or structural brain graphs) can be estimated in humans at $\approx 1~mm^3$ scale using a combination of diffusion weighted magnetic resonance imaging, functional magnetic resonance imaging and structural magnetic resonance imaging scans. This manuscript summarizes a novel, scalable implementation of open-source algorithms to rapidly estimate magnetic resona…
▽ More
Currently, connectomes (e.g., functional or structural brain graphs) can be estimated in humans at $\approx 1~mm^3$ scale using a combination of diffusion weighted magnetic resonance imaging, functional magnetic resonance imaging and structural magnetic resonance imaging scans. This manuscript summarizes a novel, scalable implementation of open-source algorithms to rapidly estimate magnetic resonance connectomes, using both anatomical regions of interest (ROIs) and voxel-size vertices. To assess the reliability of our pipeline, we develop a novel nonparametric non-Euclidean reliability metric. Here we provide an overview of the methods used, demonstrate our implementation, and discuss available user extensions. We conclude with results showing the efficacy and reliability of the pipeline over previous state-of-the-art.
△ Less
Submitted 17 December, 2013;
originally announced December 2013.
-
Computing Scalable Multivariate Glocal Invariants of Large (Brain-) Graphs
Authors:
Disa Mhembere,
William Gray Roncal,
Daniel Sussman,
Carey E. Priebe,
Rex Jung,
Sephira Ryman,
R. Jacob Vogelstein,
Joshua T. Vogelstein,
Randal Burns
Abstract:
Graphs are quickly emerging as a leading abstraction for the representation of data. One important application domain originates from an emerging discipline called "connectomics". Connectomics studies the brain as a graph; vertices correspond to neurons (or collections thereof) and edges correspond to structural or functional connections between them. To explore the variability of connectomes---to…
▽ More
Graphs are quickly emerging as a leading abstraction for the representation of data. One important application domain originates from an emerging discipline called "connectomics". Connectomics studies the brain as a graph; vertices correspond to neurons (or collections thereof) and edges correspond to structural or functional connections between them. To explore the variability of connectomes---to address both basic science questions regarding the structure of the brain, and medical health questions about psychiatry and neurology---one can study the topological properties of these brain-graphs. We define multivariate glocal graph invariants: these are features of the graph that capture various local and global topological properties of the graphs. We show that the collection of features can collectively be computed via a combination of daisy-chaining, sparse matrix representation and computations, and efficient approximations. Our custom open-source Python package serves as a back-end to a Web-service that we have created to enable researchers to upload graphs, and download the corresponding invariants in a number of different formats. Moreover, we built this package to support distributed processing on multicore machines. This is therefore an enabling technology for network science, lowering the barrier of entry by providing tools to biologists and analysts who otherwise lack these capabilities. As a demonstration, we run our code on 120 brain-graphs, each with approximately 16M vertices and up to 90M edges.
△ Less
Submitted 16 December, 2013;
originally announced December 2013.