-
Alchemist: An Apache Spark <=> MPI Interface
Authors:
Alex Gittens,
Kai Rothauge,
Shusen Wang,
Michael W. Mahoney,
Jey Kottalam,
Lisa Gerhardt,
Prabhat,
Michael Ringenburg,
Kristyn Maschhoff
Abstract:
The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map directly onto this model. One way to mitigate these costs is to off-load computations onto MPI codes. In recent work, we introduced Alchemist, a system for the analysis of large-scale data sets. Alchemist calls MPI-based libraries from within Spark applications, and it has minimal coding, communication, and memory overheads. In particular, Alchemist allows users to retain the productivity benefits of working within the Spark software ecosystem without sacrificing performance efficiency in linear algebra, machine learning, and other related computations.
In this paper, we discuss the motivation behind the development of Alchemist, and we provide a detailed overview of its design and usage. We also demonstrate the efficiency of our approach on medium-to-large data sets, using some standard linear algebra operations, namely matrix multiplication and the truncated singular value decomposition of a dense matrix, and we compare the performance of Spark with that of Spark+Alchemist. These computations are run on the NERSC supercomputer Cori Phase 1, a Cray XC40.
Submitted 3 June, 2018;
originally announced June 2018.
-
Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist
Authors:
Alex Gittens,
Kai Rothauge,
Shusen Wang,
Michael W. Mahoney,
Lisa Gerhardt,
Prabhat,
Jey Kottalam,
Michael Ringenburg,
Kristyn Maschhoff
Abstract:
Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations (in particular, many linear algebra computations that are the basis for solving common machine learning problems) are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI).
To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation.
We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.
Submitted 30 May, 2018;
originally announced May 2018.
-
Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
Authors:
Alex Gittens,
Aditya Devarakonda,
Evan Racah,
Michael Ringenburg,
Lisa Gerhardt,
Jey Kottalam,
Jialin Liu,
Kristyn Maschhoff,
Shane Canon,
Jatin Chhugani,
Pramod Sharma,
Jiyan Yang,
James Demmel,
Jim Harrell,
Venkat Krishnamurthy,
Michael W. Mahoney,
Prabhat
Abstract:
We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
Submitted 20 September, 2016; v1 submitted 5 July, 2016;
originally announced July 2016.
-
MLI: An API for Distributed Machine Learning
Authors:
Evan R. Sparks,
Ameet Talwalkar,
Virginia Smith,
Jey Kottalam,
Xinghao Pan,
Joseph Gonzalez,
Michael J. Franklin,
Michael I. Jordan,
Tim Kraska
Abstract:
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
Submitted 25 October, 2013; v1 submitted 21 October, 2013;
originally announced October 2013.