-
Deep Learning Scaling is Predictable, Empirically
Authors:
Joel Hestness,
Sharan Narang,
Newsha Ardalani,
Gregory Diamos,
Heewoo Jun,
Hassan Kianinejad,
Md. Mostofa Ali Patwary,
Yang Yang,
Yanqi Zhou
Abstract:
Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, comput…
▽ More
Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art.
This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
△ Less
Submitted 1 December, 2017;
originally announced December 2017.
-
Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies
Authors:
Brian Friesen,
Md. Mostofa Ali Patwary,
Brian Austin,
Nadathur Satish,
Zachary Slepian,
Narayanan Sundaram,
Deborah Bard,
Daniel J Eisenstein,
Jack Deslippe,
Pradeep Dubey,
Prabhat
Abstract:
The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to bill…
▽ More
The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.
△ Less
Submitted 31 August, 2017;
originally announced September 2017.
-
PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures
Authors:
Md. Mostofa Ali Patwary,
Nadathur Rajagopalan Satish,
Narayanan Sundaram,
Jialin Liu,
Peter Sadowski,
Evan Racah,
Suren Byna,
Craig Tull,
Wahid Bhimji,
Prabhat,
Pradeep Dubey
Abstract:
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited s…
▽ More
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited scalability for Big Data analytics challenges in the scientific domain. In this paper, we present parallel and highly optimized kd-tree based KNN algorithms (both construction and querying) suitable for distributed architectures. Our algorithm includes novel approaches for pruning search space and improving load balancing and partitioning among nodes and threads. Using TB-sized datasets from three science applications: astrophysics, plasma physics, and particle physics, we show that our implementation can construct kd-tree of 189 billion particles in 48 seconds on utilizing $\sim$50,000 cores. We also demonstrate computation of KNN of 19 billion queries in 12 seconds. We demonstrate almost linear speedup both for shared and distributed memory computers. Our algorithms outperforms earlier implementations by more than order of magnitude; thereby radically improving the applicability of our implementation to state-of-the-art Big Data analytics problems. In addition, we showcase performance and scalability on the recently released Intel Xeon Phi processor showing that our algorithm scales well even on massively parallel architectures.
△ Less
Submitted 27 July, 2016;
originally announced July 2016.
-
A New Parallel Algorithm for Two-Pass Connected Component Labeling
Authors:
Siddharth Gupta,
Diana Palsetia,
Md. Mostofa Ali Patwary,
Ankit Agrawal,
Alok Choudhary
Abstract:
Connected Component Labeling (CCL) is an important step in pattern recognition and image processing. It assigns labels to the pixels such that adjacent pixels sharing the same features are assigned the same label. Typically, CCL requires several passes over the data. We focus on two-pass technique where each pixel is given a provisional label in the first pass whereas an actual label is assigned i…
▽ More
Connected Component Labeling (CCL) is an important step in pattern recognition and image processing. It assigns labels to the pixels such that adjacent pixels sharing the same features are assigned the same label. Typically, CCL requires several passes over the data. We focus on two-pass technique where each pixel is given a provisional label in the first pass whereas an actual label is assigned in the second pass.
We present a scalable parallel two-pass CCL algorithm, called PAREMSP, which employs a scan strategy and the best union-find technique called REMSP, which uses REM's algorithm for storing label equivalence information of pixels in a 2-D image. In the first pass, we divide the image among threads and each thread runs the scan phase along with REMSP simultaneously. In the second phase, we assign the final labels to the pixels. As REMSP is easily parallelizable, we use the parallel version of REMSP for merging the pixels on the boundary. Our experiments show the scalability of PAREMSP achieving speedups up to $20.1$ using $24$ cores on shared memory architecture using OpenMP for an image of size $465.20$ MB. We find that our proposed parallel algorithm achieves linear scaling for a large resolution fixed problem size as the number of processing elements are increased. Additionally, the parallel algorithm does not make use of any hardware specific routines, and thus is highly portable.
△ Less
Submitted 20 June, 2016;
originally announced June 2016.
-
GraphMat: High performance graph analytics made productive
Authors:
Narayanan Sundaram,
Nadathur Rajagopalan Satish,
Md Mostofa Ali Patwary,
Subramanya R Dulloor,
Satya Gautam Vadlamudi,
Dipankar Das,
Pradeep Dubey
Abstract:
Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and map** them to high performance sparse matrix operat…
▽ More
Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and map** them to high performance sparse matrix operations in the backend. We get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is in C++, and we have been able to write a diverse set of graph algorithms in this framework with the same effort compared to other vertex programming frameworks. GraphMat performs 1.2-7X faster than high performance frameworks such as GraphLab, CombBLAS and Galois. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is 1.2X off native, hand-optimized code on a variety of different graph algorithms. Since GraphMat performance depends mainly on a few scalable and well-understood sparse matrix operations, GraphMatcan naturally benefit from the trend of increasing parallelism on future hardware.
△ Less
Submitted 24 March, 2015;
originally announced March 2015.
-
Scalable Bayesian Optimization Using Deep Neural Networks
Authors:
Jasper Snoek,
Oren Rippel,
Kevin Swersky,
Ryan Kiros,
Nadathur Satish,
Narayanan Sundaram,
Md. Mostofa Ali Patwary,
Prabhat,
Ryan P. Adams
Abstract:
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs scale…
▽ More
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires many evaluations, and as such, massively parallelizing the optimization.
In this work, we explore the use of neural networks as an alternative to GPs to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we apply to large scale hyperparameter optimization, rapidly finding competitive models on benchmark object recognition tasks using convolutional networks, and image caption generation using neural language models.
△ Less
Submitted 13 July, 2015; v1 submitted 19 February, 2015;
originally announced February 2015.
-
Fast Algorithms for the Maximum Clique Problem on Massive Graphs with Applications to Overlap** Community Detection
Authors:
Bharath Pattabiraman,
Md. Mostofa Ali Patwary,
Assefaw H. Gebremedhin,
Wei-keng Liao,
Alok Choudhary
Abstract:
The maximum clique problem is a well known NP-Hard problem with applications in data mining, network analysis, information retrieval and many other areas related to the World Wide Web. There exist several algorithms for the problem with acceptable runtimes for certain classes of graphs, but many of them are infeasible for massive graphs. We present a new exact algorithm that employs novel pruning…
▽ More
The maximum clique problem is a well known NP-Hard problem with applications in data mining, network analysis, information retrieval and many other areas related to the World Wide Web. There exist several algorithms for the problem with acceptable runtimes for certain classes of graphs, but many of them are infeasible for massive graphs. We present a new exact algorithm that employs novel pruning techniques and is able to find maximum cliques in very large, sparse graphs quickly. Extensive experiments on different kinds of synthetic and real-world graphs show that our new algorithm can be orders of magnitude faster than existing algorithms. We also present a heuristic that runs orders of magnitude faster than the exact algorithm while providing optimal or near-optimal solutions. We illustrate a simple application of the algorithms in develo** methods for detection of overlap** communities in networks.
△ Less
Submitted 26 November, 2014;
originally announced November 2014.
-
Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage
Authors:
Ryan A. Rossi,
David F. Gleich,
Assefaw H. Gebremedhin,
Md. Mostofa Ali Patwary
Abstract:
We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Our method…
▽ More
We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Our method employs a branch and bound strategy with novel and aggressive pruning techniques. For instance, we use the core number of a vertex in combination with a good heuristic clique finder to efficiently remove the vast majority of the search space. In addition, we parallelize the exploration of the search tree. During the search, processes immediately communicate changes to upper and lower bounds on the size of maximum clique, which occasionally results in a super-linear speedup because vertices with large search spaces can be pruned by other processes. We apply the algorithm to two problems: to compute temporal strong components and to compress graphs.
△ Less
Submitted 25 December, 2013; v1 submitted 25 February, 2013;
originally announced February 2013.
-
What if CLIQUE were fast? Maximum Cliques in Information Networks and Strong Components in Temporal Networks
Authors:
Ryan A. Rossi,
David F. Gleich,
Assefaw H. Gebremedhin,
Md. Mostofa Ali Patwary
Abstract:
Exact maximum clique finders have progressed to the point where we can investigate cliques in million-node social and information networks, as well as find strongly connected components in temporal networks. We use one such finder to study a large collection of modern networks emanating from biological, social, and technological domains. We show inter-relationships between maximum cliques and seve…
▽ More
Exact maximum clique finders have progressed to the point where we can investigate cliques in million-node social and information networks, as well as find strongly connected components in temporal networks. We use one such finder to study a large collection of modern networks emanating from biological, social, and technological domains. We show inter-relationships between maximum cliques and several other common network properties, including network density, maximum core, and number of triangles. In temporal networks, we find that the largest temporal strong components have around 20-30% of the vertices of the entire network. These components represent groups of highly communicative individuals. In addition, we discuss and improve the performance and utility of the maximum clique finder itself.
△ Less
Submitted 30 October, 2012; v1 submitted 22 October, 2012;
originally announced October 2012.
-
Fast Algorithms for the Maximum Clique Problem on Massive Sparse Graphs
Authors:
Bharath Pattabiraman,
Md. Mostofa Ali Patwary,
Assefaw H. Gebremedhin,
Wei-keng Liao,
Alok Choudhary
Abstract:
The maximum clique problem is a well known NP-Hard problem with applications in data mining, network analysis, informatics, and many other areas. Although there exist several algorithms with acceptable runtimes for certain classes of graphs, many of them are infeasible for massive graphs. We present a new exact algorithm that employs novel pruning techniques to very quickly find maximum cliques in…
▽ More
The maximum clique problem is a well known NP-Hard problem with applications in data mining, network analysis, informatics, and many other areas. Although there exist several algorithms with acceptable runtimes for certain classes of graphs, many of them are infeasible for massive graphs. We present a new exact algorithm that employs novel pruning techniques to very quickly find maximum cliques in large sparse graphs. Extensive experiments on several types of synthetic and real-world graphs show that our new algorithm is up to several orders of magnitude faster than existing algorithms for most instances. We also present a heuristic variant that runs orders of magnitude faster than the exact algorithm, while providing optimal or near-optimal solutions.
△ Less
Submitted 14 November, 2012; v1 submitted 25 September, 2012;
originally announced September 2012.