-
Minibatching Offers Improved Generalization Performance for Second Order Optimizers
Authors:
Eric Silk,
Swarnita Chakraborty,
Nairanjana Dasgupta,
Anand D. Sarwate,
Andrew Lumsdaine,
Tony Chiang
Abstract:
Training deep neural networks (DNNs) used in modern machine learning is computationally expensive. Machine learning scientists, therefore, rely on stochastic first-order methods for training, coupled with significant hand-tuning, to obtain good performance. To better understand performance variability of different stochastic algorithms, including second-order methods, we conduct an empirical study that treats performance as a response variable across multiple training sessions of the same model. Using 2-factor Analysis of Variance (ANOVA) with interactions, we show that batch size used during training has a statistically significant effect on the peak accuracy of the methods, and that full batch largely performed the worst. In addition, we found that second-order optimizers (SOOs) generally exhibited significantly lower variance at specific batch sizes, suggesting they may require less hyperparameter tuning, leading to a reduced overall time to solution for model training.
Submitted 25 May, 2023;
originally announced July 2023.
-
High-order Line Graphs of Non-uniform Hypergraphs: Algorithms, Applications, and Experimental Analysis
Authors:
Xu T. Liu,
Jesun Firoz,
Sinan Aksoy,
Ilya Amburg,
Andrew Lumsdaine,
Cliff Joslyn,
Assefaw H. Gebremedhin,
Brenda Praggastis
Abstract:
Hypergraphs offer flexible and robust data representations for many applications, but methods that work directly on hypergraphs are not readily available and tend to be prohibitively expensive. Much of the current analysis of hypergraphs relies on first performing a graph expansion -- either based on the nodes (clique expansion), or on the edges (line graph) -- and then running standard graph analytics on the resulting representative graph. However, this approach suffers from massive space complexity and high computational cost with increasing hypergraph size. Here, we present efficient, parallel algorithms to accelerate and reduce the memory footprint of higher-order graph expansions of hypergraphs. Our results focus on the edge-based $s$-line graph expansion, but the methods we develop work for higher-order clique expansions as well. To the best of our knowledge, ours is the first framework to enable hypergraph spectral analysis of a large dataset on a single shared-memory machine. Our methods enable the analysis of datasets from many domains that previous graph-expansion-based models are unable to provide. The proposed $s$-line graph computation algorithms are orders of magnitude faster than state-of-the-art sparse general matrix-matrix multiplication methods, and obtain approximately $5-31{\times}$ speedup over a prior state-of-the-art heuristic-based algorithm for $s$-line graph computation.
Submitted 27 January, 2022;
originally announced January 2022.
-
Parallel Algorithms and Heuristics for Efficient Computation of High-Order Line Graphs of Hypergraphs
Authors:
Xu T. Liu,
Jesun Firoz,
Andrew Lumsdaine,
Cliff Joslyn,
Sinan Aksoy,
Brenda Praggastis,
Assefaw Gebremedhin
Abstract:
This paper considers structures of systems beyond dyadic (pairwise) interactions and investigates mathematical modeling of multi-way interactions and connections as hypergraphs, where captured relationships among system entities are set-valued. To date, in most situations, entities in a hypergraph are considered connected as long as there is at least one common "neighbor". However, minimal commonality sometimes discards the "strength" of connections and interactions among groups. To this end, considering the "width" of a connection, referred to as the $s$-overlap of neighbors, provides more meaningful insights into how closely the communities or entities interact with each other. In addition, $s$-overlap computation is the fundamental kernel to construct the line graph of a hypergraph, a low-order approximation of the hypergraph which can carry significant information about the original hypergraph. Subsequent stages of a data analytics pipeline then can apply highly-tuned graph algorithms on the line graph to reveal important features. Given a hypergraph, computing the $s$-overlaps by exhaustively considering all pairwise entities can be computationally prohibitive. To tackle this challenge, we develop efficient algorithms to compute $s$-overlaps and the corresponding line graph of a hypergraph. We propose several heuristics to avoid execution of redundant work and improve performance of the $s$-overlap computation. Our parallel algorithm, combined with these heuristics, is orders of magnitude (more than $10\times$) faster than the naive algorithm in all cases and the SpGEMM algorithm with filtration in most cases (especially with large $s$ value).
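As a point of reference, the exhaustive pairwise baseline that the abstract calls computationally prohibitive can be sketched as follows. This is a minimal illustration, not the authors' parallel algorithm or heuristics; the function name `s_line_graph` and the toy hypergraph are hypothetical.

```python
from itertools import combinations


def s_line_graph(hyperedges, s):
    """Naive s-line graph: connect two hyperedges if they share >= s vertices.

    `hyperedges` maps a hyperedge id to an iterable of vertex ids.
    Every pair of hyperedges is examined, so the cost grows quadratically
    with the number of hyperedges.
    """
    edge_sets = {name: set(vertices) for name, vertices in hyperedges.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(edge_sets), 2)
        if len(edge_sets[a] & edge_sets[b]) >= s
    )


# Hypothetical toy hypergraph with three hyperedges.
hyperedges = {
    "e1": {1, 2, 3},
    "e2": {2, 3, 4},
    "e3": {4, 5},
}

one_line_graph = s_line_graph(hyperedges, 1)  # any shared vertex counts
two_line_graph = s_line_graph(hyperedges, 2)  # requires an overlap of width 2
```

Raising `s` keeps only the wider overlaps, which is the sense in which the $s$-overlap reflects the "strength" of a connection.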
Submitted 15 July, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Real-Time Refocusing using an FPGA-based Standard Plenoptic Camera
Authors:
Christopher Hahne,
Andrew Lumsdaine,
Amar Aggoun,
Vladan Velisavljevic
Abstract:
Plenoptic cameras are receiving increasing attention in scientific and commercial applications because they capture the entire structure of light in a scene, enabling optical transforms (such as focusing) to be applied computationally after the fact, rather than once and for all at the time a picture is taken. In many settings, real-time interactive performance is also desired, which in turn requires significant computational power due to the large amount of data required to represent a plenoptic image. Although GPUs have been shown to provide acceptable performance for real-time plenoptic rendering, their cost and power requirements make them prohibitive for embedded uses (such as in-camera). On the other hand, the computation to accomplish plenoptic rendering is well-structured, suggesting the use of specialized hardware. Accordingly, this paper presents an array of switch-driven Finite Impulse Response (FIR) filters, implemented on an FPGA, to accomplish high-throughput spatial-domain rendering. The proposed architecture provides a power-efficient rendering hardware design suitable for full-video applications as required in broadcasting or cinematography. A benchmark assessment of the proposed hardware implementation shows that real-time performance can readily be achieved, with an order-of-magnitude performance improvement over a GPU implementation and a three-orders-of-magnitude performance improvement over a general-purpose CPU implementation.
Submitted 9 October, 2020;
originally announced October 2020.
-
PISCES-RF: a liquid-cooled high-power steady-state helicon plasma device
Authors:
Saikat Chakraborty Thakur,
Michael J. Simmonds,
Juan F. Caneses,
Fengjen Chang,
Eric M. Hollmann,
Russell P. Doerner,
Richard Goulding,
Arnold Lumsdaine,
Juergen Rapp,
George R. Tynan
Abstract:
Radio-frequency (RF) driven helicon plasma sources can produce relatively high-density plasmas ($n > 10^{19}$ m$^{-3}$) at relatively moderate powers (< 2 kW) in argon. However, to produce similar high-density plasmas for fusion-relevant gases such as hydrogen, deuterium and helium, much higher RF powers are needed. For very high RF powers, thermal issues of the RF-transparent dielectric window, used in the RF source design, limit the plasma operation timescales. To mitigate this constraint, we have designed, built and tested a novel liquid-cooled RF window which allows steady-state operations at high power (up to 20 kW). De-ionized (DI) water, flowing between two concentric dielectric RF windows, is used as the coolant. We show that a full azimuthal blanket of DI water does not degrade plasma production. We obtain steady-state, high-density plasmas ($n > 10^{19}$ m$^{-3}$, $T_e \sim 5$ eV) using both argon and hydrogen. From calorimetry on the DI water, we measure the net heat that is being removed by the coolant at steady-state conditions. Using infra-red (IR) imaging, we calculate the constant plasma heat deposition and measure the final steady-state temperature distribution patterns on the inner surface of the ceramic layer. We find that the heat deposition pattern follows the helical shape of the antenna. We also show the consistency between the heat absorbed by the DI water, as measured by calorimetry, and the total heat due to the combined effect of the plasma heating and the absorbed RF. These results are being used to answer critical engineering questions for the 200 kW RF device (MPEX: Materials Plasma Exposure eXperiment) being designed at the Oak Ridge National Laboratory (ORNL) as a next-generation plasma material interaction (PMI) device.
Submitted 29 December, 2020; v1 submitted 22 May, 2020;
originally announced May 2020.
-
A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++
Authors:
Abhishek Kulkarni,
Andrew Lumsdaine
Abstract:
We evaluate and compare four contemporary and emerging runtimes for high-performance computing (HPC) applications: Cilk, Charm++, ParalleX and AM++. We compare along three bases: programming model, execution model and the implementation on an underlying machine model. The comparison study includes a survey of each runtime system's programming models, their corresponding execution models, their stated features, and performance and productivity goals. We first qualitatively compare these runtimes with programmability in mind. The differences in expressivity and programmability arising from their syntax and semantics are clearly enunciated through examples common to all runtimes. Then, the execution model of each runtime, which acts as a bridge between the programming model and the underlying machine model, is compared and contrasted to that of the others. We also evaluate four mature implementations of these runtimes, namely Intel Cilk++, Charm++ 6.5.1, AM++ and HPX, that embody the principles dictated by these models. With the emergence of the next generation of supercomputers, it is imperative for parallel programming models to evolve and address the integral challenges introduced by the increasing scale. Rather than picking a winner out of the four models under consideration, we end with a discussion of lessons learned, and of how such a study is instructive in the evolution of parallel programming frameworks to address these challenges.
Submitted 31 March, 2019;
originally announced April 2019.
-
Families of Distributed Memory Parallel Graph Algorithms from Self-Stabilizing Kernels-An SSSP Case Study
Authors:
Thejaka Kanewala,
Marcin Zalewski,
Martina Barnas,
Andrew Lumsdaine
Abstract:
Self-stabilizing algorithms are important because of their robustness and guaranteed convergence. Starting from any arbitrary state, a self-stabilizing algorithm is guaranteed to converge to a legitimate state. Those algorithms are not directly amenable to solving distributed graph processing problems when performance and scalability are important. In this paper, we present the "Abstract Graph Machine" (AGM) model, which can be used to convert self-stabilizing algorithms into forms suitable for distributed graph processing. An AGM is a mathematical model of parallel computation on graphs that adds work dependency and ordering to self-stabilizing algorithms. Using the AGM model, we show that some of the existing distributed Single Source Shortest Path (SSSP) algorithms are actually specializations of self-stabilizing SSSP. We extend the AGM model to apply more fine-grained orderings at different spatial levels to derive additional scalable variants of SSSP algorithms, essentially enabling the algorithm to be generated for a specific target architecture. Experimental results show that this approach can generate new algorithmic variants that outperform standard distributed algorithms for SSSP.
Submitted 18 June, 2017;
originally announced June 2017.
-
A Survey of Methods for Collective Communication Optimization and Tuning
Authors:
Udayanga Wickramasinghe,
Andrew Lumsdaine
Abstract:
New developments in HPC technology, in terms of increasing computing power on multi/many-core processors, high-bandwidth memory/IO subsystems and communication interconnects, have a direct impact on software and runtime system development. These advancements have become useful in producing high-performance collective communication interfaces that integrate efficiently on a wide variety of platforms and environments. However, the number of optimization options that show up with each new technology or software framework has resulted in a combinatorial explosion in the feature space for tuning collective parameters, such that finding the optimal set has become a nearly impossible task. The applicability of the algorithmic choices available for optimizing collective communication depends largely on the scalability requirement for a particular use case. This problem can be further exacerbated by any requirement to run collective problems at very large scales, such as in the case of exascale computing, at which tuning by brute force is impractical and may require many months of resources. Therefore, the application of statistical, data mining and artificial intelligence or more general hybrid learning models seems essential in many collective parameter optimization problems. We hope to explore the current state and the cutting edge of collective communication optimization and tuning methods and conclude with possible future directions for this problem.
Submitted 19 November, 2016;
originally announced November 2016.
-
Mathematical Foundations of the GraphBLAS
Authors:
Jeremy Kepner,
Peter Aaltonen,
David Bader,
Aydın Buluc,
Franz Franchetti,
John Gilbert,
Dylan Hutchison,
Manoj Kumar,
Andrew Lumsdaine,
Henning Meyerhenke,
Scott McMillan,
Jose Moreira,
John D. Owens,
Carl Yang,
Marcin Zalewski,
Timothy Mattson
Abstract:
The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. Mathematically, the GraphBLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix multiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of a small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.
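The link between incidence and adjacency matrices can be illustrated with a small sketch. It assumes one common convention, which the abstract itself does not spell out: for a directed graph, `e_out[k][i]` is 1 when edge `k` leaves vertex `i`, `e_in[k][j]` is 1 when edge `k` enters vertex `j`, and the adjacency matrix is the product of the transposed out-incidence matrix with the in-incidence matrix. The helper names are hypothetical and plain Python lists stand in for a GraphBLAS implementation.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    """Multiply two list-of-lists matrices with ordinary + and *."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Toy directed graph on vertices 0, 1, 2 with edges 0->1, 1->2, 0->2.
# Rows are edges, columns are vertices.
e_out = [
    [1, 0, 0],  # edge 0 leaves vertex 0
    [0, 1, 0],  # edge 1 leaves vertex 1
    [1, 0, 0],  # edge 2 leaves vertex 0
]
e_in = [
    [0, 1, 0],  # edge 0 enters vertex 1
    [0, 0, 1],  # edge 1 enters vertex 2
    [0, 0, 1],  # edge 2 enters vertex 2
]

# adjacency[i][j] counts the edges going from vertex i to vertex j.
adjacency = matmul(transpose(e_out), e_in)
```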
Submitted 13 July, 2016; v1 submitted 18 June, 2016;
originally announced June 2016.
-
Abstract Graph Machine
Authors:
Thejaka Amila Kanewala,
Marcin Zalewski,
Andrew Lumsdaine
Abstract:
An Abstract Graph Machine (AGM) is an abstract model for distributed memory parallel stabilizing graph algorithms. A stabilizing algorithm starts from a particular initial state and goes through a series of different state changes until it converges. The AGM adds work dependency to the stabilizing algorithm. The work is processed within the processing function. All processes in the system execute the same processing function. Before feeding work into the processing function, work is ordered using a strict weak ordering relation. The strict weak ordering relation divides work into equivalence classes; hence work within a single equivalence class can be processed in parallel, but work in different equivalence classes must be executed in the order in which the classes appear. The paper presents the AGM model, its semantics, and AGM models for several existing distributed memory parallel graph algorithms.
Submitted 28 April, 2016; v1 submitted 16 April, 2016;
originally announced April 2016.
-
The Anatomy of Large-Scale Distributed Graph Algorithms
Authors:
Jesun Sahariar Firoz,
Thejaka Amila Kanewala,
Marcin Zalewski,
Martina Barnas,
Andrew Lumsdaine
Abstract:
The increasing complexity of the software/hardware stack of modern supercomputers results in an explosion of parameters. Performance analysis becomes a truly experimental science, even more challenging in the presence of massive irregularity and data dependency. We analyze how the existing body of research handles the experimental aspect in the context of distributed graph algorithms (DGAs). We distinguish algorithm-level contributions, often prioritized by authors, from runtime-level concerns that are harder to place. We show that the runtime is such an integral part of DGAs that experimental results are difficult to interpret and extrapolate without understanding the properties of the runtime used. We argue that in order to gain understanding about the impact of runtimes, more information needs to be gathered. To begin this process, we provide an initial set of recommendations for describing DGA results based on our analysis of the current state of the field.
Submitted 23 July, 2015;
originally announced July 2015.
-
Standards for Graph Algorithm Primitives
Authors:
Tim Mattson,
David Bader,
Jon Berry,
Aydin Buluc,
Jack Dongarra,
Christos Faloutsos,
John Feo,
John Gilbert,
Joseph Gonzalez,
Bruce Hendrickson,
Jeremy Kepner,
Charles Leiserson,
Andrew Lumsdaine,
David Padua,
Stephen Poole,
Steve Reinhardt,
Mike Stonebraker,
Steve Wallach,
Andrew Yoo
Abstract:
It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
Submitted 2 August, 2014;
originally announced August 2014.
-
What Makes Code Hard to Understand?
Authors:
Michael Hansen,
Robert L. Goldstone,
Andrew Lumsdaine
Abstract:
What factors impact the comprehensibility of code? Previous research suggests that expectation-congruent programs should take less time to understand and be less prone to errors. We present an experiment in which participants with programming experience predict the exact output of ten small Python programs. We use subtle differences between program versions to demonstrate that seemingly insignificant notational changes can have profound effects on correctness and response times. Our results show that experience increases performance in most cases, but may hurt performance significantly when underlying assumptions about related code statements are violated.
Submitted 26 April, 2013; v1 submitted 18 April, 2013;
originally announced April 2013.
-
Extending Task Parallelism for Frequent Pattern Mining
Authors:
Prabhanjan Kambadur,
Amol Ghoting,
Anshul Gupta,
Andrew Lumsdaine
Abstract:
Algorithms for frequent pattern mining, a popular informatics application, have unique requirements that are not met by any of the existing parallel tools. In particular, such applications operate on extremely large data sets and have irregular memory access patterns. For efficient parallelization of such applications, it is necessary to support dynamic load balancing along with scheduling mechanisms that allow users to exploit data locality. Given these requirements, task parallelism is the most promising of the available parallel programming models. However, existing solutions for task parallelism schedule tasks implicitly and hence, custom scheduling policies that can exploit data locality cannot be easily employed. In this paper we demonstrate and characterize the speedup obtained in a frequent pattern mining application using a custom clustered scheduling policy in place of the popular Cilk-style policy. We present PFunc, a novel task parallel library whose customizable task scheduling and task priorities facilitated the implementation of our clustered scheduling policy.
Submitted 7 November, 2012;
originally announced November 2012.
-
Lazy Evaluation and Delimited Control
Authors:
Ronald Garcia,
Andrew Lumsdaine,
Amr Sabry
Abstract:
The call-by-need lambda calculus provides an equational framework for reasoning syntactically about lazy evaluation. This paper examines its operational characteristics. By a series of reasoning steps, we systematically unpack the standard-order reduction relation of the calculus and discover a novel abstract machine definition which, like the calculus, goes "under lambdas." We prove that machine evaluation is equivalent to standard-order evaluation. Unlike traditional abstract machines, delimited control plays a significant role in the machine's behavior. In particular, the machine replaces the manipulation of a heap using store-based effects with disciplined management of the evaluation stack using control-based effects. In short, state is replaced with control. To further articulate this observation, we present a simulation of call-by-need in a call-by-value language using delimited control operations.
Submitted 11 July, 2010; v1 submitted 26 March, 2010;
originally announced March 2010.
-
A Language for Generic Programming in the Large
Authors:
Jeremy G. Siek,
Andrew Lumsdaine
Abstract:
Generic programming is an effective methodology for developing reusable software libraries. Many programming languages provide generics and have features for describing interfaces, but none completely support the idioms used in generic programming. To address this need we developed the language G. The central feature of G is the concept, a mechanism for organizing constraints on generics that is inspired by the needs of modern C++ libraries. G provides modular type checking and separate compilation (even of generics). These characteristics support modular software development, especially the smooth integration of independently developed components. In this article we present the rationale for the design of G and demonstrate the expressiveness of G with two case studies: porting the Standard Template Library and the Boost Graph Library from C++ to G. The design of G shares much in common with the concept extension proposed for the next C++ Standard (the authors participated in its design) but there are important differences described in this article.
Submitted 16 August, 2007;
originally announced August 2007.