Enabling particle applications for exascale computing platforms
Authors:
Susan M Mniszewski,
James Belak,
Jean-Luc Fattebert,
Christian FA Negre,
Stuart R Slattery,
Adetokunbo A Adedoyin,
Robert F Bird,
Choongseok Chang,
Guangye Chen,
Stephane Ethier,
Shane Fogerty,
Salman Habib,
Christoph Junghans,
Damien Lebrun-Grandie,
Jamaludin Mohd-Yusof,
Stan G Moore,
Daniel Osei-Kuffuor,
Steven J Plimpton,
Adrian Pope,
Samuel Temple Reeve,
Lee Ricketson,
Aaron Scheinberg,
Amil Y Sharma,
Michael E Wall
Abstract:
The Exascale Computing Project (ECP) is invested in co-design to assure that key applications are ready for exascale computing. Within ECP, the Co-design Center for Particle Applications (CoPA) is addressing challenges faced by particle-based applications across four sub-motifs: short-range particle-particle interactions (e.g., those which often dominate molecular dynamics (MD) and smoothed particle hydrodynamics (SPH) methods), long-range particle-particle interactions (e.g., electrostatic MD and gravitational N-body), particle-in-cell (PIC) methods, and linear-scaling electronic structure and quantum molecular dynamics (QMD) algorithms. Our crosscutting co-designed technologies fall into two categories: proxy applications (or apps) and libraries. Proxy apps are vehicles used to evaluate the viability of incorporating various types of algorithms, data structures, and architecture-specific optimizations and the associated trade-offs; examples include ExaMiniMD, CabanaMD, CabanaPIC, and ExaSP2. Libraries are modular instantiations that multiple applications can utilize or be built upon; CoPA has developed the Cabana particle library, PROGRESS/BML libraries for QMD, and the SWFFT and fftMPI parallel FFT libraries. Success is measured by identifiable lessons learned that are translated either directly into parent production application codes or into libraries, with demonstrated performance and/or productivity improvement. The libraries and their use in CoPA's ECP application partner codes are also addressed.
Submitted 19 September, 2021;
originally announced September 2021.
Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a ReRAM Analog Neural Training Accelerator
Authors:
Matthew J. Marinella,
Sapan Agarwal,
Alexander Hsia,
Isaac Richter,
Robin Jacobs-Gedrim,
John Niroula,
Steven J. Plimpton,
Engin Ipek,
Conrad D. James
Abstract:
Neural networks are an increasingly attractive algorithm for natural language processing and pattern recognition. Deep networks with >50M parameters are made possible by modern GPU clusters operating at <50 pJ per op and, more recently, production accelerators capable of <5 pJ per operation at the board level. However, with the slowing of CMOS scaling, new paradigms will be required to achieve the next several orders of magnitude in performance-per-watt gains. Using an analog resistive memory (ReRAM) crossbar to perform key matrix operations in an accelerator is an attractive option. This work presents a detailed design, using a state-of-the-art 14/16 nm PDK, of an analog crossbar circuit block designed to process three key kernels required in training and inference of neural networks. A detailed circuit and device-level analysis of energy, latency, area, and accuracy is given and compared to relevant designs using standard digital ReRAM and SRAM operations. It is shown that the analog accelerator has a 270x energy and 540x latency advantage over a similar block utilizing only digital ReRAM and takes only 11 fJ per multiply and accumulate (MAC). Compared to an SRAM-based accelerator, the energy is 430x better and the latency is 34x better. Although training accuracy is degraded in the analog accelerator, several options to improve this are presented. The possible gains over a similar digital-only version of this accelerator block suggest that continued optimization of analog resistive memories is valuable. This detailed circuit and device analysis of a training accelerator may serve as a foundation for further architecture-level studies.
Submitted 16 February, 2018; v1 submitted 31 July, 2017;
originally announced July 2017.