
doi 10.1007/978-3-031-50684-0_10

Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations

Authors: Jeremy J. Williams, David Tskhakaya, Stefan Costea, Ivy B. Peng, Marta Garcia-Gasulla, Stefano Markidis

Abstract: Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of collision Monte Carlo models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases by using several HPC profilers, such as perf, IPM, Extrae/Paraver, and Darshan tools. We find that the BIT1 sorting function on-node performance is the main performance bottleneck. Strong scaling tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance and self-synchronization are important factors impacting the performance of BIT1 on large-scale runs.

Submitted 28 June, 2023; originally announced June 2023.

Comments: Accepted by the Euro-Par 2023 workshops (TDLPP 2023), prepared in the standardized Springer LNCS format and consists of 12 pages, which includes the main text, references, and figures
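
Note on the strong-scaling figures quoted in the abstract above: the 77% and 96% values are read here as parallel efficiency, which is an assumption, since the abstract only says "parallel performance". A minimal Python sketch of how such a figure is commonly computed from two wall-clock timings follows. The function name and the timings in the usage section are hypothetical and are not taken from the paper.

```python
def parallel_efficiency(t_base, n_base, t_n, n):
    """Strong-scaling efficiency of a run on n ranks relative to a baseline.

    t_base: wall-clock time of the baseline run on n_base ranks
    t_n:    wall-clock time of the same problem on n ranks
    Returns measured speedup divided by ideal speedup (1.0 means perfect scaling).
    """
    speedup = t_base / t_n   # how much faster the larger run actually is
    ideal = n / n_base       # how much faster it would be with perfect scaling
    return speedup / ideal


if __name__ == "__main__":
    # Hypothetical timings, for illustration only.
    eff = parallel_efficiency(t_base=100.0, n_base=128, t_n=6.5, n=2560)
    print(f"parallel efficiency: {eff:.1%}")
```

The problem size is held fixed between the two runs, which is what distinguishes strong scaling from weak scaling.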

arXiv:2304.03748 [pdf, other]

Perspectives on AI Architectures and Co-design for Earth System Predictability

Authors: Maruti K. Mudunuru, James A. Ang, Mahantesh Halappanavar, Simon D. Hammond, Maya B. Gokhale, James C. Hoe, Tushar Krishna, Sarat S. Sreepathi, Matthew R. Norman, Ivy B. Peng, Philip W. Jones

Abstract: Recently, the U.S. Department of Energy (DOE), Office of Science, Biological and Environmental Research (BER), and Advanced Scientific Computing Research (ASCR) programs organized and held the Artificial Intelligence for Earth System Predictability (AI4ESP) workshop series. From this workshop, a critical conclusion that the DOE BER and ASCR community came to is the requirement to develop a new paradigm for Earth system predictability focused on enabling artificial intelligence (AI) across the field, lab, modeling, and analysis activities, called ModEx. The BER's `Model-Experimentation', ModEx, is an iterative approach that enables process models to generate hypotheses. The developed hypotheses inform field and laboratory efforts to collect measurement and observation data, which are subsequently used to parameterize, drive, and test model (e.g., process-based) predictions. A total of 17 technical sessions were held in this AI4ESP workshop series. This paper discusses the topic of the `AI Architectures and Co-design' session and associated outcomes. The AI Architectures and Co-design session included two invited talks, two plenary discussion panels, and three breakout rooms that covered specific topics, including: (1) DOE HPC Systems, (2) Cloud HPC Systems, and (3) Edge computing and Internet of Things (IoT). We also provide forward-looking ideas and perspectives on potential research in this co-design area that can be achieved by synergies with the other 16 session topics. These ideas include topics such as: (1) reimagining co-design, (2) data acquisition to distribution, (3) heterogeneous HPC solutions for integration of AI/ML and other data analytics like uncertainty quantification with earth system modeling and simulation, and (4) AI-enabled sensor integration into earth system measurements and observations. Such perspectives are a distinguishing aspect of this paper.

Submitted 7 April, 2023; originally announced April 2023.

Comments: 23 pages, 1 figure

arXiv:2211.02682 [pdf, other]

doi 10.1109/MCHPC56545.2022.00007

Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems

Authors: Jacob Wahlgren, Maya Gokhale, Ivy B. Peng

Abstract: Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving. Diverse workloads lead to a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of composable memory subsystems are studied -- fine-grained capacity provisioning and scalable bandwidth provisioning. We developed an emulator to explore the performance impact of various memory compositions. We also provide a profiler to identify the memory usage patterns in applications and their optimization opportunities. Seven scientific and six graph applications are evaluated on various emulated memory configurations. Three out of seven scientific applications had less than 10% performance impact when the pooled memory backed 75% of their memory footprint. The results also show that a dynamically configured high-bandwidth system can effectively support bandwidth-intensive unstructured mesh-based applications like OpenFOAM. Finally, we identify interference through shared memory pools as a practical challenge for adoption on HPC systems.

Submitted 4 November, 2022; originally announced November 2022.

Comments: 10 pages, 13 figures. Accepted for publication in Workshop on Memory Centric High Performance Computing (MCHPC'22) at SC22

arXiv:2106.05373 [pdf, other]

doi 10.1145/3468044.3468052

StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs

Authors: Artur Podobas, Martin Svedin, Steven W. D. Chien, Ivy B. Peng, Naresh Balaji Ravichandran, Pawel Herman, Anders Lansner, Stefano Markidis

Abstract: The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other -- less-known -- machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In this paper, we introduce StreamBrain -- a framework that allows neural networks based on BCPNN to be practically deployed in High-Performance Computing systems. StreamBrain is a domain-specific language (DSL), similar in concept to existing machine learning (ML) frameworks, and supports backends for CPUs, GPUs, and even FPGAs. We empirically demonstrate that StreamBrain can train the well-known ML benchmark dataset MNIST within seconds, and we are the first to demonstrate BCPNN on STL-10 size networks. We also show how StreamBrain can be used to train with custom floating-point formats and illustrate the impact of using different bfloat variations on BCPNN using FPGAs.

Submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted for publication at the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2021)

arXiv:2008.04397 [pdf, other]

doi 10.1109/SBAC-PAD49847.2020.00030

sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems

Authors: Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, Ivy B. Peng, Artur Podobas, Stefano Markidis

Abstract: Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation for exploiting such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to domain decomposition on CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the sputniPIC CPU OpenMP version performance. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were only possible using clusters.

Submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020)

arXiv:2008.04395 [pdf, other]

doi 10.1109/CLUSTER49012.2020.00046

tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

Authors: Steven W. D. Chien, Artur Podobas, Ivy B. Peng, Stefano Markidis

Abstract: Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan by performing two case studies on ImageNet image and Malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast tier storage. We also show that Darshan has the potential of being used as a runtime library for profiling and providing information for future optimization.

Submitted 11 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the 2020 International Conference on Cluster Computing (CLUSTER 2020)

arXiv:1910.09598 [pdf, other]

doi 10.1109/MCHPC49590.2019.00014

Performance Evaluation of Advanced Features in CUDA Unified Memory

Authors: Steven W. D. Chien, Ivy B. Peng, Stefano Markidis

Abstract: CUDA Unified Memory improves the GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to evaluate both in-memory and oversubscription performance. The results show that memory advises on the Intel-Volta/Pascal-PCIe platform bring negligible improvement for in-memory executions. However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to the basic CUDA Unified Memory. In contrast, the Power9-Volta-NVLink platform can substantially benefit from memory advises, achieving up to 34% performance gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases GPU page faults and results in considerable performance loss. The CUDA prefetch also shows different performance impact on the two platforms. It improves performance by up to 50% on the Intel-Volta/Pascal-PCIe platform but brings little benefit to the Power9-Volta-NVLink platform.

Submitted 21 October, 2019; originally announced October 2019.

Comments: Accepted for publication at Workshop on Memory Centric High Performance Computing (MCHPC'19) in SC19

arXiv:1910.07566 [pdf]

UMap: Enabling Application-driven Optimizations for Page Management

Authors: Ivy B. Peng, Marty McFadden, Eric Green, Keita Iwabuchi, Kai Wu, Dong Li, Roger Pearce, Maya Gokhale

Abstract: Leadership supercomputers feature a diversity of storage, from node-local persistent memory and NVMe SSDs to network-interconnected flash memory and HDD. Memory mapping files on different tiers of storage provides a uniform interface in applications. However, system-wide services like mmap are optimized for generality and lack flexibility for enabling application-specific optimizations. In this work, we present Umap to enable user-space page management that can be easily adapted to access patterns in applications and storage characteristics. Umap uses the userfaultfd mechanism to handle page faults in multi-threaded applications efficiently. By providing a data object abstraction layer, Umap is extensible to support various backing stores. The design of Umap supports dynamic load balancing and I/O decoupling for scalable performance. Umap also uses application hints to improve the selection of caching, prefetching, and eviction policies. We evaluate Umap in five benchmarks and real applications on two systems. Our results show that leveraging application knowledge for page management could substantially improve performance. On average, Umap achieved 1.25 to 2.5 times improvement using the adapted configurations compared to the system service.

Submitted 16 October, 2019; originally announced October 2019.

arXiv:1908.06503 [pdf]

doi 10.1145/3357526.3357568

System Evaluation of the Intel Optane Byte-addressable NVM

Authors: Ivy B. Peng, Maya B. Gokhale, Eric W. Green

Abstract: Byte-addressable non-volatile memory (NVM) features high density, DRAM-comparable performance, and persistence. These characteristics position NVM as a promising new tier in the memory hierarchy. Nevertheless, NVM has asymmetric read and write performance, and considerably higher write energy than DRAM. Our work provides an in-depth evaluation of the first commercially available byte-addressable NVM -- the Intel Optane DC persistent memory. The first part of our study quantifies the latency, bandwidth, power efficiency, and energy consumption under eight memory configurations. We also evaluate the real impact on in-memory graph processing workloads. Our results show that augmenting NVM with DRAM is essential, and the combination can effectively bridge the performance gap and provide reasonable performance with higher capacity. We also identify NUMA-related performance characteristics for accesses to memory on a remote socket. In the second part, we employ two fine-grained allocation policies to control traffic distribution between DRAM and NVM. Our results show that bandwidth spilling between DRAM and NVM could provide 2.0x bandwidth and enable 20% larger problems than using DRAM as a cache. Also, write isolation between DRAM and NVM could save up to 3.9x energy and improve bandwidth by 3.1x compared to DRAM-cached NVM. We establish a roofline model to explore power and energy efficiency at various distributions of read-only traffic. Our results show that NVM requires 1.8x lower power than DRAM for data-intensive workloads. Overall, applications can significantly optimize performance and power efficiency by adapting traffic distribution to NVM and DRAM through memory configurations and fine-grained policies to fully exploit the new memory device.

Submitted 18 August, 2019; originally announced August 2019.

Journal ref: In Proceedings of the International Symposium on Memory Systems, 2019

arXiv:1907.05917 [pdf, other]

doi 10.1007/978-3-030-43229-4_26

Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications

Authors: Steven W. D. Chien, Ivy B. Peng, Stefano Markidis

Abstract: Floating-point operations can significantly impact the accuracy and performance of scientific applications on large-scale parallel systems. Recently, an emerging floating-point format called Posit has attracted attention as an alternative to the standard IEEE floating-point formats because it could enable higher precision than IEEE formats using the same number of bits. In this work, we first explore the feasibility of Posit encoding in representative HPC applications by providing a 32-bit Posit NAS Parallel Benchmark (NPB) suite. Then, we evaluate the accuracy improvement in different HPC kernels compared to the IEEE 754 format. Our results indicate that using Posit encoding achieves optimized precision, ranging from 0.6 to 1.4 decimal digits, for all tested kernels and proxy-applications. Also, we quantify the overhead of the current software implementation of Posit encoding as 4x-19x that of the IEEE 754 hardware implementation. Our study highlights the potential of hardware implementations of Posit to benefit a broad range of HPC applications.

Submitted 12 July, 2019; originally announced July 2019.

Comments: Accepted for publication in PPAM 2019 conference

arXiv:1810.04110 [pdf, other]

MPI Windows on Storage for HPC Applications

Authors: Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, Stefano Markidis

Abstract: Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present their benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incur a 55% performance penalty on average. When using a Lustre parallel file system, asymmetric performance is observed with over 90% degradation in writing operations. Nonetheless, experimental results of a Distributed Hash Table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on the use of MPI one-sided communication, indicate that the overall penalty of MPI windows on storage can be negligible in most cases in real-world applications.

Submitted 9 October, 2018; originally announced October 2018.

arXiv:1807.03632 [pdf, other]

doi 10.1145/3203217.3205341

The SAGE Project: a Storage Centric Approach for Exascale Computing

Authors: Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Steven Wei-der Chien, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Shaun de Witt, Dirk Pleiter, Stefano Markidis

Abstract: SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a "storage centric" approach as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors where data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Julich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage that is accessible via the Clovis API and higher level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts, the prototype of the SAGE platform and discuss the software architecture of the SAGE system.

Submitted 6 July, 2018; originally announced July 2018.

Comments: Submitted to Computing Frontiers 2018. arXiv admin note: substantial text overlap with arXiv:1805.00556

arXiv:1805.00556 [pdf, other]

doi 10.1016/j.parco.2018.03.002

SAGE: Percipient Storage for Exascale Data Centric Computing

Authors: Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Stefano Markidis, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Dirk Pleiter, Shaun de Witt

Abstract: We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure as we head towards the era of Exascale computing - termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime, and provide the capability for Exascale class applications to use such a storage infrastructure. SAGE addresses the increasing overlaps between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, as a problem that has not been sufficiently dealt with for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.

Submitted 1 May, 2018; originally announced May 2018.

Journal ref: Parallel Computing, 23 March 2018

arXiv:1803.04014 [pdf, other]

doi 10.1109/IPDPSW.2018.00091

NVIDIA Tensor Core Programmability, Performance & Precision

Authors: Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, Jeffrey S. Vetter

Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.

Submitted 11 March, 2018; originally announced March 2018.

Comments: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018
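
Note on the precision loss mentioned in the abstract above: the effect of half-precision inputs with higher-precision accumulation can be illustrated without a GPU. The following is a minimal CPU-side Python sketch, not the paper's code and not a Tensor Core API. It rounds inputs to IEEE 754 half precision with the standard `struct` module and rounds the running sum to single precision after each step. The helper names are hypothetical, and rounding a double-precision sum to single precision only approximates true single-precision accumulation.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mixed_precision_dot(a, b):
    """Dot product with half-precision inputs and single-precision accumulation."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two half-precision values is exact in double precision;
        # the running sum is rounded to single precision after every addition.
        acc = to_single(acc + to_half(x) * to_half(y))
    return acc


if __name__ == "__main__":
    a = [0.1] * 16
    b = [0.1] * 16
    reference = sum(x * y for x, y in zip(a, b))  # double-precision reference
    mixed = mixed_precision_dot(a, b)
    print(f"reference={reference:.8f} mixed={mixed:.8f} "
          f"abs_error={abs(mixed - reference):.2e}")
```

In this sketch the error comes almost entirely from rounding the inputs to half precision, which is consistent with the abstract's remark that the loss stems from half-precision input.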

arXiv:1708.01306 [pdf, other]

MPI Streams for HPC Applications

Authors: Ivy Bo Peng, Stefano Markidis, Roberto Gioiosa, Gokcen Kestor, Erwin Laure

Abstract: Data streams are a sequence of data flowing between source and destination processes. Streaming is widely used for signal, image and video processing for its efficiency in pipelining and effectiveness in reducing demand for memory. The goal of this work is to extend the use of data streams to support both conventional scientific applications and emerging data analytic applications running on HPC platforms. We introduce an extension called MPIStream to the de-facto programming standard on HPC, MPI. MPIStream supports data streams either within a single application or among multiple applications. We present three use cases using MPI streams in HPC applications together with their parallel performance. We show the convenience of using MPI streams to support the needs from both traditional HPC and emerging data analytics applications running on supercomputers.

Submitted 3 August, 2017; originally announced August 2017.

Comments: Advances in Parallel Computing

arXiv:1708.01304 [pdf, other]

Preparing HPC Applications for the Exascale Era: A Decoupling Strategy

Authors: Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Erwin Laure, Stefano Markidis

Abstract: Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which might be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.

Submitted 3 August, 2017; originally announced August 2017.

Comments: The 46th International Conference on Parallel Processing (ICPP-2017)

arXiv:1704.08492 [pdf]

Extending Message Passing Interface Windows to Storage

Authors: Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng, Erwin Laure, Gokcen Kestor, Roberto Gioiosa

Abstract: This work presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with the current MPI implementations, enabling applications to target MPI windows in storage, memory or both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could potentially be helpful for a wide-range of use-cases and with low-overhead.

Submitted 27 April, 2017; originally announced April 2017.

arXiv:1704.08273 [pdf, other]

doi 10.1109/IPDPSW.2017.115

Exploring the Performance Benefit of Hybrid Memory System on HPC Environments

Authors: Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Erwin Laure, Stefano Markidis

Abstract: Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works together with conventional DRAM memory. Theoretically, HBM can provide 5x higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on the application performance by using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3x performance when compared to the performance obtained using only DRAM. On the contrary, applications with random memory access pattern are latency-bound and may suffer from performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.

Submitted 26 April, 2017; originally announced April 2017.

arXiv:1704.08244 [pdf, other]

doi 10.1109/HPCC-SmartCity-DSS.2016.0134

Idle Period Propagation in Message-Passing Applications

Authors: Ivy Bo Peng, Stefano Markidis, Erwin Laure, Gokcen Kestor, Roberto Gioiosa

Abstract: Idle periods on different processes of Message Passing applications are unavoidable. While the origin of idle periods on a single process is well understood as the effect of system and architectural random delays, yet it is unclear how these idle periods propagate from one process to another. It is important to understand idle period propagation in Message Passing applications as it allows application developers to design communication patterns avoiding idle period propagation and the consequent performance degradation in their applications. To understand idle period propagation, we introduce a methodology to trace idle periods when a process is waiting for data from a remote delayed process in MPI applications. We apply this technique in an MPI application that solves the heat equation to study idle period propagation on three different systems. We confirm that idle periods move between processes in the form of waves and that there are different stages in idle period propagation. Our methodology enables us to identify a self-synchronization phenomenon that occurs on two systems where some processes run slower than the other processes.

Submitted 26 April, 2017; originally announced April 2017.

Comments: 18th International Conference on High Performance Computing and Communications, IEEE, 2016

arXiv:1704.08239 [pdf, other]

doi 10.1109/HPCC-SmartCity-DSS.2016.0074

Exploring Application Performance on Emerging Hybrid-Memory Supercomputers

Authors: Ivy Bo Peng, Stefano Markidis, Erwin Laure, Gokcen Kestor, Roberto Gioiosa

Abstract: Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems with different memory technologies working side-by-side. A critical question is whether at large scale existing HPC applications and emerging data-analytics workloads will have performance improvement or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of "fast" and "slow" memories. We then analyze performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics to compare traditional and hybrid-memory systems. Our results show that data analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvement.

Submitted 26 April, 2017; originally announced April 2017.

Comments: 18th International Conference on High Performance Computing and Communications, IEEE, 2016

arXiv:1704.03803 [pdf, other]

doi 10.1002/2017JA024186

Global three-dimensional simulation of Earth's dayside reconnection using a two-way coupled magnetohydrodynamics with embedded particle-in-cell model: initial results

Authors: Yuxi Chen, Gabor Toth, Paul Cassak, Xianzhe Jia, Tamas I. Gombosi, James A. Slavin, Stefano Markidis, Ivy Bo Peng, Vania K. Jordanova

Abstract: We perform a three-dimensional (3D) global simulation of Earth's magnetosphere with kinetic reconnection physics to study the flux transfer events (FTEs) and dayside magnetic reconnection with the recently developed magnetohydrodynamics with embedded particle-in-cell model (MHD-EPIC). During the one-hour long simulation, the FTEs are generated quasi-periodically near the subsolar point and move toward the poles. We find the magnetic field signature of FTEs at their early formation stage is similar to a `crater FTE', which is characterized by a magnetic field strength dip at the FTE center. After the FTE core field grows to a significant value, it becomes an FTE with typical flux rope structure. When an FTE moves across the cusp, reconnection between the FTE field lines and the cusp field lines can dissipate the FTE. The kinetic features are also captured by our model. A crescent electron phase space distribution is found near the reconnection site. A similar distribution is found for ions at the location where the Larmor electric field appears. The lower hybrid drift instability (LHDI) along the current sheet direction also arises at the interface of magnetosheath and magnetosphere plasma. The LHDI electric field is about 8 mV/m and its dominant wavelength relative to the electron gyroradius agrees reasonably with MMS observations.

Submitted 12 April, 2017; originally announced April 2017.

arXiv:1512.02018 [pdf, other]

doi 10.3847/0004-637X/819/1/52

Magnetic null points in kinetic simulations of space plasmas

Authors: Vyacheslav Olshevsky, Jan Deca, Andrey Divin, Ivy Bo Peng, Stefano Markidis, Maria Elena Innocenti, Emanuele Cazzola, Giovanni Lapenta

Abstract: We present a systematic attempt to study magnetic null points and the associated magnetic energy conversion in kinetic Particle-in-Cell simulations of various plasma configurations. We address three-dimensional simulations performed with the semi-implicit kinetic electromagnetic code iPic3D in different setups: variations of a Harris current sheet, dipolar and quadrupolar magnetospheres interacting with the solar wind; and a relaxing turbulent configuration with multiple null points. Spiral nulls are more likely created in space plasmas: in all our simulations except lunar magnetic anomaly and quadrupolar mini-magnetosphere the number of spiral nulls prevails over the number of radial nulls by a factor of 3-9. We show that often magnetic nulls do not indicate the regions of intensive energy dissipation. Energy dissipation events caused by topological bifurcations at radial nulls are rather rare and short-lived. The so-called X-lines formed by the radial nulls in the Harris current sheet and lunar magnetic anomaly simulations are rather stable and do not exhibit any energy dissipation. Energy dissipation is more powerful in the vicinity of spiral nulls enclosed by magnetic flux ropes with strong currents at their axes (their cross-sections resemble 2D magnetic islands). These null lines reminiscent of Z-pinches efficiently dissipate magnetic energy due to secondary instabilities such as the two-stream or kinking instability, accompanied by changes in magnetic topology. Current enhancements accompanied by spiral nulls may signal magnetic energy conversion sites in the observational data.

Submitted 7 December, 2015; originally announced December 2015.

Comments: Nordita program on Magnetic Reconnection in Plasmas 2015

Report number: NORDITA-2015-127

Journal ref: The Astrophysical Journal 2016, Volume 819, Number 1

Showing 1–22 of 22 results for author: Peng, I B