-
Understanding Large-Scale Plasma Simulation Challenges for Fusion Energy on Supercomputers
Authors:
Jeremy J. Williams,
Ashish Bhole,
Dylan Kierans,
Matthias Hoelzl,
Ihor Holod,
Weikang Tang,
David Tskhakaya,
Stefan Costea,
Leon Kos,
Ales Podolnik,
Jakub Hromadka,
JOREK Team,
Erwin Laure,
Stefano Markidis
Abstract:
Understanding plasma instabilities is essential for achieving sustainable fusion energy, with large-scale plasma simulations playing a crucial role in both the design and development of next-generation fusion energy devices and the modelling of industrial plasmas. To achieve sustainable fusion energy, it is essential to accurately model and predict plasma behavior under extreme conditions, requiring sophisticated simulation codes capable of capturing the complex interaction between plasma dynamics, magnetic fields, and material surfaces. In this work, we conduct a comprehensive HPC analysis of two prominent plasma simulation codes, BIT1 and JOREK, to advance understanding of plasma behavior in fusion energy applications. Our focus is on evaluating JOREK's computational efficiency and scalability for simulating non-linear MHD phenomena in tokamak fusion devices. The motivation behind this work stems from the urgent need to advance our understanding of plasma instabilities in magnetically confined fusion devices. Enhancing JOREK's performance on supercomputers improves fusion plasma code predictability, enabling more accurate modelling and faster optimization of fusion designs, thereby contributing to sustainable fusion energy. In prior studies, we analysed BIT1, a massively parallel Particle-in-Cell (PIC) code for studying plasma-material interactions in fusion devices. Our investigations into BIT1's computational requirements and scalability on advanced supercomputing architectures yielded valuable insights. Through detailed profiling and performance analysis, we have identified the primary bottlenecks and implemented optimization strategies, significantly enhancing parallel performance. This previous work serves as a foundation for our present endeavours.
Submitted 29 June, 2024;
originally announced July 2024.
-
Understanding the Impact of openPMD on BIT1, a Particle-in-Cell Monte Carlo Code, through Instrumentation, Monitoring, and In-Situ Analysis
Authors:
Jeremy J. Williams,
Stefan Costea,
Allen D. Malony,
David Tskhakaya,
Leon Kos,
Ales Podolnik,
Jakub Hromadka,
Kevin Huck,
Erwin Laure,
Stefano Markidis
Abstract:
Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and storage. This integration not only enhanced data management but also improved write throughput and storage efficiency. In this work, we delve deeper into the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. Utilizing cutting-edge profiling and monitoring tools such as gprof, CrayPat, Cray Apprentice2, IPM, and Darshan, we dissect BIT1's performance post-integration, shedding light on computation, communication, and I/O operations. Fine-grained instrumentation offers insights into BIT1's runtime behavior, while immediate monitoring aids in understanding system dynamics and resource utilization patterns, facilitating proactive performance optimization. Advanced visualization techniques further enrich our understanding, enabling the optimization of BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation.
Submitted 27 June, 2024;
originally announced June 2024.
-
Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems
Authors:
Jonathan Vincent,
Jing Gong,
Martin Karp,
Adam Peplinski,
Niclas Jansson,
Artur Podobas,
Andreas Jocksch,
Jie Yao,
Fazle Hussain,
Stefano Markidis,
Matts Karlsson,
Dirk Pleiter,
Erwin Laure,
Philipp Schlatter
Abstract:
We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers $Re_τ=360$ and $Re_τ=550$, based on friction velocity and pipe radius. The strong scaling is tested on several GPU-enabled HPC systems, including the Swiss Piz Daint system, TACC's Longhorn, Jülich's JUWELS Booster, and Berzelius in Sweden. The performance results show that a speed-up of between 3x and 5x can be achieved using the GPU-accelerated version compared with the CPU version on these different systems. The run-time for 20 timesteps reduces from 43.5 to 13.2 seconds when the number of GPUs is increased from 64 to 512 for the $Re_τ=550$ case on the JUWELS Booster system. This illustrates the potential of the GPU-accelerated version for high throughput. At the same time, the strong scaling limit is significantly larger for GPUs, at about $2000-5000$ elements per rank, compared to about $50-100$ for a CPU rank.
Submitted 4 November, 2021; v1 submitted 8 September, 2021;
originally announced September 2021.
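The JUWELS Booster figures quoted in the abstract above can be turned into a strong-scaling efficiency with a few lines of arithmetic. The sketch below is illustrative only: the function name `strong_scaling` is hypothetical, and the only inputs are the run times and GPU counts stated in the abstract.

```python
def strong_scaling(t_base, t_scaled, n_base, n_scaled):
    """Return (speed-up, parallel efficiency) for a fixed problem size."""
    speedup = t_base / t_scaled      # measured speed-up
    ideal = n_scaled / n_base        # speed-up under perfect scaling
    return speedup, speedup / ideal


# Figures from the abstract: 20 timesteps, Re_tau = 550, JUWELS Booster.
speedup, efficiency = strong_scaling(43.5, 13.2, 64, 512)
print(f"speed-up: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# prints "speed-up: 3.30x, parallel efficiency: 41%"
```

With eight times as many GPUs the run is roughly 3.3 times faster, which is consistent with the abstract's remark that GPUs reach their strong-scaling limit at a much larger number of elements per rank than CPUs.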
-
Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code
Authors:
Chaitanya Prasad Sishtla,
Steven W. D. Chien,
Vyacheslav Olshevsky,
Erwin Laure,
Stefano Markidis
Abstract:
iPIC3D is a widely used massively parallel Particle-in-Cell code for the simulation of space plasmas. However, its current implementation does not support execution on multiple GPUs. In this paper, we describe the porting of the iPIC3D particle mover to GPUs and the optimization steps to increase the performance and parallel scaling on multiple GPUs. We analyze the strong scaling of the mover on two GPU clusters and evaluate its performance and acceleration. The optimized GPU versions, which use pinned memory and asynchronous data prefetching, outperform their corresponding CPU versions by 5-10x on two different systems equipped with NVIDIA K80 and V100 GPUs.
Submitted 7 April, 2019;
originally announced April 2019.
-
TensorFlow Doing HPC
Authors:
Steven W. D. Chien,
Stefano Markidis,
Vyacheslav Olshevsky,
Yaroslav Bulatov,
Erwin Laure,
Jeffrey S. Vetter
Abstract:
TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow was initially designed for developing Machine Learning (ML) applications, it in fact aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for developing HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential to emerge also as an HPC programming framework for heterogeneous supercomputers.
Submitted 11 March, 2019;
originally announced March 2019.
-
Particle-in-Cell Simulations of Plasma Dynamics in Cometary Environment
Authors:
Chaitanya Prasad Sishtla,
Vyacheslav Olshevsky,
Steven W. D. Chien,
Stefano Markidis,
Erwin Laure
Abstract:
We perform and analyze global Particle-in-Cell (PIC) simulations of the interaction between solar wind and an outgassing comet with the goal of studying the plasma kinetic dynamics of a cometary environment. To achieve this, we design and implement a new numerical method in the iPIC3D code to model outgassing from the comet: new plasma particles are ejected from the comet "surface" at each computational cycle. Our simulations show that a bow shock is formed as a result of the interaction between solar wind and outgassed particles. The analysis of distribution functions for the PIC simulations shows that, at the bow shock, part of the incoming solar wind ions are reflected while electrons are heated. This work attempts to reveal kinetic effects in the atmosphere of an outgassing comet using a fully kinetic Particle-in-Cell model.
Submitted 28 January, 2019;
originally announced January 2019.
-
Exploring the Vision Processing Unit as Co-processor for Inference
Authors:
Sergio Rivas-Gomez,
Antonio J. Peña,
David Moloney,
Erwin Laure,
Stefano Markidis
Abstract:
The success of the exascale supercomputer is widely argued to depend on novel breakthroughs in technology that effectively reduce the power consumption and thermal dissipation requirements. In this work, we consider the integration of co-processors in high-performance computing (HPC) to enable low-power, seamless computation offloading of certain operations. In particular, we explore the so-called Vision Processing Unit (VPU), a highly-parallel vector processor with a power envelope of less than 1W. We evaluate this chip during inference using a pre-trained GoogLeNet convolutional network model and a large image dataset from the ImageNet ILSVRC challenge. Preliminary results indicate that a multi-VPU configuration provides similar performance compared to reference CPU and GPU implementations, while reducing the thermal-design power (TDP) by up to 8x in comparison.
Submitted 9 October, 2018;
originally announced October 2018.
-
Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks
Authors:
Sergio Rivas-Gomez,
Sai Narasimhamurthy,
Keeran Brabazon,
Oliver Perks,
Erwin Laure,
Stefano Markidis
Abstract:
In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing processes to communicate and synchronize using solely one-sided operations. Hence, we effectively increase the performance in situations where the workload per process is unexpectedly unbalanced. Using a Word-Count implementation and a large dataset from the Purdue MapReduce Benchmarks Suite (PUMA), we demonstrate that our approach can provide up to 23% performance improvement on average compared to a reference MapReduce implementation that uses state-of-the-art MPI collective communication and I/O.
Submitted 9 October, 2018;
originally announced October 2018.
-
MPI Windows on Storage for HPC Applications
Authors:
Sergio Rivas-Gomez,
Roberto Gioiosa,
Ivy Bo Peng,
Gokcen Kestor,
Sai Narasimhamurthy,
Erwin Laure,
Stefano Markidis
Abstract:
Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present their benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incur a 55% performance penalty on average. When using a Lustre parallel file system, asymmetric performance is observed with over 90% degradation in writing operations. Nonetheless, experimental results of a Distributed Hash Table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on the use of MPI one-sided communication, indicate that the overall penalty of MPI windows on storage can be negligible in most cases in real-world applications.
Submitted 9 October, 2018;
originally announced October 2018.
-
Characterizing Deep-Learning I/O Workloads in TensorFlow
Authors:
Steven W. D. Chien,
Stefano Markidis,
Chaitanya Prasad Sishtla,
Luis Santos,
Pawel Herman,
Sai Narasimhamurthy,
Erwin Laure
Abstract:
The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during the training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerators for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment.
Submitted 6 October, 2018;
originally announced October 2018.
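The burst-buffer idea described in the abstract above (checkpoint to a fast tier, then drain to a slower tier in the background) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' implementation and not a TensorFlow API: the class `BurstBuffer`, its methods, and the file name are hypothetical, and two temporary directories stand in for the fast and slow storage tiers.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class BurstBuffer:
    """Write checkpoints to a fast tier, then copy them to a slow tier in the background."""

    def __init__(self, fast_dir, slow_dir):
        self.fast_dir = Path(fast_dir)
        self.slow_dir = Path(slow_dir)
        self._pending = []

    def checkpoint(self, name, payload: bytes):
        fast_path = self.fast_dir / name
        fast_path.write_bytes(payload)  # blocking write, but only to fast storage
        worker = threading.Thread(
            target=shutil.copy2, args=(fast_path, self.slow_dir / name)
        )
        worker.start()  # asynchronous copy to slow storage
        self._pending.append(worker)
        return fast_path

    def flush(self):
        """Block until every background copy has finished."""
        for worker in self._pending:
            worker.join()
        self._pending.clear()


# Hypothetical usage: temporary directories emulate the two storage tiers.
with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as slow:
    bb = BurstBuffer(fast, slow)
    bb.checkpoint("step_0001.ckpt", b"model-state")
    bb.flush()
    drained = (Path(slow) / "step_0001.ckpt").read_bytes()
```

The training loop only waits for the write to the fast tier; the copy to the slow tier overlaps with subsequent computation, and `flush` is needed only before shutdown or before the fast tier is reclaimed.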
-
PolyPIC: the Polymorphic-Particle-in-Cell Method for Fluid-Kinetic Coupling
Authors:
Stefano Markidis,
Vyacheslav Olshevsky,
Chaitanya Prasad Sishtla,
Steven Wei-der Chien,
Erwin Laure,
Giovanni Lapenta
Abstract:
Particle-in-Cell (PIC) methods are widely used computational tools for fluid and kinetic plasma modeling. While both the fluid and kinetic PIC approaches have been successfully used to target either kinetic or fluid simulations, little was done to combine fluid and kinetic particles under the same PIC framework. This work addresses this issue by proposing a new PIC method, PolyPIC, that uses polymorphic computational particles. In this numerical scheme, particles can be either kinetic or fluid, and fluid particles can become kinetic when necessary, e.g. particles undergoing a strong acceleration. We design and implement the PolyPIC method, and test it against the Landau damping of Langmuir and ion acoustic waves, two-stream instability and sheath formation. We unify the fluid and kinetic PIC methods under one common framework comprising both fluid and kinetic particles, providing a tool for adaptive fluid-kinetic coupling in plasma simulations.
Submitted 13 July, 2018;
originally announced July 2018.
-
The SAGE Project: a Storage Centric Approach for Exascale Computing
Authors:
Sai Narasimhamurthy,
Nikita Danilov,
Sining Wu,
Ganesan Umanesan,
Steven Wei-der Chien,
Sergio Rivas-Gomez,
Ivy Bo Peng,
Erwin Laure,
Shaun de Witt,
Dirk Pleiter,
Stefano Markidis
Abstract:
SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a "storage centric" approach as it is capable of storing and processing large data volumes at the Exascale regime.
SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors where data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Julich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage that is accessible via the Clovis API and higher level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts.
The objective of this paper is to present the SAGE project concepts, the prototype of the SAGE platform and discuss the software architecture of the SAGE system.
Submitted 6 July, 2018;
originally announced July 2018.
-
Exploring Scientific Application Performance Using Large Scale Object Storage
Authors:
Steven Wei-der Chien,
Stefano Markidis,
Rami Karim,
Erwin Laure,
Sai Narasimhamurthy
Abstract:
One of the major performance and scalability bottlenecks in large scientific applications is parallel reading and writing to supercomputer I/O systems. The usage of parallel file systems and the consistency requirements of POSIX, which all the traditional HPC parallel I/O interfaces adhere to, pose limitations to the scalability of scientific applications. Object storage is a widely used storage technology in cloud computing and is more frequently proposed for HPC workloads to address and improve the current scalability and performance of I/O in scientific applications. While object storage is a promising technology, it is still unclear how scientific applications will use object storage and what the main performance benefits will be. This work addresses these questions by emulating an object storage used by a traditional scientific application and evaluating potential performance benefits. We show that scientific applications can benefit from the usage of object storage on large scales.
Submitted 6 July, 2018;
originally announced July 2018.
-
SAGE: Percipient Storage for Exascale Data Centric Computing
Authors:
Sai Narasimhamurthy,
Nikita Danilov,
Sining Wu,
Ganesan Umanesan,
Stefano Markidis,
Sergio Rivas-Gomez,
Ivy Bo Peng,
Erwin Laure,
Dirk Pleiter,
Shaun de Witt
Abstract:
We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure as we head towards the era of Exascale computing - termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime, and provide the capability for Exascale class applications to use such a storage infrastructure. SAGE addresses the increasing overlaps between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, as a problem that has not been sufficiently dealt with for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.
Submitted 1 May, 2018;
originally announced May 2018.
-
NVIDIA Tensor Core Programmability, Performance & Precision
Authors:
Stefano Markidis,
Steven Wei Der Chien,
Erwin Laure,
Ivy Bo Peng,
Jeffrey S. Vetter
Abstract:
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
Submitted 11 March, 2018;
originally announced March 2018.
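The trade-off mentioned in the abstract above, precision lost through half-precision inputs and recovered at the cost of extra computation, can be illustrated on a plain dot product. The sketch below is an assumption-laden illustration, not the paper's method: the abstract does not say how the loss is reduced, so the residual-correction scheme shown here (keeping the rounding residuals in half precision and adding first-order correction terms) is one possible approach chosen for illustration. All function names are hypothetical, and half precision is emulated on the CPU with the standard `struct` module rather than on Tensor Cores.

```python
import random
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot(a, b):
    """Dot product accumulated in double precision."""
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
n = 1024
a = [random.uniform(-1.0, 1.0) for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

exact = dot(a, b)

# Half-precision inputs with higher-precision accumulation (mixed-precision pattern).
a_h = [to_half(x) for x in a]
b_h = [to_half(x) for x in b]
mixed = dot(a_h, b_h)

# Assumed refinement: store the rounding residuals in half precision as well and
# add the first-order correction terms, at the cost of two extra dot products.
ra = [to_half(x - xh) for x, xh in zip(a, a_h)]
rb = [to_half(x - xh) for x, xh in zip(b, b_h)]
refined = mixed + dot(ra, b_h) + dot(a_h, rb)

err_mixed = abs(mixed - exact)
err_refined = abs(refined - exact)
print(f"error with half-precision inputs: {err_mixed:.3e}")
print(f"error after residual correction:  {err_refined:.3e}")
```

The corrected result needs three products instead of one, which mirrors the abstract's point that accuracy can be bought back with additional computation.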
-
MPI Streams for HPC Applications
Authors:
Ivy Bo Peng,
Stefano Markidis,
Roberto Gioiosa,
Gokcen Kestor,
Erwin Laure
Abstract:
Data streams are a sequence of data flowing between source and destination processes. Streaming is widely used for signal, image and video processing for its efficiency in pipelining and effectiveness in reducing demand for memory. The goal of this work is to extend the use of data streams to support both conventional scientific applications and emerging data analytic applications running on HPC platforms. We introduce an extension called MPIStream to the de-facto programming standard on HPC, MPI. MPIStream supports data streams either within a single application or among multiple applications. We present three use cases using MPI streams in HPC applications together with their parallel performance. We show the convenience of using MPI streams to support the needs from both traditional HPC and emerging data analytics applications running on supercomputers.
Submitted 3 August, 2017;
originally announced August 2017.
-
Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
Authors:
Ivy Bo Peng,
Roberto Gioiosa,
Gokcen Kestor,
Erwin Laure,
Stefano Markidis
Abstract:
Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which might be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems.
Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.
Submitted 3 August, 2017;
originally announced August 2017.
-
Extending Message Passing Interface Windows to Storage
Authors:
Sergio Rivas-Gomez,
Stefano Markidis,
Ivy Bo Peng,
Erwin Laure,
Gokcen Kestor,
Roberto Gioiosa
Abstract:
This work presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with the current MPI implementations, enabling applications to target MPI windows in storage, memory or both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could potentially be helpful for a wide range of use cases with low overhead.
Submitted 27 April, 2017;
originally announced April 2017.
-
Exploring the Performance Benefit of Hybrid Memory System on HPC Environments
Authors:
Ivy Bo Peng,
Roberto Gioiosa,
Gokcen Kestor,
Erwin Laure,
Stefano Markidis
Abstract:
Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works together with conventional DRAM memory. Theoretically, HBM can provide 5x higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on the application performance by using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3x performance when compared to the performance obtained using only DRAM. On the contrary, applications with random memory access pattern are latency-bound and may suffer from performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.
Submitted 26 April, 2017;
originally announced April 2017.
-
Idle Period Propagation in Message-Passing Applications
Authors:
Ivy Bo Peng,
Stefano Markidis,
Erwin Laure,
Gokcen Kestor,
Roberto Gioiosa
Abstract:
Idle periods on different processes of Message Passing applications are unavoidable. While the origin of idle periods on a single process is well understood as the effect of system and architectural random delays, it is unclear how these idle periods propagate from one process to another. It is important to understand idle period propagation in Message Passing applications as it allows application developers to design communication patterns that avoid idle period propagation and the consequent performance degradation in their applications. To understand idle period propagation, we introduce a methodology to trace idle periods when a process is waiting for data from a remote delayed process in MPI applications. We apply this technique in an MPI application that solves the heat equation to study idle period propagation on three different systems. We confirm that idle periods move between processes in the form of waves and that there are different stages in idle period propagation. Our methodology enables us to identify a self-synchronization phenomenon that occurs on two systems where some processes run slower than the other processes.
Submitted 26 April, 2017;
originally announced April 2017.
-
Exploring Application Performance on Emerging Hybrid-Memory Supercomputers
Authors:
Ivy Bo Peng,
Stefano Markidis,
Erwin Laure,
Gokcen Kestor,
Roberto Gioiosa
Abstract:
Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems with different memory technologies working side-by-side. A critical question is whether at large scale existing HPC applications and emerging data-analytics workloads will have performance improvement or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of "fast" and "slow" memories. We then analyze performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics to compare traditional and hybrid-memory systems. Our results show that data analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvement.
Submitted 26 April, 2017;
originally announced April 2017.
-
Design and implementation of the advanced cloud privacy threat modeling
Authors:
Ali Gholami,
Anna-Sara Lind,
Jane Reichel,
Jan-Eric Litton,
Ake Edlund,
Erwin Laure
Abstract:
Privacy-preservation for sensitive data has become a challenging issue in cloud computing. Threat modeling as a part of requirements engineering in secure software development provides a structured approach for identifying attacks and proposing countermeasures against the exploitation of vulnerabilities in a system. This paper describes an extension of the Cloud Privacy Threat Modeling (CPTM) methodology for privacy threat modeling in relation to processing sensitive data in cloud computing environments. It describes the modeling methodology that involved applying Method Engineering to specify characteristics of a cloud privacy threat modeling methodology, different steps in the proposed methodology and corresponding products. In addition, a case study has been implemented as a proof of concept to demonstrate the usability of the proposed methodology. We believe that the extended methodology facilitates the application of a privacy-preserving cloud software development approach from requirements engineering to design.
Submitted 3 April, 2016;
originally announced April 2016.
-
Advanced Cloud Privacy Threat Modeling
Authors:
Ali Gholami,
Erwin Laure
Abstract:
Privacy-preservation for sensitive data has become a challenging issue in cloud computing. Threat modeling as a part of requirements engineering in secure software development provides a structured approach for identifying attacks and proposing countermeasures against the exploitation of vulnerabilities in a system. This paper describes an extension of the Cloud Privacy Threat Modeling (CPTM) methodology for privacy threat modeling in relation to processing sensitive data in cloud computing environments. It describes the modeling methodology that involved applying Method Engineering to specify characteristics of a cloud privacy threat modeling methodology, different steps in the proposed methodology and corresponding products. We believe that the extended methodology facilitates the application of a privacy-preserving cloud software development approach from requirements engineering to design.
Submitted 7 January, 2016;
originally announced January 2016.
-
Security and Privacy of Sensitive Data in Cloud Computing: A Survey of Recent Developments
Authors:
Ali Gholami,
Erwin Laure
Abstract:
Cloud computing is revolutionizing many ecosystems by providing organizations with computing resources featuring easy deployment, connectivity, configuration, automation and scalability. This paradigm shift raises a broad range of security and privacy issues that must be taken into consideration. Multi-tenancy, loss of control, and trust are key challenges in cloud computing environments. This paper reviews the existing technologies and a wide array of both earlier and state-of-the-art projects on cloud security and privacy. We categorize the existing research according to the cloud reference architecture orchestration, resource control, physical resource, and cloud service management layers, in addition to reviewing the existing developments in privacy-preserving sensitive data approaches in cloud computing such as privacy threat modeling and privacy enhancing protocols and solutions.
Submitted 7 January, 2016;
originally announced January 2016.
-
Signatures of Secondary Collisionless Magnetic Reconnection Driven by Kink Instability of a Flux Rope
Authors:
S. Markidis,
G. Lapenta,
G. L. Delzanno,
P. Henri,
M. V. Goldman,
D. L. Newman,
T. Intrator,
E. Laure
Abstract:
The kinetic features of secondary magnetic reconnection in a single flux rope undergoing internal kink instability are studied by means of three-dimensional Particle-in-Cell simulations. Several signatures of secondary magnetic reconnection are identified in the plane perpendicular to the flux rope: a quadrupolar electron and ion density structure and a bipolar Hall magnetic field develop in proximity of the reconnection region. The most intense electric fields form perpendicularly to the local magnetic field, and a reconnection electric field is identified in the plane perpendicular to the flux rope. An electron current develops along the reconnection line in the opposite direction of the electron current supporting the flux rope magnetic field structure. Along the reconnection line, several bipolar structures of the electric field parallel to the magnetic field occur making the magnetic reconnection region turbulent. The reported signatures of secondary magnetic reconnection can help to localize magnetic reconnection events in space, astrophysical and fusion plasmas.
Submitted 5 August, 2014;
originally announced August 2014.
-
Performance Analysis of Irregular Collective Communication with the Crystal Router Algorithm
Authors:
Michael Schliephake,
Erwin Laure
Abstract:
In order to achieve exascale performance it is important to detect potential bottlenecks and identify strategies to overcome them. For this, both applications and system software must be analysed and potentially improved. The EU FP7 project Collaborative Research into Exascale Systemware, Tools & Applications (CRESTA) chose the approach to co-design advanced simulation applications and system software as well as development tools. In this paper, we present the results of a co-design activity focused on the simulation code NEK5000 that aims at performance improvements of collective communication operations. We have analysed the algorithms that form the core of NEK5000's communication module in order to assess its viability on recent computer architectures before starting to improve its performance. Our results show that the crystal router algorithm performs well in sparse, irregular collective operations for medium and large processor counts, but improvements will be needed for the even larger system sizes of the future. We sketch the needed improvements, which will also make the communication algorithms beneficial for other applications that need to implement latency-dominated communication schemes with short messages. The latency-optimised communication operations will also be used in a runtime system providing dynamic load balancing, under development within CRESTA.
Submitted 5 August, 2014; v1 submitted 12 April, 2014;
originally announced April 2014.
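The staged routing idea behind a crystal router can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the NEK5000 implementation: it assumes a power-of-two number of ranks and a hypercube-style exchange, and the function name `crystal_route` and the `(dest, payload)` message layout are hypothetical choices made for illustration.

```python
def crystal_route(outboxes):
    """Simulate hypercube-style staged routing of sparse, irregular messages.

    outboxes[r] is the list of (dest, payload) messages initially held by
    rank r. Returns (inboxes, stages), where inboxes[r] holds every message
    whose destination is r. Assumption: the rank count is a power of two.
    """
    nranks = len(outboxes)
    if nranks == 0 or nranks & (nranks - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    for msgs in outboxes:
        for dest, _ in msgs:
            if not 0 <= dest < nranks:
                raise ValueError("destination rank out of range")

    held = [list(msgs) for msgs in outboxes]
    stages = 0
    bit = nranks >> 1  # start with the highest bit of the rank index
    while bit:
        new_held = [[] for _ in range(nranks)]
        for rank in range(nranks):
            partner = rank ^ bit
            for dest, payload in held[rank]:
                # Forward the message to the partner if its destination
                # lies in the other half with respect to the current bit.
                if (dest ^ rank) & bit:
                    new_held[partner].append((dest, payload))
                else:
                    new_held[rank].append((dest, payload))
        held = new_held
        stages += 1
        bit >>= 1
    return held, stages


if __name__ == "__main__":
    inboxes, stages = crystal_route([[(3, "a"), (1, "b")], [], [(0, "c")], []])
    print(stages, inboxes)
```

In this simplified form every message reaches its destination after log2(P) pairwise exchange stages, regardless of how irregular the communication pattern is, which is why short, latency-dominated messages are a natural fit.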
-
The Fluid-Kinetic Particle-in-Cell Solver for Plasma Simulations
Authors:
Stefano Markidis,
Pierre Henri,
Giovanni Lapenta,
Kjell Ronnmark,
Maria Hamrin,
Zakaria Meliani,
Erwin Laure
Abstract:
A new method that concurrently solves the multi-fluid and Maxwell's equations has been developed for plasma simulations. By calculating the stress tensor in the multi-fluid momentum equation by means of computational particles moving in a self-consistent electromagnetic field, the kinetic effects are retained while solving the multi-fluid equations. Maxwell's and the multi-fluid equations are discretized implicitly in time, enabling kinetic simulations over time scales typical of fluid simulations. The fluid-kinetic Particle-in-Cell solver has been implemented in a three-dimensional electromagnetic code, and tested against the ion cyclotron resonance and magnetic reconnection problems. The new method is a promising approach for coupling fluid and kinetic methods in a unified framework.
Submitted 5 June, 2013;
originally announced June 2013.
-
Kinetic Simulations of Plasmoid Chain Dynamics
Authors:
Stefano Markidis,
Pierre Henri,
Giovanni Lapenta,
Andrey Divin,
Martin Goldman,
David Newman,
Erwin Laure
Abstract:
The dynamics of a plasmoid chain is studied with three-dimensional Particle-in-Cell simulations. The evolution of the system with and without a uniform guide field, whose strength is 1/3 the asymptotic magnetic field, is investigated. The plasmoid chain forms by spontaneous magnetic reconnection: the tearing instability rapidly disrupts the initial current sheet, generating several small-scale plasmoids that rapidly grow in size, coalescing and kinking. The plasmoid kink is mainly driven by the coalescence process. It is found that the presence of a guide field strongly influences the evolution of the plasmoid chain. Without a guide field, a main reconnection site dominates and smaller reconnection regions are included in larger ones, leading to a hierarchical structure of the plasmoid-dominated current sheet. In contrast, in the presence of a guide field, plasmoids have approximately the same size and the hierarchical structure does not emerge, a strong core magnetic field develops in the center of the plasmoid in the direction of the existing guide field, and a bump-on-tail instability, leading to the formation of electron holes, is detected in proximity of the plasmoids.
Submitted 5 June, 2013;
originally announced June 2013.
-
Rethinking Electrostatic Solvers in Particle Simulations for the Exascale Era
Authors:
Stefano Markidis,
Giovanni Lapenta,
Rossen Apostolov,
Erwin Laure
Abstract:
In preparation for the exascale era, an alternative approach to calculate the electrostatic forces in Particle Mesh (PM) methods is proposed. While the traditional techniques are based on the calculation of the electrostatic potential by solving the Poisson equation, in the new approach the electric field is calculated by solving Ampere's law. When Ampere's law is discretized explicitly in time, the electric field values on the mesh are simply updated from the previous values. In this way, the electrostatic solver becomes an embarrassingly parallel problem, making the algorithm extremely scalable and suitable for exascale computing platforms. An implementation of a one-dimensional PM code is presented to show that the proposed method produces correct results, and that it is a very promising algorithm for exascale PM simulations.
Submitted 28 May, 2012; v1 submitted 10 May, 2012;
originally announced May 2012.
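The explicit field update described in the abstract above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the one-dimensional electrostatic form of Ampere's law, dE/dt = -J/eps0, in normalized units, and the function name `ampere_update` and its parameters are hypothetical.

```python
def ampere_update(e_field, current_density, dt, eps0=1.0):
    """Advance the mesh electric field by one explicit time step.

    Assumed 1D electrostatic form of Ampere's law (normalized units):
        E_new[i] = E_old[i] - dt * J[i] / eps0
    Each mesh point depends only on its own previous value and the local
    current density, so the points can be updated independently.
    """
    if len(e_field) != len(current_density):
        raise ValueError("field and current density must share the same mesh")
    return [e - dt * j / eps0 for e, j in zip(e_field, current_density)]


if __name__ == "__main__":
    # Two mesh points with opposite-sign current densities.
    print(ampere_update([1.0, 0.0], [2.0, -4.0], dt=0.5))
```

Because no global linear system (such as a Poisson solve) couples the mesh points in this update, the loop can be split across processes without communication, which is the property the abstract calls embarrassingly parallel.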
-
Monitoring field soil suction using a miniature tensiometer
Authors:
Yu-Jun Cui,
Anh-Minh Tang,
Altin Theodore Mantho,
Emmanuel De Laure
Abstract:
An experimental device was developed to monitor the field soil suction using a miniature tensiometer. This device consists of a double tube system that ensures good contact between the tensiometer and the soil surface at the bottom of the testing borehole. This system also allows the tensiometer to be retrieved periodically without disturbing the surrounding soil. This device was used to monitor the soil suction at the site of Boissy-le-Châtel, France. The measurement was performed at two depths (25 and 45 cm) during two months (May and June 2004). The recorded suction data are analyzed by comparison with the volumetric water content data recorded using TDR (Time Domain Reflectometer) probes as well as the meteorological data. A good agreement between these results was observed, showing a satisfactory performance of the developed device.
Submitted 15 January, 2008;
originally announced January 2008.
-
Running CMS software on GRID Testbeds
Authors:
D. Bonacorsi,
P. Capiluppi,
A. Fanfani,
C. Grandi,
M. Corvo,
F. Fanzago,
M. Sgaravatto,
M. Verlato,
C. Charlot,
I. Semeniuok,
D. Colling,
B. MacEvoy,
H. Tallini,
M. Biasotto,
S. Fantinel,
E. Leonardi,
A. Sciaba',
O. Maroney,
I. Augustin,
E. Laure,
M. Schulz,
H. Stockinger,
V. Lefebure,
S. Burke,
J. J. Blaising
, et al. (5 additional authors not shown)
Abstract:
Starting in the middle of November 2002, the CMS experiment undertook an evaluation of the European DataGrid Project (EDG) middleware using its event simulation programs. A joint CMS-EDG task force performed a "stress test" by submitting a large number of jobs to many distributed sites. The EDG testbed was complemented with additional CMS-dedicated resources. A total of ~ 10000 jobs consisting of two different computational types were submitted from four different locations in Europe over a period of about one month. Nine sites were active, providing integrated resources of more than 500 CPUs and about 5 TB of disk space (with the additional use of two Mass Storage Systems). Descriptions of the adopted procedures, the problems encountered and the corresponding solutions are reported. Results and evaluations of the test, both from the CMS and the EDG perspectives, are described.
Submitted 4 June, 2003;
originally announced June 2003.
-
Grid Data Management in Action: Experience in Running and Supporting Data Management Services in the EU DataGrid Project
Authors:
Heinz Stockinger,
Flavia Donno,
Erwin Laure,
Shahzad Muzaffar,
Peter Kunszt,
Giuseppe Andronico,
Paul Millar
Abstract:
In the first phase of the EU DataGrid (EDG) project, a Data Management System has been implemented and provided for deployment. The components of the current EDG Testbed are: a prototype of a Replica Manager Service built around the basic services provided by Globus, a centralised Replica Catalogue to store information about physical locations of files, and the Grid Data Mirroring Package (GDMP) that is widely used in various HEP collaborations in Europe and the US for data mirroring. During this year these services have been refined and made more robust so that they are fit to be used in a pre-production environment. Application users have been using this first release of the Data Management Services for more than a year. In the paper we present the components and their interaction, our implementation and experience as well as the feedback received from our user communities. We have resolved not only issues regarding integration with other EDG service components but also many of the interoperability issues with components of our partner projects in Europe and the U.S. The paper concludes with the basic lessons learned during this operation. These conclusions provide the motivation for the architecture of the next generation of Data Management Services that will be deployed in EDG during 2003.
Submitted 2 June, 2003;
originally announced June 2003.
-
Next-Generation EU DataGrid Data Management Services
Authors:
Diana Bosio,
James Casey,
Akos Frohner,
Leanne Guy,
Peter Kunszt,
Erwin Laure,
Sophie Lemaitre,
Levi Lucio,
Heinz Stockinger,
Kurt Stockinger,
William Bell,
David Cameron,
Gavin McCance,
Paul Millar,
Joni Hahkala,
Niklas Karlsson,
Ville Nenonen,
Mika Silander,
Olle Mulmo,
Gian-Luca Volpato,
Giuseppe Andronico
Abstract:
We describe the architecture and initial implementation of the next-generation of Grid Data Management Middleware in the EU DataGrid (EDG) project.
The new architecture stems from our experience and the user requirements gathered during the two years of running our initial set of Grid Data Management Services. All of our new services are based on the Web Service technology paradigm, very much in line with the emerging Open Grid Services Architecture (OGSA). We have modularized our components and invested a great amount of effort towards a secure, extensible and robust service, starting from the design but also using a streamlined build and testing framework.
Our service components are: Replica Location Service, Replica Metadata Service, Replica Optimization Service, Replica Subscription and high-level replica management. The service security infrastructure is fully GSI-enabled, hence compatible with the existing Globus Toolkit 2-based services; moreover, it allows for fine-grained authorization mechanisms that can be adjusted depending on the service semantics.
Submitted 12 June, 2003; v1 submitted 30 May, 2003;
originally announced May 2003.