Search | arXiv e-print repository

The Case for Transport-Level Encryption in Datacenter Networks

Authors: Tianyi Gao, Xinshu Ma, Suhas Narreddy, Eugenio Luo, Steven W. D. Chien, Michio Honda

Abstract: Cloud applications need network data encryption to isolate from other tenants and protect their data from potential eavesdroppers in the network infrastructure. This paper presents SDP, a protocol design for emerging datacenter transport protocols, such as pHost, NDP, and Homa, to integrate data encryption with the use of existing NIC offloading of cryptographic operations designed for TLS over TC… ▽ More Cloud applications need network data encryption to isolate from other tenants and protect their data from potential eavesdroppers in the network infrastructure. This paper presents SDP, a protocol design for emerging datacenter transport protocols, such as pHost, NDP, and Homa, to integrate data encryption with the use of existing NIC offloading of cryptographic operations designed for TLS over TCP. Therefore, SDP could enable a deployment path of new transport protocols in datacenters without giving up hardware offloading support, which would otherwise make encryption on those protocols even slower than TLS over TCP. SDP is based on Homa, and outperforms TLS over TCP by up to 29 % in throughput. SDP currently supports two real-world applications, Redis, improving throughput by up to 24 %, and in-kernel NVMe-oF, cutting P99 latency by up to 21 %. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2401.14576 [pdf]

Accelerating Scientific Application through Transparent I/O Interposition

Authors: Steven W. D. Chien, Kento Sato, Artur Podobas, Niclas Jansson, Stefano Markidis, Michio Honda

Abstract: The ability to handle a large volume of data generated by scientific applications is crucial. We have seen an increase in the heterogeneity of storage technologies available to scientific applications, such as burst buffers, local temporary block storage, managed cloud parallel file systems (PFS), and non-POSIX object stores. However, scientific applications designed for traditional HPC systems ca… ▽ More The ability to handle a large volume of data generated by scientific applications is crucial. We have seen an increase in the heterogeneity of storage technologies available to scientific applications, such as burst buffers, local temporary block storage, managed cloud parallel file systems (PFS), and non-POSIX object stores. However, scientific applications designed for traditional HPC systems can not easily exploit those storage systems due to cost, throughput, and programming model challenges. We present iFast, a new library-level approach to transparently accelerating scientific applications based on MPI-IO. It decouples application I/O, data caching, and data storage to support heterogeneous storage models. Design decisions of iFast are based on a strong emphasis on deployability. It is highly general with only MPI as a core dependency, allowing users to run unmodified MPI-based applications with unmodified MPI implementations - even proprietary ones like IntelMPI and Cray MPICH. Our approach supports a wide range of networked storage, including traditional PFS, ordinary NFS, and S3-based cloud storage. Unlike previous approaches, iFast ensures crash consistency even across compute nodes. We demonstrate iFast in cloud HPC platform, small local cluster, and hybrid of both to show its generality. Our results show that iFast reduces end-to-end execution time by 13-26% for three popular scientific applications on the cloud. It also outperforms the state-of-the-art system, SymphonyFS, a filesystem-based approach for similar goals but without crash consistency, by 12-23%. △ Less

Submitted 25 January, 2024; originally announced January 2024.

Comments: Submitted to HPDC 2024

arXiv:2107.06676 [pdf, other]

Higgs Boson Classification: Brain-inspired BCPNN Learning with StreamBrain

Authors: Martin Svedin, Artur Podobas, Steven W. D. Chien, Stefano Markidis

Abstract: One of the most promising approaches for data analysis and exploration of large data sets is Machine Learning techniques that are inspired by brain models. Such methods use alternative learning rules potentially more efficiently than established learning rules. In this work, we focus on the potential of brain-inspired ML for exploiting High-Performance Computing (HPC) resources to solve ML problem… ▽ More One of the most promising approaches for data analysis and exploration of large data sets is Machine Learning techniques that are inspired by brain models. Such methods use alternative learning rules potentially more efficiently than established learning rules. In this work, we focus on the potential of brain-inspired ML for exploiting High-Performance Computing (HPC) resources to solve ML problems: we discuss the BCPNN and an HPC implementation, called StreamBrain, its computational cost, suitability to HPC systems. As an example, we use StreamBrain to analyze the Higgs Boson dataset from High Energy Physics and discriminate between background and signal classes in collisions of high-energy particle colliders. Overall, we reach up to 69.15% accuracy and 76.4% Area Under the Curve (AUC) performance. △ Less

Submitted 17 August, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

Comments: Accepted for publication at The 2nd Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S 2021)

arXiv:2106.05373 [pdf, other]

doi 10.1145/3468044.3468052

StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs

Authors: Artur Podobas, Martin Svedin, Steven W. D. Chien, Ivy B. Peng, Naresh Balaji Ravichandran, Pawel Herman, Anders Lansner, Stefano Markidis

Abstract: The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other -- less-known -- machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In… ▽ More The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other -- less-known -- machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In this paper, we introduce StreamBrain -- a framework that allows neural networks based on BCPNN to be practically deployed in High-Performance Computing systems. StreamBrain is a domain-specific language (DSL), similar in concept to existing machine learning (ML) frameworks, and supports backends for CPUs, GPUs, and even FPGAs. We empirically demonstrate that StreamBrain can train the well-known ML benchmark dataset MNIST within seconds, and we are the first to demonstrate BCPNN on STL-10 size networks. We also show how StreamBrain can be used to train with custom floating-point formats and illustrate the impact of using different bfloat variations on BCPNN using FPGAs. △ Less

Submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted for publication at the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2021)

arXiv:2106.04979 [pdf]

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

Authors: Martin Svedin, Steven W. D. Chien, Gibson Chikafa, Niclas Jansson, Artur Podobas

Abstract: For many, Graphics Processing Units (GPUs) provides a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI-workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non… ▽ More For many, Graphics Processing Units (GPUs) provides a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI-workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with particular focus on empirically quantifying our derived performance expectations, and -- should those expectations be undelivered -- investigate whether the introduced data-movement features can offset any eventual loss in performance? We find that the A100 delivers less performance increase than previous generations for the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark and demonstrate where (and more importantly, how) they should be used. △ Less

Submitted 3 July, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: 7 pages

arXiv:2008.04397 [pdf, other]

doi 10.1109/SBAC-PAD49847.2020.00030

sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems

Authors: Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, Ivy B. Peng, Artur Podobas, Stefano Markidis

Abstract: Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes re… ▽ More Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation for exploiting such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to domain decomposition on CPU-based implementations, to use particle batches for overlap** communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the sputniPIC CPU OpenMP version performance. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were only possible using clusters. △ Less

Submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020)

arXiv:2008.04395 [pdf, other]

doi 10.1109/CLUSTER49012.2020.00046

tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

Authors: Steven W. D. Chien, Artur Podobas, Ivy B. Peng, Stefano Markidis

Abstract: Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and al… ▽ More Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan by performing two case studies on ImageNet image and Malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast tier storage. We also show that Darshan has the potential of being used as a runtime library for profiling and providing information for future optimization. △ Less

Submitted 11 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the 2020 International Conference on Cluster Computing (CLUSTER 2020)

arXiv:1910.09598 [pdf, other]

doi 10.1109/MCHPC49590.2019.00014

Performance Evaluation of Advanced Features in CUDA Unified Memory

Authors: Steven W. D. Chien, Ivy B. Peng, Stefano Markidis

Abstract: CUDA Unified Memory improves the GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to eva… ▽ More CUDA Unified Memory improves the GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to evaluate both in-memory and oversubscription performance. The results show that memory advises on the Intel-Volta/Pascal-PCIe platform bring negligible improvement for in-memory executions. However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to the basic CUDA Unified Memory. In contrast, the Power9-Volta-NVLink platform can substantially benefit from memory advises, achieving up to 34% performance gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases GPU page faults and results in considerable performance loss. The CUDA prefetch also shows different performance impact on the two platforms. It improves performance by up to 50% on the Intel-Volta/Pascal-PCI-E platform but brings little benefit to the Power9-Volta-NVLink platform. △ Less

Submitted 21 October, 2019; originally announced October 2019.

Comments: Accepted for publication at Workshop on Memory Centric High Performance Computing (MCHPC'19) in SC19

arXiv:1908.05715 [pdf, other]

doi 10.1029/2021JA029620

Automated classification of plasma regions using 3D particle energy distributions

Authors: Vyacheslav Olshevsky, Yuri V. Khotyaintsev, Ahmad Lalti, Andrey Divin, Gian Luca Delzanno, Sven Anderzen, Pawel Herman, Steven W. D. Chien, Levon Avanov, Andrew P. Dimmock, Stefano Markidis

Abstract: We investigate the properties of the ion sky maps produced by the Dual Ion Spectrometers (DIS) from the Fast Plasma Investigation (FPI). We have trained a convolutional neural network classifier to predict four regions crossed by the MMS on the dayside magnetosphere: solar wind, ion foreshock, magnetosheath, and magnetopause using solely DIS spectrograms. The accuracy of the classifier is >98%. We… ▽ More We investigate the properties of the ion sky maps produced by the Dual Ion Spectrometers (DIS) from the Fast Plasma Investigation (FPI). We have trained a convolutional neural network classifier to predict four regions crossed by the MMS on the dayside magnetosphere: solar wind, ion foreshock, magnetosheath, and magnetopause using solely DIS spectrograms. The accuracy of the classifier is >98%. We use the classifier to detect mixed plasma regions, in particular to find the bow shock regions. A similar approach can be used to identify the magnetopause crossings and reveal regions prone to magnetic reconnection. Data processing through the trained classifier is fast and efficient and thus can be used for classification for the whole MMS database. △ Less

Submitted 21 September, 2021; v1 submitted 15 August, 2019; originally announced August 2019.

Comments: Accepted to JGR: Space Physics

arXiv:1907.05917 [pdf, other]

doi 10.1007/978-3-030-43229-4_26

Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications

Authors: Steven W. D. Chien, Ivy B. Peng, Stefano Markidis

Abstract: Floating-point operations can significantly impact the accuracy and performance of scientific applications on large-scale parallel systems. Recently, an emerging floating-point format called Posit has attracted attention as an alternative to the standard IEEE floating-point formats because it could enable higher precision than IEEE formats using the same number of bits. In this work, we first expl… ▽ More Floating-point operations can significantly impact the accuracy and performance of scientific applications on large-scale parallel systems. Recently, an emerging floating-point format called Posit has attracted attention as an alternative to the standard IEEE floating-point formats because it could enable higher precision than IEEE formats using the same number of bits. In this work, we first explored the feasibility of Posit encoding in representative HPC applications by providing a 32-bit Posit NAS Parallel Benchmark (NPB) suite. Then, we evaluate the accuracy improvement in different HPC kernels compared to the IEEE 754 format. Our results indicate that using Posit encoding achieves optimized precision, ranging from 0.6 to 1.4 decimal digit, for all tested kernels and proxy-applications. Also, we quantified the overhead of the current software implementation of Posit encoding as 4x-19x that of IEEE 754 hardware implementation. Our study highlights the potential of hardware implementations of Posit to benefit a broad range of HPC applications. △ Less

Submitted 12 July, 2019; originally announced July 2019.

Comments: Accepted for publication in PPAM 2019 conference

arXiv:1904.03684 [pdf, other]

doi 10.1007/978-3-030-22750-0_58

Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code

Authors: Chaitanya Prasad Sishtla, Steven W. D. Chien, Vyacheslav Olshevsky, Erwin Laure, Stefano Markidis

Abstract: iPIC3D is a widely used massively parallel Particle-in-Cell code for the simulation of space plasmas. However, its current implementation does not support execution on multiple GPUs. In this paper, we describe the porting of iPIC3D particle mover to GPUs and the optimization steps to increase the performance and parallel scaling on multiple GPUs. We analyze the strong scaling of the mover on two G… ▽ More iPIC3D is a widely used massively parallel Particle-in-Cell code for the simulation of space plasmas. However, its current implementation does not support execution on multiple GPUs. In this paper, we describe the porting of iPIC3D particle mover to GPUs and the optimization steps to increase the performance and parallel scaling on multiple GPUs. We analyze the strong scaling of the mover on two GPU clusters and evaluate its performance and acceleration. The optimized GPU version which uses pinned memory and asynchronous data prefetching outperform their corresponding CPU versions by 5-10x on two different systems equipped with NVIDIA K80 and V100 GPUs. △ Less

Submitted 7 April, 2019; originally announced April 2019.

Comments: Accepted for publication in ICCS 2019

arXiv:1903.04364 [pdf, other]

doi 10.1109/IPDPSW.2019.00092

TensorFlow Doing HPC

Authors: Steven W. D. Chien, Stefano Markidis, Vyacheslav Olshevsky, Yaroslav Bulatov, Erwin Laure, Jeffrey S. Vetter

Abstract: TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for develo** Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HP… ▽ More TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for develo** Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for develo** HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential of emerging also as HPC programming framework for heterogeneous supercomputers. △ Less

Submitted 11 March, 2019; originally announced March 2019.

Comments: Accepted for publication at The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES'19)

arXiv:1901.09638 [pdf, other]

doi 10.1088/1742-6596/1225/1/012009

Particle-in-Cell Simulations of Plasma Dynamics in Cometary Environment

Authors: Chaitanya Prasad Sishtla, Vyacheslav Olshevsky, Steven W. D. Chien, Stefano Markidis, Erwin Laure

Abstract: We perform and analyze global Particle-in-Cell (PIC) simulations of the interaction between solar wind and an outgassing comet with the goal of studying the plasma kinetic dynamics of a cometary environment. To achieve this, we design and implement a new numerical method in the iPIC3D code to model outgassing from the comet: new plasma particles are ejected from the comet "surface" at each computa… ▽ More We perform and analyze global Particle-in-Cell (PIC) simulations of the interaction between solar wind and an outgassing comet with the goal of studying the plasma kinetic dynamics of a cometary environment. To achieve this, we design and implement a new numerical method in the iPIC3D code to model outgassing from the comet: new plasma particles are ejected from the comet "surface" at each computational cycle. Our simulations show that a bow shock is formed as a result of the interaction between solar wind and outgassed particles. The analysis of distribution functions for the PIC simulations shows that at the bow shock part of the incoming solar wind, ions are reflected while electrons are heated. This work attempts to reveal kinetic effects in the atmosphere of an outgassing comet using a fully kinetic Particle-in-Cell model. △ Less

Submitted 28 January, 2019; originally announced January 2019.

Comments: 11 pages, 5 figures, ASTRONUM-2018

Journal ref: Journal of Physics: Conference Series, Volume 1225 (2019), 012009

arXiv:1810.03035 [pdf, other]

doi 10.1109/PDSW-DISCS.2018.00011

Characterizing Deep-Learning I/O Workloads in TensorFlow

Authors: Steven W. D. Chien, Stefano Markidis, Chaitanya Prasad Sishtla, Luis Santos, Pawel Herman, Sai Narasimhamurthy, Erwin Laure

Abstract: The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to resta… ▽ More The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the tensorFlow prefetcher results in a complete overlap of computation on accelerator and input pipeline on CPU eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast small capacity storage and copy asynchronously the checkpoints to a slower large capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment. △ Less

Submitted 6 October, 2018; originally announced October 2018.

Comments: Accepted for publication at pdsw-DISCS 2018

arXiv:1807.05183 [pdf, other]

doi 10.3389/fphy.2018.00100

PolyPIC: the Polymorphic-Particle-in-Cell Method for Fluid-Kinetic Coupling

Authors: Stefano Markidis, Vyacheslav Olshevsky, Chaitanya Prasad Sishtla, Steven Wei-der Chien, Erwin Laure, Giovanni Lapenta

Abstract: Particle-in-Cell (PIC) methods are widely used computational tools for fluid and kinetic plasma modeling. While both the fluid and kinetic PIC approaches have been successfully used to target either kinetic or fluid simulations, little was done to combine fluid and kinetic particles under the same PIC framework. This work addresses this issue by proposing a new PIC method, PolyPIC, that uses polym… ▽ More Particle-in-Cell (PIC) methods are widely used computational tools for fluid and kinetic plasma modeling. While both the fluid and kinetic PIC approaches have been successfully used to target either kinetic or fluid simulations, little was done to combine fluid and kinetic particles under the same PIC framework. This work addresses this issue by proposing a new PIC method, PolyPIC, that uses polymorphic computational particles. In this numerical scheme, particles can be either kinetic or fluid, and fluid particles can become kinetic when necessary, e.g. particles undergoing a strong acceleration. We design and implement the PolyPIC method, and test it against the Landau dam** of Langmuir and ion acoustic waves, two stream instability and sheath formation. We unify the fluid and kinetic PIC methods under one common framework comprising both fluid and kinetic particles, providing a tool for adaptive fluid-kinetic coupling in plasma simulations. △ Less

Submitted 13 July, 2018; originally announced July 2018.

Comments: Submitted to Frontiers

Journal ref: Frontiers in Physics, 6 (2018), 100

arXiv:1807.03632 [pdf, other]

doi 10.1145/3203217.3205341

The SAGE Project: a Storage Centric Approach for Exascale Computing

Authors: Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Steven Wei-der Chien, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Shaun de Witt, Dirk Pleiter, Stefano Markidis

Abstract: SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a "storage centric" approach as it is capable of storing and processing large data volumes at the Exascale r… ▽ More SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a "storage centric" approach as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors where data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been been implemented and installed at the Julich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage that is accessible via the Clovis API and higher level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts, the prototype of the SAGE platform and discuss the software architecture of the SAGE system. △ Less

Submitted 6 July, 2018; originally announced July 2018.

Comments: Submitted to Computing Frontiers 2018. arXiv admin note: substantial text overlap with arXiv:1805.00556

arXiv:1807.02562 [pdf, other]

doi 10.1007/978-3-030-02465-9_8

Exploring Scientific Application Performance Using Large Scale Object Storage

Authors: Steven Wei-der Chien, Stefano Markidis, Rami Karim, Erwin Laure, Sai Narasimhamurthy

Abstract: One of the major performance and scalability bottlenecks in large scientific applications is parallel reading and writing to supercomputer I/O systems. The usage of parallel file systems and consistency requirements of POSIX, that all the traditional HPC parallel I/O interfaces adhere to, pose limitations to the scalability of scientific applications. Object storage is a widely used storage techno… ▽ More One of the major performance and scalability bottlenecks in large scientific applications is parallel reading and writing to supercomputer I/O systems. The usage of parallel file systems and consistency requirements of POSIX, that all the traditional HPC parallel I/O interfaces adhere to, pose limitations to the scalability of scientific applications. Object storage is a widely used storage technology in cloud computing and is more frequently proposed for HPC workload to address and improve the current scalability and performance of I/O in scientific applications. While object storage is a promising technology, it is still unclear how scientific applications will use object storage and what the main performance benefits will be. This work addresses these questions, by emulating an object storage used by a traditional scientific application and evaluating potential performance benefits. We show that scientific applications can benefit from the usage of object storage on large scales. △ Less

Submitted 6 July, 2018; originally announced July 2018.

Comments: Preprint submitted to WOPSSS workshop at ISC 2018

arXiv:1803.04014 [pdf, other]

doi 10.1109/IPDPSW.2018.00091

NVIDIA Tensor Core Programmability, Performance & Precision

Authors: Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, Jeffrey S. Vetter

Abstract: The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to pro… ▽ More The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using of NVIDIA Tensor Cores. △ Less

Submitted 11 March, 2018; originally announced March 2018.

Comments: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018

Showing 1–18 of 18 results for author: Chien, S W