Search | arXiv e-print repository

Kidnap** Deep Learning-based Multirotors using Optimized Flying Adversarial Patches

Authors: Pia Hanfeld, Khaled Wahba, Marina M. -C. Höhne, Michael Bussmann, Wolfgang Hönig

Abstract: Autonomous flying robots, such as multirotors, often rely on deep learning models that make predictions based on a camera image, e.g. for pose estimation. These models can predict surprising results if applied to input images outside the training domain. This fault can be exploited by adversarial attacks, for example, by computing small images, so-called adversarial patches, that can be placed in… ▽ More Autonomous flying robots, such as multirotors, often rely on deep learning models that make predictions based on a camera image, e.g. for pose estimation. These models can predict surprising results if applied to input images outside the training domain. This fault can be exploited by adversarial attacks, for example, by computing small images, so-called adversarial patches, that can be placed in the environment to manipulate the neural network's prediction. We introduce flying adversarial patches, where multiple images are mounted on at least one other flying robot and therefore can be placed anywhere in the field of view of a victim multirotor. By introducing the attacker robots, the system is extended to an adversarial multi-robot system. For an effective attack, we compare three methods that simultaneously optimize multiple adversarial patches and their position in the input image. We show that our methods scale well with the number of adversarial patches. Moreover, we demonstrate physical flights with two robots, where we employ a novel attack policy that uses the computed adversarial patches to kidnap a robot that was supposed to follow a human. △ Less

Submitted 23 October, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

Comments: Accepted at MRS 2023, 7 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2305.12859

arXiv:2305.12859 [pdf, other]

Flying Adversarial Patches: Manipulating the Behavior of Deep Learning-based Autonomous Multirotors

Authors: Pia Hanfeld, Marina M. -C. Höhne, Michael Bussmann, Wolfgang Hönig

Abstract: Autonomous flying robots, e.g. multirotors, often rely on a neural network that makes predictions based on a camera image. These deep learning (DL) models can compute surprising results if applied to input images outside the training domain. Adversarial attacks exploit this fault, for example, by computing small images, so-called adversarial patches, that can be placed in the environment to manipu… ▽ More Autonomous flying robots, e.g. multirotors, often rely on a neural network that makes predictions based on a camera image. These deep learning (DL) models can compute surprising results if applied to input images outside the training domain. Adversarial attacks exploit this fault, for example, by computing small images, so-called adversarial patches, that can be placed in the environment to manipulate the neural network's prediction. We introduce flying adversarial patches, where an image is mounted on another flying robot and therefore can be placed anywhere in the field of view of a victim multirotor. For an effective attack, we compare three methods that simultaneously optimize the adversarial patch and its position in the input image. We perform an empirical validation on a publicly available DL model and dataset for autonomous multirotors. Ultimately, our attacking multirotor would be able to gain full control over the motions of the victim multirotor. △ Less

Submitted 31 July, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 6 pages, 5 figures, Workshop on Multi-Robot Learning, International Conference on Robotics and Automation (ICRA)

arXiv:2303.00657 [pdf, other]

Learning Electron Bunch Distribution along a FEL Beamline by Normalising Flows

Authors: Anna Willmann, Jurjen Couperus Cabadağ, Yen-Yu Chang, Richard Pausch, Amin Ghaith, Alexander Debus, Arie Irman, Michael Bussmann, Ulrich Schramm, Nico Hoffmann

Abstract: Understanding and control of Laser-driven Free Electron Lasers remain to be difficult problems that require highly intensive experimental and theoretical research. The gap between simulated and experimentally collected data might complicate studies and interpretation of obtained results. In this work we developed a deep learning based surrogate that could help to fill in this gap. We introduce a s… ▽ More Understanding and control of Laser-driven Free Electron Lasers remain to be difficult problems that require highly intensive experimental and theoretical research. The gap between simulated and experimentally collected data might complicate studies and interpretation of obtained results. In this work we developed a deep learning based surrogate that could help to fill in this gap. We introduce a surrogate model based on normalising flows for conditional phase-space representation of electron clouds in a FEL beamline. Achieved results let us discuss further benefits and limitations in exploitability of the models to gain deeper understanding of fundamental processes within a beamline. △ Less

Submitted 27 February, 2023; originally announced March 2023.

Comments: 7 pages, 5 figures

Journal ref: Machine Learning and the Physical Sciences 2022 workshop, NeurIPS

arXiv:2211.04770 [pdf, other]

Continual learning autoencoder training for a particle-in-cell simulation via streaming

Authors: Patrick Stiller, Varun Makdani, Franz Pöschel, Richard Pausch, Alexander Debus, Michael Bussmann, Nico Hoffmann

Abstract: The upcoming exascale era will provide a new generation of physics simulations. These simulations will have a high spatiotemporal resolution, which will impact the training of machine learning models since storing a high amount of simulation data on disk is nearly impossible. Therefore, we need to rethink the training of machine learning models for simulations for the upcoming exascale era. This w… ▽ More The upcoming exascale era will provide a new generation of physics simulations. These simulations will have a high spatiotemporal resolution, which will impact the training of machine learning models since storing a high amount of simulation data on disk is nearly impossible. Therefore, we need to rethink the training of machine learning models for simulations for the upcoming exascale era. This work presents an approach that trains a neural network concurrently to a running simulation without storing data on a disk. The training pipeline accesses the training data by in-memory streaming. Furthermore, we apply methods from the domain of continual learning to enhance the generalization of the model. We tested our pipeline on the training of a 3d autoencoder trained concurrently to laser wakefield acceleration particle-in-cell simulation. Furthermore, we experimented with various continual learning methods and their effect on the generalization. △ Less

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2209.09731 [pdf]

Application Experiences on a GPU-Accelerated Arm-based HPC Testbed

Authors: Wael Elwasif, William Godoy, Nick Hagerty, J. Austin Harris, Oscar Hernandez, Balint Joo, Paul Kent, Damien Lebrun-Grandie, Elijah Maccarthy, Veronica G. Melesse Vergara, Bronson Messer, Ross Miller, Sarp Opal, Sergei Bastrakov, Michael Bussmann, Alexander Debus, Klaus Steinger, Jan Stephan, Rene Widera, Spencer H. Bryngelson, Henry Le Berre, Anand Radhakrishnan, Jefferey Young, Sunita Chandrasekaran, Florina Ciorba , et al. (6 additional authors not shown)

Abstract: This paper assesses and reports the experience of ten teams working to port,validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems built by GIGABYTE, each one equipped with a server-class Arm CPU from Ampere Computing and A100 data center GPU from NVIDIA Corp. The syst… ▽ More This paper assesses and reports the experience of ten teams working to port,validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems built by GIGABYTE, each one equipped with a server-class Arm CPU from Ampere Computing and A100 data center GPU from NVIDIA Corp. The systems are connected together using Infiniband high-bandwidth low-latency interconnect. The selected applications and mini-apps are written using several programming languages and use multiple accelerator-based programming models for GPUs such as CUDA, OpenACC, and OpenMP offloading. Working on application porting requires a robust and easy-to-access programming environment, including a variety of compilers and optimized scientific libraries. The goal of this work is to evaluate platform readiness and assess the effort required from developers to deploy well-established scientific workloads on current and future generation Arm-based GPU-accelerated HPC systems. The reported case studies demonstrate that the current level of maturity and diversity of software and tools is already adequate for large-scale production deployments. △ Less

Submitted 19 December, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

arXiv:2110.08650 [pdf, ps, other]

doi 10.1007/978-3-030-97759-7_5

Challenges Porting a C++ Template-Metaprogramming Abstraction Layer to Directive-based Offloading

Authors: Jeffrey Kelling, Sergei Bastrakov, Alexander Debus, Thomas Kluge, Matt Leinhauser, Richard Pausch, Klaus Steiniger, Jan Stephan, René Widera, Jeff Young, Michael Bussmann, Sunita Chandrasekaran, Guido Juckeland

Abstract: HPC systems employ a growing variety of compute accelerators with different architectures and from different vendors. Large scientific applications are required to run efficiently across these systems but need to retain a single code-base in order to not stifle development. Directive-based offloading programming models set out to provide the required portability, but, to existing codes, they thems… ▽ More HPC systems employ a growing variety of compute accelerators with different architectures and from different vendors. Large scientific applications are required to run efficiently across these systems but need to retain a single code-base in order to not stifle development. Directive-based offloading programming models set out to provide the required portability, but, to existing codes, they themselves represent yet another API to port to. Here, we present our approach of porting the GPU-accelerated particle-in-cell code PIConGPU to OpenACC and OpenMP target by adding two new backends to its existing C++-template metaprogramming-based offloading abstraction layer alpaka and avoiding other modifications to the application code. We introduce our approach in the face of conflicts between requirements and available features in the standards as well as practical hurdles posed by immature compiler support. △ Less

Submitted 24 January, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

Comments: 20 pages, 1 figure, 3 tables, WACCPD@SC21

ACM Class: D.1.3; D.2.1; D.3.3

arXiv:2110.08221 [pdf, other]

Metrics and Design of an Instruction Roofline Model for AMD GPUs

Authors: Matthew Leinhauser, René Widera, Sergei Bastrakov, Alexander Debus, Michael Bussmann, Sunita Chandrasekaran

Abstract: Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD architectures (CPU-GPU), which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs… ▽ More Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD architectures (CPU-GPU), which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this paper, we design an instruction roofline model for AMD GPUs using AMD's ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application's performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell (PIC) simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work. △ Less

Submitted 10 November, 2021; v1 submitted 15 October, 2021; originally announced October 2021.

Comments: 14 pages, 7 figures, 2 tables, 4 equations, explains how to create an instruction roofline model for an AMD GPU as of Oct. 2021

arXiv:2107.13067 [pdf, other]

A Deep Learning Algorithm for Piecewise Linear Interface Construction (PLIC)

Authors: Mohammadmehdi Ataei, Erfan Pirmorad, Franco Costa, Se** Han, Chul B Park, Markus Bussmann

Abstract: Piecewise Linear Interface Construction (PLIC) is frequently used to geometrically reconstruct fluid interfaces in Computational Fluid Dynamics (CFD) modeling of two-phase flows. PLIC reconstructs interfaces from a scalar field that represents the volume fraction of each phase in each computational cell. Given the volume fraction and interface normal, the location of a linear interface is uniquely… ▽ More Piecewise Linear Interface Construction (PLIC) is frequently used to geometrically reconstruct fluid interfaces in Computational Fluid Dynamics (CFD) modeling of two-phase flows. PLIC reconstructs interfaces from a scalar field that represents the volume fraction of each phase in each computational cell. Given the volume fraction and interface normal, the location of a linear interface is uniquely defined. For a cubic computational cell (3D), the position of the planar interface is determined by intersecting the cube with a plane, such that the volume of the resulting truncated polyhedron cell is equal to the volume fraction. Yet it is geometrically complex to find the exact position of the plane, and it involves calculations that can be a computational bottleneck of many CFD models. However, while the forward problem of 3D PLIC is challenging, the inverse problem, of finding the volume of the truncated polyhedron cell given a defined plane, is simple. In this work, we propose a deep learning model for the solution to the forward problem of PLIC by only making use of its inverse problem. The proposed model is up to several orders of magnitude faster than traditional schemes, which significantly reduces the computational bottleneck of PLIC in CFD simulations. △ Less

Submitted 27 July, 2021; originally announced July 2021.

MSC Class: 65Dxx

arXiv:2107.06108 [pdf]

doi 10.1007/978-3-030-96498-6_6

Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2

Authors: Franz Poeschel, Juncheng E, William F. Godoy, Norbert Podhorszki, Scott Klasky, Greg Eisenhauer, Philip E. Davis, Lipeng Wan, Ana Gainaru, Junmin Gu, Fabian Koller, René Widera, Michael Bussmann, Axel Huebl

Abstract: This paper aims to create a transition path from file-based IO to streaming-based workflows for scientific applications in an HPC environment. By using the openPMP-api, traditional workflows limited by filesystem bottlenecks can be overcome and flexibly extended for in situ analysis. The openPMD-api is a library for the description of scientific data according to the Open Standard for Particle-Mes… ▽ More This paper aims to create a transition path from file-based IO to streaming-based workflows for scientific applications in an HPC environment. By using the openPMP-api, traditional workflows limited by filesystem bottlenecks can be overcome and flexibly extended for in situ analysis. The openPMD-api is a library for the description of scientific data according to the Open Standard for Particle-Mesh Data (openPMD). Its approach towards recent challenges posed by hardware heterogeneity lies in the decoupling of data description in domain sciences, such as plasma physics simulations, from concrete implementations in hardware and IO. The streaming backend is provided by the ADIOS2 framework, developed at Oak Ridge National Laboratory. This paper surveys two openPMD-based loosely-coupled setups to demonstrate flexible applicability and to evaluate performance. In loose coupling, as opposed to tight coupling, two (or more) applications are executed separately, e.g. in individual MPI contexts, yet cooperate by exchanging data. This way, a streaming-based workflow allows for standalone codes instead of tightly-coupled plugins, using a unified streaming-aware API and leveraging high-speed communication infrastructure available in modern compute clusters for massive data exchange. We determine new challenges in resource allocation and in the need of strategies for a flexible data distribution, demonstrating their influence on efficiency and scaling on the Summit compute system. The presented setups show the potential for a more flexible use of compute resources brought by streaming IO as well as the ability to increase throughput by avoiding filesystem bottlenecks. △ Less

Submitted 19 January, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: 18 pages, 9 figures, SMC2021, supplementary material at https://zenodo.org/record/4906276

arXiv:2106.12894 [pdf, other]

InFlow: Robust outlier detection utilizing Normalizing Flows

Authors: Nishant Kumar, Pia Hanfeld, Michael Hecht, Michael Bussmann, Stefan Gumhold, Nico Hoffmann

Abstract: Normalizing flows are prominent deep generative models that provide tractable probability distributions and efficient density estimation. However, they are well known to fail while detecting Out-of-Distribution (OOD) inputs as they directly encode the local features of the input representations in their latent space. In this paper, we solve this overconfidence issue of normalizing flows by demonst… ▽ More Normalizing flows are prominent deep generative models that provide tractable probability distributions and efficient density estimation. However, they are well known to fail while detecting Out-of-Distribution (OOD) inputs as they directly encode the local features of the input representations in their latent space. In this paper, we solve this overconfidence issue of normalizing flows by demonstrating that flows, if extended by an attention mechanism, can reliably detect outliers including adversarial attacks. Our approach does not require outlier data for training and we showcase the efficiency of our method for OOD detection by reporting state-of-the-art performance in diverse experimental settings. Code available at https://github.com/ComputationalRadiationPhysics/InFlow . △ Less

Submitted 16 November, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

arXiv:2106.04284 [pdf, other]

doi 10.1002/spe.3077

LLAMA: The Low-Level Abstraction For Memory Access

Authors: Bernhard Manfred Gruber, Guilherme Amadio, Jakob Blomer, Alexander Matthes, René Widera, Michael Bussmann

Abstract: The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished v… ▽ More The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Structs) and SoA (Struct of Arrays) layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU\textsuperscript{\textregistered} lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment. △ Less

Submitted 9 March, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: 39 pages, 10 figures, 11 listings

Journal ref: Softw Pract Exper. 2022; 1- 27

arXiv:2106.00432 [pdf, other]

Invertible Surrogate Models: Joint surrogate modelling and reconstruction of Laser-Wakefield Acceleration by invertible neural networks

Authors: Friedrich Bethke, Richard Pausch, Patrick Stiller, Alexander Debus, Michael Bussmann, Nico Hoffmann

Abstract: Invertible neural networks are a recent technique in machine learning promising neural network architectures that can be run in forward and reverse mode. In this paper, we will be introducing invertible surrogate models that approximate complex forward simulation of the physics involved in laser plasma accelerators: iLWFA. The bijective design of the surrogate model also provides all means for rec… ▽ More Invertible neural networks are a recent technique in machine learning promising neural network architectures that can be run in forward and reverse mode. In this paper, we will be introducing invertible surrogate models that approximate complex forward simulation of the physics involved in laser plasma accelerators: iLWFA. The bijective design of the surrogate model also provides all means for reconstruction of experimentally acquired diagnostics. The quality of our invertible laser wakefield acceleration network will be verified on a large set of numerical LWFA simulations. △ Less

Submitted 1 June, 2021; originally announced June 2021.

arXiv:2106.00317 [pdf, other]

Data-Driven Shadowgraph Simulation of a 3D Object

Authors: Anna Willmann, Patrick Stiller, Alexander Debus, Arie Irman, Richard Pausch, Yen-Yu Chang, Michael Bussmann, Nico Hoffmann

Abstract: In this work we propose a deep neural network based surrogate model for a plasma shadowgraph - a technique for visualization of perturbations in a transparent medium. We are substituting the numerical code by a computationally cheaper projection based surrogate model that is able to approximate the electric fields at a given time without computing all preceding electric fields as required by numer… ▽ More In this work we propose a deep neural network based surrogate model for a plasma shadowgraph - a technique for visualization of perturbations in a transparent medium. We are substituting the numerical code by a computationally cheaper projection based surrogate model that is able to approximate the electric fields at a given time without computing all preceding electric fields as required by numerical methods. This means that the projection based surrogate model allows to recover the solution of the governing 3D partial differential equation, 3D wave equation, at any point of a given compute domain and configuration without the need to run a full simulation. This model has shown a good quality of reconstruction in a problem of interpolation of data within a narrow range of simulation parameters and can be used for input data of large size. △ Less

Submitted 1 June, 2021; originally announced June 2021.

Comments: 9 pages, 9 figures. Published as a workshop paper at ICLR 2021 SimDL Workshop

arXiv:2009.03730 [pdf, other]

Large-scale Neural Solvers for Partial Differential Equations

Authors: Patrick Stiller, Friedrich Bethke, Maximilian Böhme, Richard Pausch, Sunna Torge, Alexander Debus, Jan Vorberger, Michael Bussmann, Nico Hoffmann

Abstract: Solving partial differential equations (PDE) is an indispensable part of many branches of science as many processes can be modelled in terms of PDEs. However, recent numerical solvers require manual discretization of the underlying equation as well as sophisticated, tailored code for distributed computing. Scanning the parameters of the underlying model significantly increases the runtime as the s… ▽ More Solving partial differential equations (PDE) is an indispensable part of many branches of science as many processes can be modelled in terms of PDEs. However, recent numerical solvers require manual discretization of the underlying equation as well as sophisticated, tailored code for distributed computing. Scanning the parameters of the underlying model significantly increases the runtime as the simulations have to be cold-started for each parameter configuration. Machine Learning based surrogate models denote promising ways for learning complex relationship among input, parameter and solution. However, recent generative neural networks require lots of training data, i.e. full simulation runs making them costly. In contrast, we examine the applicability of continuous, mesh-free neural solvers for partial differential equations, physics-informed neural networks (PINNs) solely requiring initial/boundary values and validation points for training but no simulation data. The induced curse of dimensionality is approached by learning a domain decomposition that steers the number of neurons per unit volume and significantly improves runtime. Distributed training on large-scale cluster systems also promises great utilization of large quantities of GPUs which we assess by a comprehensive evaluation study. Finally, we discuss the accuracy of GatedPINN with respect to analytical solutions -- as well as state-of-the-art numerical solvers, such as spectral solvers. △ Less

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2007.04244 [pdf, other]

doi 10.1016/j.compfluid.2021.104950

NPLIC: A Machine Learning Approach to Piecewise Linear Interface Construction

Authors: Mohammadmehdi Ataei, Markus Bussmann, Vahid Shaayegan, Franco Costa, Se** Han, Chul B. Park

Abstract: Volume of fluid (VOF) methods are extensively used to track fluid interfaces in numerical simulations, and many VOF algorithms require that the interface be reconstructed geometrically. For this purpose, the Piecewise Linear Interface Construction (PLIC) technique is most frequently used, which for reasons of geometric complexity can be slow and difficult to implement. Here, we propose an alternat… ▽ More Volume of fluid (VOF) methods are extensively used to track fluid interfaces in numerical simulations, and many VOF algorithms require that the interface be reconstructed geometrically. For this purpose, the Piecewise Linear Interface Construction (PLIC) technique is most frequently used, which for reasons of geometric complexity can be slow and difficult to implement. Here, we propose an alternative neural network based method called NPLIC to perform PLIC calculations. The model is trained on a large synthetic dataset of PLIC solutions for square, cubic, triangular, and tetrahedral meshes. We show that this data-driven approach results in accurate calculations at a fraction of the usual computational cost, and a single neural network system can be used for interface reconstruction of different mesh types. △ Less

Submitted 24 January, 2021; v1 submitted 26 June, 2020; originally announced July 2020.

ACM Class: I.2; I.3; G.0

arXiv:1706.10086 [pdf, other]

doi 10.1007/978-3-319-67630-2_36

Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library

Authors: Alexander Matthes, René Widera, Erik Zenker, Benjamin Worpitz, Axel Huebl, Michael Bussmann

Abstract: We present an analysis on optimizing performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this e… ▽ More We present an analysis on optimizing performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to prove that Alpaka allows for platform-specific tuning with a single source code. In addition we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code for bleeding edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architecture as well as IBM's Power8 system. On some of these we are able to reach almost 50\% of the peak floating point operation performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOPS/s on a P100 and over 1 TFLOPS/s on a KNL system. △ Less

Submitted 30 June, 2017; originally announced June 2017.

Comments: Accepted paper for the P\^{}3MA workshop at the ISC 2017 in Frankfurt

Journal ref: J.M. Kunkel et al. (Eds.): ISC High Performance Workshops 2017, LNCS 10524, pp. 496-514, 2017

arXiv:1706.00522 [pdf, other]

doi 10.1007/978-3-319-67630-2_2

On the Scalability of Data Reduction Techniques in Current and Upcoming HPC Systems from an Application Perspective

Authors: Axel Huebl, Rene Widera, Felix Schmitt, Alexander Matthes, Norbert Podhorszki, Jong Youl Choi, Scott Klasky, Michael Bussmann

Abstract: We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as a major challenge for applications on today's and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threa… ▽ More We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as a major challenge for applications on today's and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threaded data-transformations for the I/O library ADIOS as a feasible way to trade underutilized host-side compute potential on heterogeneous systems for reduced I/O latency. △ Less

Submitted 1 June, 2017; originally announced June 2017.

Comments: 15 pages, 5 figures, accepted for DRBSD-1 in conjunction with ISC'17

ACM Class: D.4.8; B.4.3; I.6.6

Journal ref: J.M. Kunkel et al. (Eds.): ISC High Performance Workshops 2017, LNCS 10524, pp. 15-29, 2017

arXiv:1611.09048 [pdf, other]

doi 10.14529/jsfi160403

In situ, steerable, hardware-independent and data-structure agnostic visualization with ISAAC

Authors: Alexander Matthes, Axel Huebl, René Widera, Sebastian Grottel, Stefan Gumhold, Michael Bussmann

Abstract: The computation power of supercomputers grows faster than the bandwidth of their storage and network. Especially applications using hardware accelerators like Nvidia GPUs cannot save enough data to be analyzed in a later step. There is a high risk of loosing important scientific information. We introduce the in situ template library ISAAC which enables arbitrary applications like scientific simula… ▽ More The computation power of supercomputers grows faster than the bandwidth of their storage and network. Especially applications using hardware accelerators like Nvidia GPUs cannot save enough data to be analyzed in a later step. There is a high risk of loosing important scientific information. We introduce the in situ template library ISAAC which enables arbitrary applications like scientific simulations to live visualize their data without the need of deep copy operations or data transformation using the very same compute node and hardware accelerator the data is already residing on. Arbitrary meta data can be added to the renderings and user defined steering commands can be asynchronously sent back to the running application. Using an aggregating server, ISAAC streams the interactive visualization video and enables user to access their applications from everywhere. △ Less

Submitted 28 November, 2016; originally announced November 2016.

Journal ref: Supercomputing Frontiers and Innovations, [S.l.], v. 3, n. 4, p. 30-48, oct. 2016

arXiv:1606.02862 [pdf, other]

doi 10.1007/978-3-319-46079-6_21

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond

Authors: Erik Zenker, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, Michael Bussmann

Abstract: With the appearance of the heterogeneous platform OpenPower,many-core accelerator devices have been coupled with Power host processors for the first time. Towards utilizing their full potential, it is worth investigating performance portable algorithms that allow to choose the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGPUs,… ▽ More With the appearance of the heterogeneous platform OpenPower,many-core accelerator devices have been coupled with Power host processors for the first time. Towards utilizing their full potential, it is worth investigating performance portable algorithms that allow to choose the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGPUs, our presented approach relies heavily on abstract meta-programming techniques, which are essential to focus on fine-grained tuning rather than code porting. With this in mind, the CUDA-based open-source plasma simulation code PIConGPU is currently being abstracted to support the heterogeneous OpenPower platform using our fast porting interface cupla, which wraps the abstract parallel C++11 kernel acceleration library Alpaka. We demonstrate how PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, Power8 CPUs and NVIDIA GPUs. △ Less

Submitted 12 June, 2016; v1 submitted 9 June, 2016; originally announced June 2016.

Comments: 9 pages, 3 figures, accepted on IWOPH 2016

Journal ref: Lecture Notes in Computer Science, 9945, pp 293-301, 2016

arXiv:1602.08477 [pdf, other]

doi 10.1109/IPDPSW.2016.50

Alpaka - An Abstraction Library for Parallel Kernel Acceleration

Authors: Erik Zenker, Benjamin Worpitz, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, Michael Bussmann

Abstract: Porting applications to new hardware or programming models is a tedious and error prone process. Every help that eases these burdens is saving developer time that can then be invested into the advancement of the application itself instead of preserving the status-quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model explo… ▽ More Porting applications to new hardware or programming models is a tedious and error prone process. Every help that eases these burdens is saving developer time that can then be invested into the advancement of the application itself instead of preserving the status-quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it allows to achieve platform and performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs and other accelerators) are supported for and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization. Running Alpaka applications on a new (and supported) platform requires the change of only one source code line instead of a lot of \#ifdefs. △ Less

Submitted 26 February, 2016; originally announced February 2016.

Comments: 10 pages, 10 figures

Showing 1–20 of 20 results for author: Bussmann, M