-
GPU-Acceleration of the ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems
Authors:
Victor Wen-zhe Yu,
Jonathan Moussa,
Pavel Kůs,
Andreas Marek,
Peter Messmer,
Mina Yoon,
Hermann Lederer,
Volker Blum
Abstract:
The solution of eigenproblems is often a key computational bottleneck that limits the tractable system size of numerical algorithms, among them electronic structure theory in chemistry and in condensed matter physics. Large eigenproblems can easily exceed the capacity of a single compute node, thus must be solved on distributed-memory parallel computers. We here present GPU-oriented optimizations of the ELPA two-stage tridiagonalization eigensolver (ELPA2). On top of cuBLAS-based GPU offloading, we add a CUDA kernel to speed up the back-transformation of eigenvectors, which can be the computationally most expensive part of the two-stage tridiagonalization algorithm. We benchmark the performance of this GPU-accelerated eigensolver on two hybrid CPU-GPU architectures, namely a compute cluster based on Intel Xeon Gold CPUs and NVIDIA Volta GPUs, and the Summit supercomputer based on IBM POWER9 CPUs and NVIDIA Volta GPUs. Consistent with previous benchmarks on CPU-only architectures, the GPU-accelerated two-stage solver exhibits a parallel performance superior to the one-stage counterpart. Finally, we demonstrate the performance of the GPU-accelerated eigensolver developed in this work for routine semi-local KS-DFT calculations comprising thousands of atoms.
Submitted 14 January, 2021; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Performance report and optimized implementations of Weather & Climate dwarfs on multi-node systems
Authors:
Louis Douriez,
Alan Gray,
David Guibert,
Peter Messmer,
Erwan Raffin
Abstract:
This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain specific languages.
Here we summarize the work performed on optimizations of the dwarfs focusing on CPU multi-nodes and multi-GPUs. We limit ourselves to a subset of the dwarf configurations chosen by the consortium. Intra-node optimizations of the dwarfs and energy-specific optimizations have been described in Deliverable D3.3. To cover the important algorithmic motifs we picked dwarfs related to the dynamical core as well as column physics. Specifically, we focused on the formulation relevant to spectral codes like ECMWF's IFS code.
The main findings of this report are: (a) up to 30% performance gain on CPU-based multi-node systems compared to the optimized versions of the dwarfs from task 3.3 (see D3.3), (b) up to 10x performance gain on multiple GPUs from optimizations to keep data resident on the GPU and enable fast inter-GPU communication mechanisms, and (c) multi-GPU systems which feature a high-bandwidth all-to-all interconnect topology with NVLink/NVSwitch hardware are particularly well suited to the algorithms.
Submitted 16 August, 2019;
originally announced August 2019.
-
Performance report and optimized implementation of Weather & Climate Dwarfs on GPU, MIC and Optalysys Optical Processor
Authors:
Cyril Mazauric,
Erwan Raffin,
Xavier Vigouroux,
David Guibert,
Alex Macfaden,
Jacob Poulsen,
Per Berg,
Alan Gray,
Peter Messmer
Abstract:
This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain specific languages.
Here we summarize the work performed on optimizations of the dwarfs on CPUs, Xeon Phi, GPUs and on the Optalysys optical processor. We limit ourselves to a subset of the dwarf configurations and to problem sizes small enough to execute on a single node. Also, we use time-to-solution as the main performance metric. Multi-node optimizations of the dwarfs and energy-specific optimizations are beyond the scope of this report and will be described in Deliverable D3.4. To cover the important algorithmic motifs we picked dwarfs related to the dynamical core as well as column physics. Specifically, we focused on the formulation relevant to spectral codes like ECMWF's IFS code.
The main findings of this report are: (a) Acceleration of 1.1x - 2.5x of the dwarfs on CPU based systems using compiler directives, (b) order of magnitude acceleration of the dwarfs on GPUs (23x for spectral transform, 9x for MPDATA) using data locality optimizations and (c) demonstrated feasibility of a spectral transform in a purely optical fashion.
Submitted 16 August, 2019;
originally announced August 2019.
-
Photo-Guided Exploration of Volume Data Features
Authors:
Mohammad Raji,
Alok Hota,
Robert Sisneros,
Peter Messmer,
Jian Huang
Abstract:
In this work, we pose the question of whether, by considering qualitative information such as a sample target image as input, one can produce a rendered image of scientific data that is similar to the target. The algorithm resulting from our research allows one to ask the question of whether features like those in the target image exist in a given dataset. In that way, our method is one of imagery query or reverse engineering, as opposed to manual parameter tweaking of the full visualization pipeline. For target images, we can use real-world photographs of physical phenomena. Our method leverages deep neural networks and evolutionary optimization. Using a trained similarity function that measures the difference between renderings of a phenomenon and real-world photographs, our method optimizes rendering parameters. We demonstrate the efficacy of our method using a superstorm simulation dataset and images found online. We also discuss a parallel implementation of our method, which was run on NCSA's Blue Waters.
Submitted 18 October, 2017;
originally announced October 2017.
-
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
Authors:
F. Hariri,
T. M. Tran,
A. Jocksch,
E. Lanti,
J. Progsch,
P. Messmer,
S. Brunner,
G. Gheller,
L. Villard
Abstract:
We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphics Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the one on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.
Submitted 9 March, 2016;
originally announced March 2016.
-
Apar-T: code, validation, and physical interpretation of particle-in-cell results
Authors:
Mickaël Melzani,
Christophe Winisdoerffer,
Rolf Walder,
Doris Folini,
Jean M. Favre,
Stefan Krastanov,
Peter Messmer
Abstract:
We present the parallel particle-in-cell (PIC) code Apar-T and, more importantly, address the fundamental question of the relations between the PIC model, the Vlasov-Maxwell theory, and real plasmas.
First, we present four validation tests: spectra from simulations of thermal plasmas, linear growth rates of the relativistic tearing instability and of the filamentation instability, and non-linear filamentation merging phase. For the filamentation instability we show that the effective growth rates measured on the total energy can differ by more than 50% from the linear cold predictions and from the fastest modes of the simulation.
Second, we detail a new method for initial loading of Maxwell-Jüttner particle distributions with relativistic bulk velocity and relativistic temperature, and explain why the traditional method with individual particle boosting fails.
Third, we scrutinize the question of what description of physical plasmas is obtained by PIC models. These models rely on two building blocks: coarse-graining, i.e., grouping of the order of p~10^10 real particles into a single computer superparticle, and field storage on a grid with its subsequent finite superparticle size. We introduce the notion of coarse-graining dependent quantities, i.e., quantities depending on p. They derive from the PIC plasma parameter Lambda^{PIC}, which we show to scale as 1/p. We explore two implications. One is that PIC collision- and fluctuation-induced thermalization times are expected to scale with the number of superparticles per grid cell, and thus to be a factor p~10^10 smaller than in real plasmas. The other is that the level of electric field fluctuations scales as 1/Lambda^{PIC} ~ p. We provide a corresponding exact expression.
Fourth, we compare the Vlasov-Maxwell theory, which describes a phase-space fluid with infinite Lambda, to the PIC model and its relatively small Lambda.
Submitted 27 August, 2013;
originally announced August 2013.
-
Status of GDL - GNU Data Language
Authors:
A. Coulais,
M. Schellens,
J. Gales,
S. Arabas,
M. Boquien,
P. Chanial,
P. Messmer,
D. Fillmore,
O. Poplawski,
S. Maret,
G. Marchal,
N. Galmiche,
T. Mermet
Abstract:
GNU Data Language (GDL) is an open-source interpreted language aimed at numerical data analysis and visualisation. It is a free implementation of the Interactive Data Language (IDL) widely used in Astronomy. GDL has full syntax compatibility with IDL, and includes a large set of library routines targeting advanced matrix manipulation, plotting, time-series and image analysis, mapping, and data input/output including numerous scientific data formats. We will present the current status of the project, the key accomplishments, and the weaknesses: areas where contributions are welcome!
Submitted 4 January, 2011;
originally announced January 2011.
-
On particle acceleration and trapping by Poynting flux dominated flows
Authors:
Gunnar Paesold,
Eric G. Blackman,
Peter Messmer
Abstract:
Using particle-in-cell (PIC) simulations, we study the evolution of a strongly magnetized plasma slab propagating into a finite density ambient medium. Like previous work, we find that the slab breaks into discrete magnetic pulses. The subsequent evolution is consistent with diamagnetic relativistic pulse acceleration of \cite{liangetal2003}. Unlike previous work, we use the actual electron to proton mass ratio and focus on understanding trapping vs. transmission of the ambient plasma by the pulses and on the particle acceleration spectra. We find that the accelerated electron distribution internal to the slab develops a double-power law. We predict that emission from reflected/trapped external electrons will peak after that of the internal electrons. We also find that the thin discrete pulses trap ambient electrons but allow protons to pass through, resulting in less drag on the pulse than in the case of trapping of both species. Poynting flux dominated scenarios have been proposed as the driver of relativistic outflows and particle acceleration in the most powerful astrophysical jets.
Submitted 24 August, 2005;
originally announced August 2005.
-
Temperature Isotropization in Solar Flare Plasmas due to the Electron Firehose Instability
Authors:
Peter Messmer
Abstract:
The isotropization process of a collisionless plasma with an electron temperature anisotropy along an external magnetic field ($T_\| ^e\gg T_\perp^e$, $\|$ and $\perp$ with respect to the background magnetic field) and isotropic protons is investigated using a particle-in-cell (PIC) code. Restricting wave growth mainly parallel to the external magnetic field, the isotropization mechanism is identified to be the Electron Firehose Instability (EFI). The free energy in the electrons is first transformed into left-hand circularly polarized transverse low-frequency waves by a non-resonant interaction. Fast electrons can then be scattered towards higher perpendicular velocities by gyroresonance, leading finally to a complete isotropization of the velocity distribution. During this phase of the instability, Langmuir waves are generated which may lead to the emission of radio waves. A large fraction of the protons is resonant with the left-hand polarized electromagnetic waves, creating a proton temperature anisotropy $T_\|^p < T_\perp^p$. The parameters of the simulated plasma are chosen to be compatible with solar flare conditions. The results indicate the significance of this mechanism in the particle acceleration context: the EFI limits the anisotropy of the electron velocity distribution, and thus provides the necessary condition for further acceleration. It enhances the pitch-angle of the electrons and heats the ions.
Submitted 15 November, 2001;
originally announced November 2001.
-
High-sensitivity observations of solar flare decimeter radiation
Authors:
Arnold O. Benz,
Peter Messmer,
Christian Monstein
Abstract:
A new acousto-optic radio spectrometer has observed the 1 - 2 GHz radio emission of solar flares with unprecedented sensitivity. The number of detected decimeter type III bursts is greatly enhanced compared to observations by conventional spectrometers observing only one frequency at a time. The observations indicate a large number of electron beams propagating in dense plasmas. For the first time, we report weak, reversed drifting type III bursts at frequencies above simultaneous narrowband decimeter spikes. The type III bursts are reliable signatures of electron beams propagating downward in the corona, apparently away from the source of the spikes. The observations contradict the most popular spike model that places the spike sources at the footpoints of loops. Conspicuous also was an apparent bidirectional type U burst forming a fish-like pattern. It occurs simultaneously with an intense U-burst at 600-370 MHz observed in Tremsdorf. We suggest that it intermodulated with strong terrestrial interference (cellular phones) causing a spurious symmetric pattern in the spectrogram at 1.4 GHz. Symmetric features in the 1 - 2 GHz range, some already reported in the literature, therefore must be considered with utmost caution.
Submitted 5 December, 2000;
originally announced December 2000.
-
The Minimum Bandwidth of Narrowband Spikes in Solar Flare Decimetric Radio Waves
Authors:
Peter Messmer,
Arnold O. Benz
Abstract:
The minimum and the mean bandwidth of individual narrowband spikes in two events in decimetric radio waves are determined by means of multi-resolution analysis. Spikes of a few tens of milliseconds duration occur at decimetric/microwave wavelengths in the particle acceleration phase of solar flares. A first method determines the dominant spike bandwidth scale based on their scalegram, the mean squared wavelet coefficient at each frequency scale. This allows the scale bandwidth to be measured independently of heuristic spike selection criteria, e.g. manual selection. The major drawback is a low resolution in the bandwidth. To overcome this uncertainty, a feature detection algorithm and a criterion for spike shape in the time-frequency plane are applied to locate the spikes. In that case, the bandwidth is measured by fitting an assumed spike profile into the denoised data. The smallest FWHM bandwidth of spikes was found at 0.17% and 0.41% of the center frequency in the two events. Knowing the shortest relevant bandwidth of spikes, the slope of the Fourier power spectrum of these two events was determined and no resemblance to a Kolmogorov spectrum was detected. Additionally, the correlation between spike peak flux and bandwidth was examined.
Submitted 23 December, 1999;
originally announced December 1999.