-
Accelerating Dedispersion using Many-Core Architectures
Authors:
Jan Novotný,
Karel Adámek,
M. A. Clark,
Mike Giles,
Wesley Armour
Abstract:
Astrophysical radio signals are excellent probes of extreme physical processes that emit them. However, to reach Earth, electromagnetic radiation passes through the ionised interstellar medium (ISM), introducing a frequency-dependent time delay (dispersion) to the emitted signal. Removing dispersion enables searches for transient signals like Fast Radio Bursts (FRB) or repeating signals from isola…
▽ More
Astrophysical radio signals are excellent probes of extreme physical processes that emit them. However, to reach Earth, electromagnetic radiation passes through the ionised interstellar medium (ISM), introducing a frequency-dependent time delay (dispersion) to the emitted signal. Removing dispersion enables searches for transient signals like Fast Radio Bursts (FRB) or repeating signals from isolated pulsars or those in orbit around other compact objects. The sheer volume and high resolution of data that next generation radio telescopes will produce require High-Performance Computing (HPC) solutions and algorithms to be used in time-domain data processing pipelines to extract scientifically valuable results in real-time. This paper presents a state-of-the-art implementation of brute force incoherent dedispersion on NVIDIA GPUs, and on Intel and AMD CPUs. We show that our implementation is 4x faster (8-bit 8192 channels input) than other available solutions and demonstrate, using 11 existing telescopes, that our implementation is at least 20 faster than real-time. This work is part of the AstroAccelerate package.
△ Less
Submitted 9 November, 2023;
originally announced November 2023.
-
Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision
Authors:
Matteo Croci,
Michael B. Giles
Abstract:
Motivated by the advent of machine learning, the last few years have seen the return of hardware-supported low-precision computing. Computations with fewer digits are faster and more memory and energy efficient, but can be extremely susceptible to rounding errors. As shown by recent studies into reduced-precision climate simulations, an application that can largely benefit from the advantages of l…
▽ More
Motivated by the advent of machine learning, the last few years have seen the return of hardware-supported low-precision computing. Computations with fewer digits are faster and more memory and energy efficient, but can be extremely susceptible to rounding errors. As shown by recent studies into reduced-precision climate simulations, an application that can largely benefit from the advantages of low-precision computing is the numerical solution of partial differential equations (PDEs). However, a careful implementation and rounding error analysis are required to ensure that sensible results can still be obtained.
In this paper we study the accumulation of rounding errors in the solution of the heat equation, a proxy for parabolic PDEs, via Runge-Kutta finite difference methods using round-to-nearest (RtN) and stochastic rounding (SR). We demonstrate how to implement the scheme to reduce rounding errors and we derive \emph{a priori} estimates for local and global rounding errors. Let $u$ be the unit roundoff. While the worst-case local errors are $O(u)$ with respect to the discretization parameters (mesh size and timestep), the RtN and SR error behavior is substantially different. In fact, the RtN solution always stagnates for small enough $Δt$, and until stagnation the global error grows like $O(uΔt^{-1})$. In contrast, we show that the leading-order errors introduced by SR are zero-mean, independent in space and mean-independent in time, making SR resilient to stagnation and rounding error accumulation. In fact, we prove that for SR the global rounding errors are only $O(uΔt^{-1/4})$ in 1D and are essentially bounded (up to logarithmic factors) in higher dimensions.
△ Less
Submitted 28 March, 2022; v1 submitted 30 October, 2020;
originally announced October 2020.
-
GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory
Authors:
Karel Adámek,
Sofia Dimoudi,
Mike Giles,
Wesley Armour
Abstract:
We present an implementation of the overlap-and-save method, a method for the convolution of very long signals with short response functions, which is tailored to GPUs. We have implemented several FFT algorithms (using the CUDA programming language) which exploit GPU shared memory, allowing for GPU accelerated convolution. We compare our implementation with an implementation of the overlap-and-sav…
▽ More
We present an implementation of the overlap-and-save method, a method for the convolution of very long signals with short response functions, which is tailored to GPUs. We have implemented several FFT algorithms (using the CUDA programming language) which exploit GPU shared memory, allowing for GPU accelerated convolution. We compare our implementation with an implementation of the overlap-and-save algorithm utilizing the NVIDIA FFT library (cuFFT). We demonstrate that by using a shared memory based FFT we can achieved significant speed-ups for certain problem sizes and lower the memory requirements of the overlap-and-save method on GPUs.
△ Less
Submitted 10 April, 2020; v1 submitted 4 October, 2019;
originally announced October 2019.
-
Beyond 16GB: Out-of-Core Stencil Computations
Authors:
Istvan Z Reguly,
Gihan R Mudalige,
Michael B Giles
Abstract:
Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which is limiting the size of the problems that can be efficiently solved. In this paper, we addres…
▽ More
Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which is limiting the size of the problems that can be efficiently solved. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement. Evaluating our work on Intel's Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve 3 times larger problems than the on-chip memory size with at most 15\% loss in efficiency
△ Less
Submitted 26 October, 2017; v1 submitted 7 September, 2017;
originally announced September 2017.
-
Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
Authors:
Istvan Z Reguly,
Gihan R Mudalige,
Mike B Giles
Abstract:
The key common bottleneck in most stencil codes is data movement, and prior research has shown that improving data locality through optimisations that schedule across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time…
▽ More
The key common bottleneck in most stencil codes is data movement, and prior research has shown that improving data locality through optimisations that schedule across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications both in shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2$\times$ on the Cloverleaf 2D/3D proxy application, which contain 83/141 loops respectively, $3.5\times$ on the linear solver TeaLeaf, and $1.7\times$ on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16GB, and we do scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per loop data access information.
△ Less
Submitted 26 June, 2017; v1 submitted 3 April, 2017;
originally announced April 2017.
-
Acceleration of a Full-scale Industrial CFD Application with OP2
Authors:
István Z. Reguly,
Gihan R. Mudalige,
Carlo Bertolli,
Michael B. Giles,
Adam Betts,
Paul H. J. Kelly,
David Radford
Abstract:
Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc. It consists of over 300 parallel loops with a code base exceeding 50K lines and is capable of performing complex simulations over highly detailed unstructured mesh geometries. Unlike simpler structured-mesh applications, which feature high speed-ups when accelerated by modern processor archit…
▽ More
Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc. It consists of over 300 parallel loops with a code base exceeding 50K lines and is capable of performing complex simulations over highly detailed unstructured mesh geometries. Unlike simpler structured-mesh applications, which feature high speed-ups when accelerated by modern processor architectures, such as multi-core and many-core processor systems, Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework. OP2 targets the domain of unstructured mesh problems and follows the design of an active library using source-to-source translation and compilation to generate multiple parallel implementations from a single high-level application source for execution on a range of back-end hardware platforms. We chart the conversion of Hydra from its original hand-tuned production version to one that utilizes OP2, and map out the key difficulties encountered in the process. To our knowledge this research presents the first application of such a high-level framework to a full scale production code. Specifically we show (1) how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application such as Hydra, and (2) how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate that not only the same runtime performance as that of the hand-tuned original production code could be achieved, but it can be significantly improved on conventional processor systems. Additionally, we achieve further...
△ Less
Submitted 27 March, 2014;
originally announced March 2014.
-
Software Abstractions and Methodologies for HPC Simulation Codes on Future Architectures
Authors:
A. Dubey,
S. Brandt,
R. Brower,
M. Giles,
P. Hovland,
D. Q. Lamb,
F. Loffler,
B. Norris,
B. OShea,
C. Rebbi,
M. Snir,
R. Thakur
Abstract:
Large, complex, multi-scale, multi-physics simulation codes, running on high performance com-puting (HPC) platforms, have become essential to advancing science and engineering. These codes simulate multi-scale, multi-physics phenomena with unprecedented fidelity on petascale platforms, and are used by large communities. Continued ability of these codes to run on future platforms is as crucial to t…
▽ More
Large, complex, multi-scale, multi-physics simulation codes, running on high performance com-puting (HPC) platforms, have become essential to advancing science and engineering. These codes simulate multi-scale, multi-physics phenomena with unprecedented fidelity on petascale platforms, and are used by large communities. Continued ability of these codes to run on future platforms is as crucial to their communities as continued improvements in instruments and facilities are to experimental scientists. However, the ability of code developers to do these things faces a serious challenge with the paradigm shift underway in platform architecture. The complexity and uncertainty of the future platforms makes it essential to approach this challenge cooperatively as a community. We need to develop common abstractions, frameworks, programming models and software development methodologies that can be applied across a broad range of complex simulation codes, and common software infrastructure to support them. In this position paper we express and discuss our belief that such an infrastructure is critical to the deployment of existing and new large, multi-scale, multi-physics codes on future HPC platforms.
△ Less
Submitted 6 September, 2013;
originally announced September 2013.