-
Effects of round-to-nearest and stochastic rounding in the numerical solution of the heat equation in low precision
Authors:
Matteo Croci,
Michael B. Giles
Abstract:
Motivated by the advent of machine learning, the last few years have seen the return of hardware-supported low-precision computing. Computations with fewer digits are faster and more memory and energy efficient, but can be extremely susceptible to rounding errors. As shown by recent studies into reduced-precision climate simulations, an application that can largely benefit from the advantages of low-precision computing is the numerical solution of partial differential equations (PDEs). However, a careful implementation and rounding error analysis are required to ensure that sensible results can still be obtained.
In this paper we study the accumulation of rounding errors in the solution of the heat equation, a proxy for parabolic PDEs, via Runge-Kutta finite difference methods using round-to-nearest (RtN) and stochastic rounding (SR). We demonstrate how to implement the scheme to reduce rounding errors and we derive \emph{a priori} estimates for local and global rounding errors. Let $u$ be the unit roundoff. While the worst-case local errors are $O(u)$ with respect to the discretization parameters (mesh size and timestep), the RtN and SR error behavior is substantially different. In fact, the RtN solution always stagnates for small enough $Δt$, and until stagnation the global error grows like $O(uΔt^{-1})$. In contrast, we show that the leading-order errors introduced by SR are zero-mean, independent in space and mean-independent in time, making SR resilient to stagnation and rounding error accumulation. In fact, we prove that for SR the global rounding errors are only $O(uΔt^{-1/4})$ in 1D and are essentially bounded (up to logarithmic factors) in higher dimensions.
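The contrast between RtN stagnation and SR's zero-mean errors can be illustrated with a toy accumulation. The following is a minimal sketch, not the paper's Runge-Kutta finite difference scheme: low precision is imitated by rounding to a uniform grid, and the names `round_to_nearest`, `stochastic_round`, `accumulate`, and the parameter `spacing` are hypothetical and introduced only for this illustration.

```python
import math
import random


def round_to_nearest(x, spacing):
    """Round x to the nearest multiple of spacing (assumed uniform grid)."""
    return round(x / spacing) * spacing


def stochastic_round(x, spacing, rng):
    """Round x to a neighbouring multiple of spacing.

    The probability of rounding up equals the fractional distance from the
    lower neighbour, so the expected value of the result is x.
    """
    scaled = x / spacing
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * spacing


def accumulate(increment, steps, rounder):
    """Add the same increment repeatedly, rounding after every addition."""
    total = 0.0
    for _ in range(steps):
        total = rounder(total + increment)
    return total


if __name__ == "__main__":
    spacing = 2.0 ** -8      # grid spacing standing in for the unit roundoff
    increment = 2.0 ** -11   # update smaller than half the grid spacing
    steps = 4096             # the exact sum is steps * increment = 2.0
    rng = random.Random(0)

    rtn = accumulate(increment, steps, lambda x: round_to_nearest(x, spacing))
    sr = accumulate(increment, steps, lambda x: stochastic_round(x, spacing, rng))
    print("exact:", steps * increment)
    print("round-to-nearest:", rtn)   # every update is rounded away
    print("stochastic rounding:", sr)
```

In this toy setting each update is below half the grid spacing, so round-to-nearest discards it every time and the sum never leaves zero, whereas stochastic rounding keeps the sum correct in expectation.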
Submitted 28 March, 2022; v1 submitted 30 October, 2020;
originally announced October 2020.
-
Beyond 16GB: Out-of-Core Stencil Computations
Authors:
Istvan Z Reguly,
Gihan R Mudalige,
Michael B Giles
Abstract:
Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which limits the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most 15\% loss in efficiency.
Submitted 26 October, 2017; v1 submitted 7 September, 2017;
originally announced September 2017.
-
Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
Authors:
Istvan Z Reguly,
Gihan R Mudalige,
Mike B Giles
Abstract:
The key common bottleneck in most stencil codes is data movement, and prior research has shown that improving data locality through optimisations that schedule across loops is particularly effective. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data-locality-improving optimisation called iteration space slicing for use in large OPS applications on both shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of $2\times$ on the CloverLeaf 2D/3D proxy applications, which contain 83 and 141 loops respectively, $3.5\times$ on the linear solver TeaLeaf, and $1.7\times$ on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16GB, and we carry out scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop data access information.
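The idea of scheduling across loops to improve locality can be sketched on two consecutive 1D stencil loops. This is a simplified illustration only, assuming overlapped tiles with a one-point halo that is recomputed; it does not reproduce the OPS run-time analysis or iteration space slicing, and the names `smooth`, `two_sweeps_untiled`, and `two_sweeps_tiled` are hypothetical.

```python
def smooth(src, dst, lo, hi):
    """Write the 3-point average of src into dst for indices lo..hi-1."""
    for i in range(lo, hi):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0


def two_sweeps_untiled(a):
    """Run each loop over the whole interior before starting the next one."""
    n = len(a)
    b = list(a)  # boundary points keep the values of a
    c = list(a)
    smooth(a, b, 1, n - 1)
    smooth(b, c, 1, n - 1)
    return c


def two_sweeps_tiled(a, tile):
    """Run both loops tile by tile, so b is reused while it is still hot.

    The first loop covers the tile plus a one-point halo on each side,
    which is exactly what the second loop reads. Halo points are
    recomputed by neighbouring tiles (an assumed simplification).
    """
    n = len(a)
    b = list(a)
    c = list(a)
    for t0 in range(1, n - 1, tile):
        t1 = min(t0 + tile, n - 1)
        smooth(a, b, max(1, t0 - 1), min(n - 1, t1 + 1))
        smooth(b, c, t0, t1)
    return c


if __name__ == "__main__":
    data = [float((7 * i) % 11) for i in range(50)]
    print(two_sweeps_tiled(data, 8) == two_sweeps_untiled(data))
```

Both schedules compute the same values; the tiled version only changes the order of work, so that the intermediate array is consumed shortly after it is produced.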
Submitted 26 June, 2017; v1 submitted 3 April, 2017;
originally announced April 2017.
-
Acceleration of a Full-scale Industrial CFD Application with OP2
Authors:
István Z. Reguly,
Gihan R. Mudalige,
Carlo Bertolli,
Michael B. Giles,
Adam Betts,
Paul H. J. Kelly,
David Radford
Abstract:
Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls-Royce plc. It consists of over 300 parallel loops with a code base exceeding 50K lines and is capable of performing complex simulations over highly detailed unstructured mesh geometries. Unlike simpler structured-mesh applications, which feature high speed-ups when accelerated by modern processor architectures, such as multi-core and many-core processor systems, Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework. OP2 targets the domain of unstructured mesh problems and follows the design of an active library using source-to-source translation and compilation to generate multiple parallel implementations from a single high-level application source for execution on a range of back-end hardware platforms. We chart the conversion of Hydra from its original hand-tuned production version to one that utilizes OP2, and map out the key difficulties encountered in the process. To our knowledge this research presents the first application of such a high-level framework to a full-scale production code. Specifically, we show (1) how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application such as Hydra, and (2) how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate that not only can the runtime performance of the hand-tuned original production code be matched, but it can also be significantly improved on conventional processor systems. Additionally, we achieve further...
Submitted 27 March, 2014;
originally announced March 2014.