-
Lossy Checkpoint Compression in Full Waveform Inversion: a case study with ZFPv0.5.5 and the Overthrust Model
Authors:
Navjot Kukreja,
Jan Hueckelheim,
Mathias Louboutin,
John Washbourne,
Paul H. J. Kelly,
Gerard J. Gorman
Abstract:
This paper proposes a new method that combines check-pointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transf…
▽ More
This paper proposes a new method that combines check-pointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.
△ Less
Submitted 15 September, 2021; v1 submitted 26 September, 2020;
originally announced September 2020.
-
Automatic Differentiation for Adjoint Stencil Loops
Authors:
Jan Hückelheim,
Navjot Kukreja,
Sri Hari Krishna Narayanan,
Fabio Luporini,
Gerard Gorman,
Paul Hovland
Abstract:
Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoin…
▽ More
Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable.
In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.
△ Less
Submitted 5 July, 2019;
originally announced July 2019.
-
Towards automatically building starting models for full-waveform inversion using global optimization methods: A PSO approach via DEAP + Devito
Authors:
Oscar F. Mojica,
Navjot Kukreja
Abstract:
In this work, we illustrate an example of estimating the macro-model of velocities in the subsurface through the use of global optimization methods (GOMs). The optimization problem is solved using DEAP (Distributed Evolutionary Algorithms in Python) and Devito, python frameworks for evolutionary and automated finite difference computations, respectively. We implement a Particle swarm optimization…
▽ More
In this work, we illustrate an example of estimating the macro-model of velocities in the subsurface through the use of global optimization methods (GOMs). The optimization problem is solved using DEAP (Distributed Evolutionary Algorithms in Python) and Devito, python frameworks for evolutionary and automated finite difference computations, respectively. We implement a Particle swarm optimization (PSO) with an "elitism strategy" on top of DEAP, leveraging its transparent, simple and coherent environment for implementing of evolutionary algorithms (EAs). The high computational effort, due to the huge number of cost function evaluations (each one demanding a forward modeling step) required by PSO, is alleviated through the use of Devito as well as through parallelization with Dask. The combined use of these frameworks yields not only an efficient way of providing acoustic macro models of the P-wave velocity field (Vp), but also significantly reduces the amount of human effort in fulfilling this task.
△ Less
Submitted 29 May, 2019;
originally announced May 2019.
-
Training on the Edge: The why and the how
Authors:
Navjot Kukreja,
Alena Shilova,
Olivier Beaumont,
Jan Huckelheim,
Nicola Ferrier,
Paul Hovland,
Gerard Gorman
Abstract:
Edge computing is the natural progression from Cloud computing, where, instead of collecting all data and processing it centrally, like in a cloud computing environment, we distribute the computing power and try to do as much processing as possible, close to the source of the data. There are various reasons this model is being adopted quickly, including privacy, and reduced power and bandwidth req…
▽ More
Edge computing is the natural progression from Cloud computing, where, instead of collecting all data and processing it centrally, like in a cloud computing environment, we distribute the computing power and try to do as much processing as possible, close to the source of the data. There are various reasons this model is being adopted quickly, including privacy, and reduced power and bandwidth requirements on the Edge nodes. While it is common to see inference being done on Edge nodes today, it is much less common to do training on the Edge. The reasons for this range from computational limitations, to it not being advantageous in reducing communications between the Edge nodes. In this paper, we explore some scenarios where it is advantageous to do training on the Edge, as well as the use of checkpointing strategies to save memory.
△ Less
Submitted 13 February, 2019;
originally announced March 2019.
-
Combining Checkpointing and Data Compression to Accelerate Adjoint-Based Optimization Problems
Authors:
Navjot Kukreja,
Jan Hueckelheim,
Mathias Louboutin,
Fabio Luporini,
Paul Hovland,
Gerard Gorman
Abstract:
Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected point…
▽ More
Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected points in time, and values at other times are recomputed as needed from the last stored state. This allows arbitrarily large adjoint computations with limited memory, at the cost of additional recomputations.
In this paper, we combine compression and checkpointing for the first time to compute a realistic seismic inversion. The combination of checkpointing and compression allows larger adjoint computations compared to using only compression, and reduces the recomputation overhead significantly compared to using only checkpointing.
△ Less
Submitted 20 September, 2021; v1 submitted 11 October, 2018;
originally announced October 2018.
-
Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
Authors:
Mathias Louboutin,
Michael Lange,
Fabio Luporini,
Navjot Kukreja,
Philipp A. Witte,
Felix J. Herrmann,
Paulius Velesko,
Gerard J. Gorman
Abstract:
We introduce Devito, a new domain-specific language for implementing high-performance finite difference partial differential equation solvers. The motivating application is exploration seismology where methods such as Full-Waveform Inversion and Reverse-Time Migration are used to invert terabytes of seismic data to create images of the earth's subsurface. Even using modern supercomputers, it can t…
▽ More
We introduce Devito, a new domain-specific language for implementing high-performance finite difference partial differential equation solvers. The motivating application is exploration seismology where methods such as Full-Waveform Inversion and Reverse-Time Migration are used to invert terabytes of seismic data to create images of the earth's subsurface. Even using modern supercomputers, it can take weeks to process a single seismic survey and create a useful subsurface image. The computational cost is dominated by the numerical solution of wave equations and their corresponding adjoints. Therefore, a great deal of effort is invested in aggressively optimizing the performance of these wave-equation propagators for different computer architectures. Additionally, the actual set of partial differential equations being solved and their numerical discretization is under constant innovation as increasingly realistic representations of the physics are developed, further ratcheting up the cost of practical solvers. By embedding a domain-specific language within Python and making heavy use of SymPy, a symbolic mathematics library, we make it possible to develop finite difference simulators quickly using a syntax that strongly resembles the mathematics. The Devito compiler reads this code and applies a wide range of analysis to generate highly optimized and parallel code. This approach can reduce the development time of a verified and optimized solver from months to days.
△ Less
Submitted 9 August, 2019; v1 submitted 6 August, 2018;
originally announced August 2018.
-
Architecture and performance of Devito, a system for automated stencil computation
Authors:
Fabio Luporini,
Michael Lange,
Mathias Louboutin,
Navjot Kukreja,
Jan Hückelheim,
Charles Yount,
Philipp Witte,
Paul H. J. Kelly,
Felix J. Herrmann,
Gerard J. Gorman
Abstract:
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process…
▽ More
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process---from mathematical equations down to C++ code---is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
△ Less
Submitted 7 February, 2020; v1 submitted 9 July, 2018;
originally announced July 2018.
-
Backpropagation for long sequences: beyond memory constraints with constant overheads
Authors:
Navjot Kukreja,
Jan Hückelheim,
Gerard J. Gorman
Abstract:
Naive backpropagation through time has a memory footprint that grows linearly in the sequence length, due to the need to store each state of the forward propagation. This is a problem for large networks. Strategies have been developed to trade memory for added computations, which results in a sublinear growth of memory footprint or computation overhead. In this work, we present a library that uses…
▽ More
Naive backpropagation through time has a memory footprint that grows linearly in the sequence length, due to the need to store each state of the forward propagation. This is a problem for large networks. Strategies have been developed to trade memory for added computations, which results in a sublinear growth of memory footprint or computation overhead. In this work, we present a library that uses asynchronous storing and prefetching to move data to and from slow and cheap stor- age. The library only stores and prefetches states as frequently as possible without delaying the computation, and uses the optimal Revolve backpropagation strategy for the computations in between. The memory footprint of the backpropagation can thus be reduced to any size (e.g. to fit into DRAM), while the computational overhead is constant in the sequence length, and only depends on the ratio between compute and transfer times on a given hardware. We show in experiments that by exploiting asyncronous data transfer, our strategy is always at least as fast, and usually faster than the previously studied "optimal" strategies.
△ Less
Submitted 22 May, 2018;
originally announced June 2018.
-
High-level python abstractions for optimal checkpointing in inversion problems
Authors:
Navjot Kukreja,
Jan Hückelheim,
Michael Lange,
Mathias Louboutin,
Andrea Walther,
Simon W. Funke,
Gerard Gorman
Abstract:
Inversion and PDE-constrained optimization problems often rely on solving the adjoint problem to calculate the gradient of the objec- tive function. This requires storing large amounts of intermediate data, setting a limit to the largest problem that might be solved with a given amount of memory available. Checkpointing is an approach that can reduce the amount of memory required by redoing parts…
▽ More
Inversion and PDE-constrained optimization problems often rely on solving the adjoint problem to calculate the gradient of the objec- tive function. This requires storing large amounts of intermediate data, setting a limit to the largest problem that might be solved with a given amount of memory available. Checkpointing is an approach that can reduce the amount of memory required by redoing parts of the computation instead of storing intermediate results. The Revolve checkpointing algorithm o ers an optimal schedule that trades computational cost for smaller memory footprints. Integrat- ing Revolve into a modern python HPC code and combining it with code generation is not straightforward. We present an API that makes checkpointing accessible from a DSL-based code generation environment along with some initial performance gures with a focus on seismic applications.
△ Less
Submitted 12 January, 2018;
originally announced February 2018.
-
Optimised finite difference computation from symbolic equations
Authors:
Michael Lange,
Navjot Kukreja,
Fabio Luporini,
Mathias Louboutin,
Charles Yount,
Jan Hückelheim,
Gerard J. Gorman
Abstract:
Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automa…
▽ More
Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automated execution of highly optimized stencil code from only a few lines of high-level symbolic Python for a set of scientific equations, before exploring the use of Devito operators in seismic inversion problems.
△ Less
Submitted 12 July, 2017;
originally announced July 2017.
-
Devito: Towards a generic Finite Difference DSL using Symbolic Python
Authors:
Michael Lange,
Navjot Kukreja,
Mathias Louboutin,
Fabio Luporini,
Felippe Vieira,
Vincenzo Pandolfo,
Paulius Velesko,
Paulius Kazakas,
Gerard Gorman
Abstract:
Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain…
▽ More
Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems.
△ Less
Submitted 12 September, 2016;
originally announced September 2016.
-
Devito: automated fast finite difference computation
Authors:
Navjot Kukreja,
Mathias Louboutin,
Felippe Vieira,
Fabio Luporini,
Michael Lange,
Gerard Gorman
Abstract:
Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance opti- mization on different computer architectures. Although a large number of stencil languages are available, finite differ- ence domain specific languages have proved challenging to design because most practical use cases requi…
▽ More
Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance opti- mization on different computer architectures. Although a large number of stencil languages are available, finite differ- ence domain specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain specific language in which high level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform compa- rably or better than hand-tuned code. All this is transpar- ent to users, who only see concise symbolic mathematical expressions.
△ Less
Submitted 10 October, 2016; v1 submitted 30 August, 2016;
originally announced August 2016.
-
Performance prediction of finite-difference solvers for different computer architectures
Authors:
Mathias Louboutin,
Michael Lange,
Felix Herrmann,
Navjot Kukreja,
Gerard Gorman
Abstract:
The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization, development of a correct (verified) implementation, and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlene…
▽ More
The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization, development of a correct (verified) implementation, and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlenecks of a PDE solver are fully understood, and the precise details varies between different computer architectures. One way to mitigate this issue is to establish a reliable performance model that allows a numerical analyst to make reliable predictions of how well a numerical method would perform on a given computer architecture, before embarking upon potentially long and expensive implementation and optimization phases. The availability of a reliable performance model also saves developer effort as it both informs the developer on what kind of optimisations are beneficial, and when the maximum expected performance has been reached and optimisation work should stop. We show how discretization of a wave equation can be theoretically studied to understand the performance limitations of the method on modern computer architectures. We focus on the roofline model, now broadly used in the high-performance computing community, which considers the achievable performance in terms of the peak memory bandwidth and peak floating point performance of a computer with respect to algorithmic choices. A first principles analysis of operational intensity for key time-step** finite-difference algorithms is presented. With this information available at the time of algorithm design, the expected performance on target computer systems can be used as a driver for algorithm design.
△ Less
Submitted 18 April, 2017; v1 submitted 13 August, 2016;
originally announced August 2016.