Search | arXiv e-print repository

arXiv:2009.12623 [pdf, other]

Lossy Checkpoint Compression in Full Waveform Inversion: a case study with ZFPv0.5.5 and the Overthrust Model

Authors: Navjot Kukreja, Jan Hueckelheim, Mathias Louboutin, John Washbourne, Paul H. J. Kelly, Gerard J. Gorman

Abstract: This paper proposes a new method that combines check-pointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transf… ▽ More This paper proposes a new method that combines check-pointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution. △ Less

Submitted 15 September, 2021; v1 submitted 26 September, 2020; originally announced September 2020.

arXiv:1907.02818 [pdf, other]

doi 10.1145/3337821.3337906

Automatic Differentiation for Adjoint Stencil Loops

Authors: Jan Hückelheim, Navjot Kukreja, Sri Hari Krishna Narayanan, Fabio Luporini, Gerard Gorman, Paul Hovland

Abstract: Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoin… ▽ More Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications. △ Less

Submitted 5 July, 2019; originally announced July 2019.

Comments: ICPP 2019

arXiv:1905.12795 [pdf, other]

Towards automatically building starting models for full-waveform inversion using global optimization methods: A PSO approach via DEAP + Devito

Authors: Oscar F. Mojica, Navjot Kukreja

Abstract: In this work, we illustrate an example of estimating the macro-model of velocities in the subsurface through the use of global optimization methods (GOMs). The optimization problem is solved using DEAP (Distributed Evolutionary Algorithms in Python) and Devito, python frameworks for evolutionary and automated finite difference computations, respectively. We implement a Particle swarm optimization… ▽ More In this work, we illustrate an example of estimating the macro-model of velocities in the subsurface through the use of global optimization methods (GOMs). The optimization problem is solved using DEAP (Distributed Evolutionary Algorithms in Python) and Devito, python frameworks for evolutionary and automated finite difference computations, respectively. We implement a Particle swarm optimization (PSO) with an "elitism strategy" on top of DEAP, leveraging its transparent, simple and coherent environment for implementing of evolutionary algorithms (EAs). The high computational effort, due to the huge number of cost function evaluations (each one demanding a forward modeling step) required by PSO, is alleviated through the use of Devito as well as through parallelization with Dask. The combined use of these frameworks yields not only an efficient way of providing acoustic macro models of the P-wave velocity field (Vp), but also significantly reduces the amount of human effort in fulfilling this task. △ Less

Submitted 29 May, 2019; originally announced May 2019.

arXiv:1903.03051 [pdf, other]

Training on the Edge: The why and the how

Authors: Navjot Kukreja, Alena Shilova, Olivier Beaumont, Jan Huckelheim, Nicola Ferrier, Paul Hovland, Gerard Gorman

Abstract: Edge computing is the natural progression from Cloud computing, where, instead of collecting all data and processing it centrally, like in a cloud computing environment, we distribute the computing power and try to do as much processing as possible, close to the source of the data. There are various reasons this model is being adopted quickly, including privacy, and reduced power and bandwidth req… ▽ More Edge computing is the natural progression from Cloud computing, where, instead of collecting all data and processing it centrally, like in a cloud computing environment, we distribute the computing power and try to do as much processing as possible, close to the source of the data. There are various reasons this model is being adopted quickly, including privacy, and reduced power and bandwidth requirements on the Edge nodes. While it is common to see inference being done on Edge nodes today, it is much less common to do training on the Edge. The reasons for this range from computational limitations, to it not being advantageous in reducing communications between the Edge nodes. In this paper, we explore some scenarios where it is advantageous to do training on the Edge, as well as the use of checkpointing strategies to save memory. △ Less

Submitted 13 February, 2019; originally announced March 2019.

Comments: Submitted to PAISE 2019

arXiv:1810.05268 [pdf, other]

Combining Checkpointing and Data Compression to Accelerate Adjoint-Based Optimization Problems

Authors: Navjot Kukreja, Jan Hueckelheim, Mathias Louboutin, Fabio Luporini, Paul Hovland, Gerard Gorman

Abstract: Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected point… ▽ More Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected points in time, and values at other times are recomputed as needed from the last stored state. This allows arbitrarily large adjoint computations with limited memory, at the cost of additional recomputations. In this paper, we combine compression and checkpointing for the first time to compute a realistic seismic inversion. The combination of checkpointing and compression allows larger adjoint computations compared to using only compression, and reduces the recomputation overhead significantly compared to using only checkpointing. △ Less

Submitted 20 September, 2021; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: Accepted in European Conference on Parallel Proessing (EuroPar) 2019. Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)

arXiv:1808.01995 [pdf, other]

doi 10.5194/gmd-12-1165-2019

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration

Authors: Mathias Louboutin, Michael Lange, Fabio Luporini, Navjot Kukreja, Philipp A. Witte, Felix J. Herrmann, Paulius Velesko, Gerard J. Gorman

Abstract: We introduce Devito, a new domain-specific language for implementing high-performance finite difference partial differential equation solvers. The motivating application is exploration seismology where methods such as Full-Waveform Inversion and Reverse-Time Migration are used to invert terabytes of seismic data to create images of the earth's subsurface. Even using modern supercomputers, it can t… ▽ More We introduce Devito, a new domain-specific language for implementing high-performance finite difference partial differential equation solvers. The motivating application is exploration seismology where methods such as Full-Waveform Inversion and Reverse-Time Migration are used to invert terabytes of seismic data to create images of the earth's subsurface. Even using modern supercomputers, it can take weeks to process a single seismic survey and create a useful subsurface image. The computational cost is dominated by the numerical solution of wave equations and their corresponding adjoints. Therefore, a great deal of effort is invested in aggressively optimizing the performance of these wave-equation propagators for different computer architectures. Additionally, the actual set of partial differential equations being solved and their numerical discretization is under constant innovation as increasingly realistic representations of the physics are developed, further ratcheting up the cost of practical solvers. By embedding a domain-specific language within Python and making heavy use of SymPy, a symbolic mathematics library, we make it possible to develop finite difference simulators quickly using a syntax that strongly resembles the mathematics. The Devito compiler reads this code and applies a wide range of analysis to generate highly optimized and parallel code. This approach can reduce the development time of a verified and optimized solver from months to days. △ Less

Submitted 9 August, 2019; v1 submitted 6 August, 2018; originally announced August 2018.

Journal ref: https://www.geosci-model-dev.net/12/1165/2019/

arXiv:1807.03032 [pdf, other]

Architecture and performance of Devito, a system for automated stencil computation

Authors: Fabio Luporini, Michael Lange, Mathias Louboutin, Navjot Kukreja, Jan Hückelheim, Charles Yount, Philipp Witte, Paul H. J. Kelly, Felix J. Herrmann, Gerard J. Gorman

Abstract: Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process… ▽ More Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process---from mathematical equations down to C++ code---is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications. △ Less

Submitted 7 February, 2020; v1 submitted 9 July, 2018; originally announced July 2018.

Comments: Submitted to ACM Transactions on Mathematical Software

MSC Class: 65N06; 68N20

arXiv:1806.01117 [pdf, other]

Backpropagation for long sequences: beyond memory constraints with constant overheads

Authors: Navjot Kukreja, Jan Hückelheim, Gerard J. Gorman

Abstract: Naive backpropagation through time has a memory footprint that grows linearly in the sequence length, due to the need to store each state of the forward propagation. This is a problem for large networks. Strategies have been developed to trade memory for added computations, which results in a sublinear growth of memory footprint or computation overhead. In this work, we present a library that uses… ▽ More Naive backpropagation through time has a memory footprint that grows linearly in the sequence length, due to the need to store each state of the forward propagation. This is a problem for large networks. Strategies have been developed to trade memory for added computations, which results in a sublinear growth of memory footprint or computation overhead. In this work, we present a library that uses asynchronous storing and prefetching to move data to and from slow and cheap stor- age. The library only stores and prefetches states as frequently as possible without delaying the computation, and uses the optimal Revolve backpropagation strategy for the computations in between. The memory footprint of the backpropagation can thus be reduced to any size (e.g. to fit into DRAM), while the computational overhead is constant in the sequence length, and only depends on the ratio between compute and transfer times on a given hardware. We show in experiments that by exploiting asyncronous data transfer, our strategy is always at least as fast, and usually faster than the previously studied "optimal" strategies. △ Less

Submitted 22 May, 2018; originally announced June 2018.

arXiv:1802.02474 [pdf, other]

High-level python abstractions for optimal checkpointing in inversion problems

Authors: Navjot Kukreja, Jan Hückelheim, Michael Lange, Mathias Louboutin, Andrea Walther, Simon W. Funke, Gerard Gorman

Abstract: Inversion and PDE-constrained optimization problems often rely on solving the adjoint problem to calculate the gradient of the objec- tive function. This requires storing large amounts of intermediate data, setting a limit to the largest problem that might be solved with a given amount of memory available. Checkpointing is an approach that can reduce the amount of memory required by redoing parts… ▽ More Inversion and PDE-constrained optimization problems often rely on solving the adjoint problem to calculate the gradient of the objec- tive function. This requires storing large amounts of intermediate data, setting a limit to the largest problem that might be solved with a given amount of memory available. Checkpointing is an approach that can reduce the amount of memory required by redoing parts of the computation instead of storing intermediate results. The Revolve checkpointing algorithm o ers an optimal schedule that trades computational cost for smaller memory footprints. Integrat- ing Revolve into a modern python HPC code and combining it with code generation is not straightforward. We present an API that makes checkpointing accessible from a DSL-based code generation environment along with some initial performance gures with a focus on seismic applications. △ Less

Submitted 12 January, 2018; originally announced February 2018.

arXiv:1707.03776 [pdf, other]

Optimised finite difference computation from symbolic equations

Authors: Michael Lange, Navjot Kukreja, Fabio Luporini, Mathias Louboutin, Charles Yount, Jan Hückelheim, Gerard J. Gorman

Abstract: Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automa… ▽ More Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automated execution of highly optimized stencil code from only a few lines of high-level symbolic Python for a set of scientific equations, before exploring the use of Devito operators in seismic inversion problems. △ Less

Submitted 12 July, 2017; originally announced July 2017.

Comments: Accepted for publication in Proceedings of the 16th Python in Science Conference (SciPy 2017)

arXiv:1609.03361 [pdf, other]

Devito: Towards a generic Finite Difference DSL using Symbolic Python

Authors: Michael Lange, Navjot Kukreja, Mathias Louboutin, Fabio Luporini, Felippe Vieira, Vincenzo Pandolfo, Paulius Velesko, Paulius Kazakas, Gerard Gorman

Abstract: Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain… ▽ More Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems. △ Less

Submitted 12 September, 2016; originally announced September 2016.

Comments: pyHPC 2016 conference submission

arXiv:1608.08658 [pdf, other]

Devito: automated fast finite difference computation

Authors: Navjot Kukreja, Mathias Louboutin, Felippe Vieira, Fabio Luporini, Michael Lange, Gerard Gorman

Abstract: Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance opti- mization on different computer architectures. Although a large number of stencil languages are available, finite differ- ence domain specific languages have proved challenging to design because most practical use cases requi… ▽ More Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance opti- mization on different computer architectures. Although a large number of stencil languages are available, finite differ- ence domain specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain specific language in which high level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform compa- rably or better than hand-tuned code. All this is transpar- ent to users, who only see concise symbolic mathematical expressions. △ Less

Submitted 10 October, 2016; v1 submitted 30 August, 2016; originally announced August 2016.

Comments: Accepted at WolfHPC 2016

arXiv:1608.03984 [pdf, other]

doi 10.1016/j.cageo.2017.04.014

Performance prediction of finite-difference solvers for different computer architectures

Authors: Mathias Louboutin, Michael Lange, Felix Herrmann, Navjot Kukreja, Gerard Gorman

Abstract: The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization, development of a correct (verified) implementation, and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlene… ▽ More The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization, development of a correct (verified) implementation, and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlenecks of a PDE solver are fully understood, and the precise details varies between different computer architectures. One way to mitigate this issue is to establish a reliable performance model that allows a numerical analyst to make reliable predictions of how well a numerical method would perform on a given computer architecture, before embarking upon potentially long and expensive implementation and optimization phases. The availability of a reliable performance model also saves developer effort as it both informs the developer on what kind of optimisations are beneficial, and when the maximum expected performance has been reached and optimisation work should stop. We show how discretization of a wave equation can be theoretically studied to understand the performance limitations of the method on modern computer architectures. We focus on the roofline model, now broadly used in the high-performance computing community, which considers the achievable performance in terms of the peak memory bandwidth and peak floating point performance of a computer with respect to algorithmic choices. A first principles analysis of operational intensity for key time-step** finite-difference algorithms is presented. With this information available at the time of algorithm design, the expected performance on target computer systems can be used as a driver for algorithm design. △ Less

Submitted 18 April, 2017; v1 submitted 13 August, 2016; originally announced August 2016.

Journal ref: Computer & Geoscience (2017) 148-157; Volume 105

Showing 1–13 of 13 results for author: Kukreja, N