-
Lost in Interpretation: Navigating Challenges in Validating Persistency Models Amid Vague Specs and Stubborn Machines, All with a Sense of Humour
Authors:
Vasileios Klimis,
Alastair F. Donaldson,
Viktor Vafeiadis,
John Wickerson,
Azalea Raad
Abstract:
Memory persistency models provide a foundation for persistent programming by specifying which (and when) writes to non-volatile memory (NVM) become persistent. Memory persistency models for the Intel-x86 and Arm architectures have been formalised, but not empirically validated against real machines. Traditional validation methods %such as %extensive litmus testing used for memory \emph{consistency…
▽ More
Memory persistency models provide a foundation for persistent programming by specifying which (and when) writes to non-volatile memory (NVM) become persistent. Memory persistency models for the Intel-x86 and Arm architectures have been formalised, but not empirically validated against real machines. Traditional validation methods %such as %extensive litmus testing used for memory \emph{consistency} models do not straightforwardly apply because a test program cannot directly observe when its data has become persistent: it cannot distinguish between reading data from a volatile cache and from NVM. We investigate addressing this challenge using a commercial off-the-shelf device that intercepts data on the memory bus and logs all writes in the order they reach the memory. Using this technique we conducted a litmus-testing campaign aimed at empirically validating the persistency guarantees of Intel-x86 and Arm machines. We observed writes propagating to memory out of order, and took steps to build confidence that these observations were not merely artefacts of our testing setup. However, despite gaining high confidence in the trustworthiness of our observation method, our conclusions remain largely negative. We found that the Intel-x86 architecture is not amenable to our approach, and on consulting Intel engineers discovered that there are currently no reliable methods of validating their persistency guarantees. For Arm, we found that even a machine recommended to us by a persistency expert at Arm did not match the formal Arm persistency model, due to a loophole in the specification.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Intel PMDK Transactions: Specification, Validation and Concurrency (Extended Version)
Authors:
Azalea Raad,
Ori Lahav,
John Wickerson,
Piotr Balcer,
Brijesh Dongol
Abstract:
Software Transactional Memory (STM) is an extensively studied paradigm that provides an easy-to-use mechanism for thread safety and concurrency control. With the recent advent of byte-addressable persistent memory, a natural question to ask is whether STM systems can be adapted to support failure atomicity. In this article, we answer this question by showing how STM can be easily integrated with I…
▽ More
Software Transactional Memory (STM) is an extensively studied paradigm that provides an easy-to-use mechanism for thread safety and concurrency control. With the recent advent of byte-addressable persistent memory, a natural question to ask is whether STM systems can be adapted to support failure atomicity. In this article, we answer this question by showing how STM can be easily integrated with Intel's Persistent Memory Development Kit (PMDK) transactional library (which we refer to as txPMDK) to obtain STM systems that are both concurrent and persistent. We demonstrate this approach using known STM systems, TML and NOrec, which when combined with txPMDK result in persistent STM systems, referred to as PMDK-TML and PMDK-NORec, respectively. However, it turns out that existing correctness criteria are insufficient for specifying the behaviour of txPMDK and our concurrent extensions. We therefore develop a new correctness criterion, dynamic durable opacity, that extends the previously defined notion of durable opacity with dynamic memory allocation. We provide a model of txPMDK, then show that this model satisfies dynamic durable opacity. Moreover, dynamic durable opacity supports concurrent transactions, thus we also use it to show correctness of both PMDK-TML and PMDK-NORec.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
View-Based Owicki-Gries Reasoning for Persistent x86-TSO (Extended Version)
Authors:
Eleni Vafeiadi Bila,
Brijesh Dongol,
Ori Lahav,
Azalea Raad,
John Wickerson
Abstract:
The rise of persistent memory is disrupting computing to its core. Our work aims to help programmers navigate this brave new world by providing a program logic for reasoning about x86 code that uses low-level operations such as memory accesses and fences, as well as persistency primitives such as flushes. Our logic, Pierogi, benefits from a simple underlying operational semantics based on views, i…
▽ More
The rise of persistent memory is disrupting computing to its core. Our work aims to help programmers navigate this brave new world by providing a program logic for reasoning about x86 code that uses low-level operations such as memory accesses and fences, as well as persistency primitives such as flushes. Our logic, Pierogi, benefits from a simple underlying operational semantics based on views, is able to handle optimised flush operations, and is mechanised in the Isabelle/HOL proof assistant. We detail the proof rules of Pierogi and prove them sound. We also show how Pierogi can be used to reason about a range of challenging single- and multi-threaded persistent programs.
△ Less
Submitted 15 January, 2022;
originally announced January 2022.
-
Specifying and Testing GPU Workgroup Progress Models
Authors:
Tyler Sorensen,
Lucas F. Salvador,
Harmit Raval,
Hugues Evrard,
John Wickerson,
Margaret Martonosi,
Alastair F. Donaldson
Abstract:
As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications say almost nothing…
▽ More
As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model.
This work is a collection of tools experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a constraint for concurrent programs that require relative forward progress to terminate. Using this constraint, we synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, this allows us to determine the expected status of each litmus test -- i.e. whether it is guaranteed eventual termination -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, an intuitive progress model defined by prior work and hypothesized to describe the workgroup schedulers of existing GPUs.
△ Less
Submitted 13 September, 2021;
originally announced September 2021.
-
ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient Iterative Compute
Authors:
He Li,
James J. Davis,
John Wickerson,
George A. Constantinides
Abstract:
Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the…
▽ More
Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and approximant index increasing in lockstep. Consequently, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. Use of most-significant digit-first arithmetic additionally allows us to declare certain digits to be stable at runtime, avoiding their recalculation in subsequent iterations and thereby increasing performance and decreasing memory footprints. Versus arbitrary-precision iterative solvers without the optimisations we detail herein, we achieve up-to 16$\times$ performance speedups and 1.9x memory savings for the evaluated benchmarks.
△ Less
Submitted 1 October, 2019;
originally announced October 2019.
-
The Semantics of Transactions and Weak Memory in x86, Power, ARM, and C++
Authors:
Nathan Chong,
Tyler Sorensen,
John Wickerson
Abstract:
Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models…
▽ More
Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models.
Our work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models (x86, Power, ARMv8, and C++) with new rules for TM. Our formal models are backed by automated tooling that enables (1) the synthesis of tests for validating our models against existing implementations and (2) the model-checking of TM-related transformations, such as lock elision and compiling C++ transactions to hardware. A key finding is that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.
△ Less
Submitted 16 April, 2018; v1 submitted 13 October, 2017;
originally announced October 2017.
-
Overhauling SC Atomics in C11 and OpenCL
Authors:
Mark Batty,
Alastair F. Donaldson,
John Wickerson
Abstract:
Despite the conceptual simplicity of sequential consistency (SC), the semantics of SC atomic operations and fences in the C11 and OpenCL memory models is subtle, leading to convoluted prose descriptions that translate to complex axiomatic formalisations. We conduct an overhaul of SC atomics in C11, reducing the associated axioms in both number and complexity. A consequence of our simplification is…
▽ More
Despite the conceptual simplicity of sequential consistency (SC), the semantics of SC atomic operations and fences in the C11 and OpenCL memory models is subtle, leading to convoluted prose descriptions that translate to complex axiomatic formalisations. We conduct an overhaul of SC atomics in C11, reducing the associated axioms in both number and complexity. A consequence of our simplification is that the SC operations in an execution no longer need to be totally ordered. This relaxation enables, for the first time, efficient and exhaustive simulation of litmus tests that use SC atomics. We extend our improved C11 model to obtain the first rigorous memory model formalisation for OpenCL (which extends C11 with support for heterogeneous many-core programming). In the OpenCL setting, we refine the SC axioms still further to give a sensible semantics to SC operations that employ a 'memory scope' to restrict their visibility to specific threads. Our overhaul requires slight strengthenings of both the C11 and the OpenCL memory models, causing some behaviours to become disallowed. We argue that these strengthenings are natural, and that all of the formalised C11 and OpenCL compilation schemes of which we are aware (Power and x86 CPUs for C11, AMD GPUs for OpenCL) remain valid in our revised models. Using the Herd memory model simulator, we show that our overhaul leads to an exponential improvement in simulation time for C11 litmus tests compared with the original model, making exhaustive simulation competitive, time-wise, with the non-exhaustive CDSChecker tool.
△ Less
Submitted 16 November, 2016; v1 submitted 24 March, 2015;
originally announced March 2015.