Parsimonious Optimal Dynamic Partial Order Reduction for Interleaving-Based Concurrency
Abstract
Stateless model checking is a fully automatic verification technique for concurrent programs that checks for safety violations by exploring all possible thread schedulings. It becomes effective when coupled with Dynamic Partial Order Reduction (DPOR), which introduces an equivalence on schedulings and reduces the amount of needed exploration. DPOR algorithms that are optimal are particularly effective in that they guarantee to explore exactly one execution from each equivalence class. Unfortunately, existing sequence-based optimal algorithms may in the worst case consume memory that is exponential in the size of the analyzed program. In this paper, we present Parsimonious-OPtimal DPOR (POP), an optimal DPOR algorithm for analyzing multi-threaded programs under sequential consistency, whose space consumption is polynomial in the worst case. POP combines several novel algorithmic techniques, including (i) a parsimonious race reversal strategy, which avoids multiple reversals of the same race, (ii) an eager race reversal strategy to avoid storing initial fragments of to-be-explored executions, and (iii) a space-efficient scheme for preventing redundant exploration, which replaces the use of sleep sets. Our implementation in Nidhugg shows that these techniques can significantly speed up the analysis of concurrent programs, and do so with low memory consumption. Comparison to a related optimal DPOR algorithm for a different representation of concurrent executions as graphs shows that POP has comparable worst-case performance for smaller benchmarks and outperforms the other one for larger programs.
1 Introduction
Testing and verification of multi-threaded programs is challenging, since it requires reasoning about all the ways in which operations executed by different threads can interfere. A successful technique for finding concurrency bugs in multithreaded programs and for verifying their absence is stateless model checking (SMC) [Godefroid:popl97]. Given a terminating program and fixed input data, SMC systematically explores the set of all thread schedulings that are possible during program runs. A dedicated runtime scheduler drives the SMC exploration by making decisions on scheduling whenever such choices may affect the interaction between threads. Given enough time, the exploration covers all possible executions and detects any unexpected program results, program crashes, or assertion violations. The technique is entirely automatic, has no false positives, does not consume excessive memory, and can reproduce the concurrency bugs it detects. SMC has been implemented in many tools (e.g., VeriSoft [Godefroid:verisoft-journal], Chess [MQBBNN:chess], Concuerror [Concuerror:ICST13], Nidhugg [tacas15:tso], rInspect [DBLP:conf/pldi/ZhangKW15], CDSChecker [NoDe:toplas16], RCMC [KLSV:popl18], and GenMC [GenMC@CAV-21]), and successfully applied to realistic programs (e.g., [GoHaJa:heartbeat] and [KoSa:sttt19]).
To reduce the number of explored executions, SMC tools typically employ dynamic partial order reduction (DPOR) [FG:dpor, abdulla2014optimal, KLSV:popl18]. DPOR defines an equivalence relation on executions, typically Mazurkiewicz trace equivalence [Mazurkiewicz:traces], which preserves many important correctness properties, such as reachability of local states and assertion violations, and explores at least one execution in each equivalence class. Thus, to analyze a program, it suffices to explore one execution from each equivalence class. DPOR was originally developed for models of concurrency [FG:dpor], where executions are expressed as sequences of interactions between threads/processes and shared objects. Subsequently, sequence-based DPOR has been adapted and refined to a number of programming models, including actor programs [TKLLMA:forte12], abstract computational models [ProbeSets@CONCUR-08], event driven programs [Event-DrivenSMC@OOPSLA-15, Maiya:tacas16, atva23], and MPI programs [DPOR:MPI]; it has been extended with features for efficiently handling spinloops and blocking constructs [Godot@FMCAD-22], and been adapted for weak concurrency memory models, such as TSO and PSO [DBLP:conf/pldi/ZhangKW15, tacas15:tso]. DPOR has also been adapted for weak memory models by representing executions as graphs, where nodes represent read and write operations, and edges represent reads-from and coherence relations; this allows the algorithm to be parametric on a specific memory model, at the cost of calling a memory-model oracle [KLSV:popl18, GenMC@PLDI-19]
An important improvement has been the introduction of optimal DPOR algorithms, which are efficient in that they guarantee to explore exactly one execution from each equivalence class. The first optimal DPOR algorithm was designed for the sequence-based representation [abdulla2014optimal]. Subsequently, optimal DPOR algorithms for even weaker equivalences than Mazurkiewicz trace equivalence have been developed [observers, rfsc@OOPSLA-19, OptimalDPOR@JSS-23]. In some DPOR algorithms [abdulla2014optimal, observers, OptimalDPOR@JSS-23], optimality comes at the price of added memory consumption which in the worst case can be exponential in the size of the program. Even though most benchmarks in the literature show a modest memory overhead as the price for optimality, it would be desirable to have an optimal DPOR algorithm whose memory consumption is guaranteed to be polynomial in the size of the program [optimal-dpor-jacm]. Such an algorithm, called TruSt [TruSt@POPL-22], was recently presented, but for a graph-based setting [GenMC@PLDI-19]. It would be desirable to develop a polynomial-space optimal DPOR algorithm also for sequence-based settings. One reason is that a majority of past work on DPOR is sequence-based; hence such an algorithm could be adapted to various programming models and features, some of which were recalled above. Another reason is that sequence-based models represent computations adhering to sequential consistency (SC) and TSO more naturally than graph-based models. For SC, representing executions as sequences of events makes executions consistent by construction and alleviates the need to resort to a potentially expensive SC oracle.
In this paper, we present the Parsimonious-OPtimal DPOR (POP) algorithm for analyzing multi-threaded programs under SC (Section 4). POP is designed for programs in which threads interact by atomic reads, writes, and RMWs to shared variables, and combines several novel algorithmic techniques.
-
•
A parsimonious race reversal technique (Section 4.1), which considers a race if and only if its reversal will generate a previously unexplored execution; in contrast, most existing DPOR algorithms reverse races indiscriminately, only to thereafter discard redundant reversals (e.g., by sleep sets or similar mechanisms).
-
•
An eager race reversal strategy (LABEL:sec:pop-algo), which immediately starts exploration of the new execution resulting from a race reversal; this prevents accumulation of a potentially exponential number of execution fragments generated by race reversals.
-
•
In order to avoid exploring several executions in the same equivalence class, a naïve realization of POP would employ an adaptation of sleep sets [Godefroid:thesis]. However, these can in the worst case become exponentially large. Therefore, POP employs a parsimonious characterization of sleep sets (Section 4.3): instead of representing the elements of the sleep set, POP uses a characterization of them, which allows to detect and prevent redundant exploration, and uses at most polynomial space. This sleep set characterization is uniquely computed from its generating race, implying that explorations of different executions share no state, thereby making POP suitable for parallelization.
We prove (in the appendices of the longer version) that the POP algorithm is correct (explores at least one execution in each equivalence class), optimal (explores exactly one execution in each equivalence class), does not suffer from blocked explorations, and requires only polynomial size memory.
We have implemented POP DPOR in an extension of the Nidhugg tool [tacas15:tso]. Using a wide variety of benchmarks (Section 6), we show that POP’s implementation indeed has its claimed properties, it always outperforms Optimal DPOR, and offers performance which is on par with TruSt, the state-of-the-art graph-based DPOR algorithm. Moreover, by being sequence-based, it scales much better than TruSt on programs with long executions.
2 Main Concepts
Initially: x = y = z = 0
In this section, we informally present the core principles of our approach, in particular the three novel algorithmic techniques of parsimonious race reversal, eager race reversal, and parsimonious static representation of sleep sets, along with how they relate to previous sequence-based DPOR algorithms, on a simple example, shown in Fig. 1. In this code, four threads () access three shared variables (x, y, z), using six thread-local registers ().111Throughout this paper, we assume that threads are spawned by a main thread, and that all shared variables get initialized to , also by the main thread. DPOR algorithms typically first explore an arbitrary execution, which is then inspected to detect races. Assume that the first execution is (cf. the tree in Fig. 1). To detect races in an execution , one must first compute its happens-before order, denoted , which is the smallest transitive relation that orders two events that are (i) in the same thread, or (ii) access a common shared variable and at least one of them is a write. A race consists of two events in different threads that are adjacent in the order. The execution contains three races (red arcs left of in Fig. 1). Let us first consider the race on y, in which the first event is y = 1 and the second is = y. For each race, a DPOR algorithm constructs an initial fragment of an alternative execution, which reverses the race and branches off from the explored execution just before the race. In early DPOR algorithms, including the “classic” DPOR algorithm by Flanagan and Godefroid [FG:dpor] and the Source DPOR algorithm of Abdulla et al. [abdulla2014optimal], this initial fragment consists of just one event which can initiate an execution which reverses the race. For this race, they would store = z as an alternative to y = 1, since = z must be performed before the racing access to y (i.e., = y). Storing only the first event of an alternative execution does not consume much space, but has the problem that the execution after that event ( = z in our case) is not controlled, and may deviate from the path towards the racing event ( = y in our case), potentially leading to redundant exploration. Some DPOR algorithms, including Optimal DPOR and POP, address this problem by constructing the initial fragment as a sequence of events that lead to the second event, = y in this case. POP constructs a minimal such sequence (called a schedule) consisting of the events that happen before (in the order) the second event of the race while omitting the first event of the race. The schedule in this case is , which is inserted as an alternative continuation after x = 1, marked purple in Fig. 1. This continuation can be extended to a maximal execution .
Let us next consider the races on x in ; their first event is x = 1. For the race whose second event is = x, POP constructs the schedule = x (there are no additional events happening before = x), which is inserted as an initial fragment of an alternative execution, which can later be extended to (cf. Fig. 1). For the race whose second event is = x, POP constructs the schedule = x, which is also inserted as an initial fragment of an alternative execution.
Parsimonious Race Reversals:
To illustrate POP’s mechanism for reversing each race only once, let us next consider the races in execution . There is one race on y, between = y and y = 1. However, this race should not be reversed since it would give back the event order in . POP achieves this by forbidding to consider races whose first event (in this case = y) is in some schedule: reversing a race whose first event is in a schedule yields a fragment that was already explored in some previous execution. The execution also exhibits three races on x, all including x = 1. The race with = x is new, but the two other races have already occurred in . These two races should therefore not be considered, since the schedules they would generate have already been generated from the corresponding races in . POP achieves this by forbidding to consider races whose second event is not fresh. Intuitively, a fresh event is an event which did not occur with the same predecessors in a previously explored execution (and whose reversal will therefore generate a schedule that did not appear before). More precisely, a second event of a race is fresh if it happens-after (in the order) the last event of each schedule that appears between the two racing events. Returning to the three races on x in , the only second event which is fresh is = x. Therefore, only this race is considered; it generates the schedule , which is inserted as the start of an alternative exploration (top right in Fig. 1).
Eager Race Reversal Strategy:
POP as described so far can in principle be implemented so that the schedules constructed as alternative continuations of an event are all collected before they are explored. However, such a strategy can in the worst case consume memory that is exponential in the program size. The reason is that for some programs, the number of schedules that branch off at a particular point in an execution may become exponential in the size of the program; this was first observed by Abdulla et al. [optimal-dpor-jacm, Sect. 9]; an illustrating shared-variable program is given in [TruSt@POPL-22, Sect. 2.3]. POP avoids this problem by exploring schedules eagerly: immediately after the creation of a schedule, exploration switches to continuations of that schedule. This policy can be realized by an algorithm that calls a recursive function to initiate exploration of a new schedule. Just for illustration, we show a simple such algorithm in LABEL:sec:simple-dpor, which however still has worst-case exponential memory consumption, due to a problem that is solved by our parsimonious characterization of sleep sets below. We establish, in Lemma 1, that the recursion depth of such an algorithm is at most , where is the length longest execution of the analyzed program.
Parsimonious Characterization of Sleep Sets:
Even though the parsimonious race reversal strategy guarantees that the initial fragments of alternative executions are inequivalent, one must prevent that their continuations become equivalent. To illustrate this problem, let us return to the reversal of races on x in Fig. 1 (in both and ). The top leftmost event x = 1 races with three fresh reads on x: two in and one in , shown in teal color. From these races, three schedules are constructed as initial fragments of alternative explorations: = x and = x from and from . At this point, we need to be careful: these schedules are not conflicting, since they all read from their common shared variable x. There is then a danger that the first schedule will be continued using the second, and the second will be continued using the first; we would then explore two equivalent executions, consisting of these two schedules in either order. The standard DPOR technique for avoiding such redundant exploration is sleep sets [Godefroid:thesis]. In its standard form, a sleep set is a set of events that should not be performed before some conflicting event. Since POP uses schedules as beginnings of alternative explorations, the appropriate adaptation would be to let a sleep set be a set of schedules that should not be performed unless some conflicting event is performed before that. In Fig. 1, this would mean that after exploring the continuations of = x, this schedule is added to the sleep set when starting to explore the continuations of = x, which is thereafter also added to the sleep set when starting exploration from from . This mechanism is simple to combine with eager exploration of schedules; in fact, we show a pseudocode version of it in LABEL:sec:simple-dpor. Unfortunately, there are programs where the number of schedules that conflict with a write event is exponential, whence the memory consumption may become exponential in the size of the program, because of large sleep sets. POP avoids this problem by a parsimonious characterization of sleep sets, which consumes memory that is linear in the size of the program. We describe it below.
Consider the schedules that are generated from races with x = 1 in and (the racing read events are colored teal). Now order these schedules in post-order w.r.t. the position of the racing events in the exploration tree; in this case = x is first, = x is second, and is third. POP’s parsimonious sleep set characterization then enforces the policy that an exploration starting with one of these schedules must not explore a schedule that succeeds it in this post-order. These succeeding schedules are thus implicitly in a sleep set, which is parsimoniously characterized using linear space, even though the number of succeeding schedules may be exponential. Let us explain how this is done for the first schedule = x. The schedules that must be avoided in explorations starting with = x are of two kinds:
-
1.
Schedules that precede it in the execution where = x is fresh (in this case = x); POP puts all such schedules into a sleep set which will accompany the exploration starting with = x.
-
2.
Schedules that arise from fresh read events in executions that branch off from to its right (in this case ). Because such a read must be fresh, its schedule must be in conflict with the execution from which it branches off ( in this case). To detect such reads, POP creates a conflict detector for the sequence , where is the sequence of events between the racing events. Such a conflict detector is a mechanism that for each event in an execution checks whether it conflicts with or happens-after an event that conflicts with . It is represented as a pair of form , where is a sequence of events and is a set of threads and accesses, initialized to , which will accompany the exploration starting with = x. During that exploration, will be the subsequence of that has not been exhibited in the exploration and against which conflicts must still be detected, and is the set of threads and accesses that have already been observed to be in conflict with .
Let us illustrate the working of the sleep set and conflict detector on the exploration of in Fig. 1. When the exploration from = x starts, the sleep set is initialized with , and the conflict detector with . For selecting the next event, the sleep set prevents performing = x before any write on x. The next event = z in is not in conflict with either the sequence in the sleep set or in the conflict detector; in this case = z is removed from these sequences if present, which in this case has no effect on them. The next event = y is not in conflict with the sleep set, but is in conflict with the conflict detector. Therefore, the conflict detector is updated to , which from now on prevents a read on x by thread , or happening after a write on y, implying that the event = x cannot be performed next. Therefore the only non-blocked event is now x = 1; since it is a write on x, the sleep set is emptied and the conflict detector is removed, allowing continuation of the exploration.
The sleep set and conflict detector generated from a particular race can be represented using linear space. At each prefix, the exploration may need to maintain sleep sets and conflict detectors of several preceding races, so that their total size is at most quadratic. By the previous observation that POP keeps at most a quadratic number of executions alive at any point in time, it follows that the space requirement for POP is polynomial. Details are in Theorem 5.2.
3 Programs, Executions, and Equivalence
We consider programs consisting of a finite set of threads that share a finite set of (shared) variables. Each thread has a finite set of local registers and runs a deterministic code, built in a standard way from expressions (over local registers) and atomic commands, using standard control flow constructs (sequential composition, selection, and bounded loop constructs). Atomic commands either write the value of an expression to a shared variable, or assign the value of a shared variable to a register, or can atomically both read and modify a shared variable. Conditional control flow constructs can branch on the value of an expression. From here on, we use to range over threads, and to range over shared variables. The local state of a thread is defined as usual by its program counter and the contents of its registers. The global state of a program consists of the local state of each thread together with the valuation of the shared variables. The program has a unique initial state, in which shared variables have predefined initial values. We assume that memory is sequentially consistent.
The execution of a program statement is an event, which affects or is affected by the global state of the program. An event is represented by a tuple , where is the thread performing the event, is a positive integer, denoting that the event results from the -th execution step in thread . is the type of the event (either for read or for write and read-modify-write), and is the accessed variable. If is the event , we write for , for , and for . An access is a pair consisting of a type and a variable. We write for . We say that two accesses and are dependent, denoted , if and at least one of and is . We say that two events and are dependent, denoted , if or . As is customary in DPOR algorithms, we can let an event represent the combined effect of a sequence of statements, if at most one of them accesses a shared variable.
An execution sequence (or just execution) is a finite sequence of events, starting from the initial state of the program. We let denote the set of events that can be performed in the state to which leads. An execution is maximal if . We let denote the set of events in ; we also write to denote . We use and , possibly with superscripts, to range over sequences of events (not necessarily starting from the initial state), to denote the empty sequence, and to denote the sequence with only the event . We let denote the concatenation of sequences and , and let denote the sequence with the first occurrence of (if any) removed. For a sequence , we let denote .
The basis for a DPOR algorithm is an equivalence relation on the set of execution sequences. The definition of this equivalence is based on a happens-before relation on the events of each execution sequence, which captures the data and control dependencies that must be respected by any equivalent execution.
Definition 1 (Happens-before)
Given an execution sequence , we define the happens-before relation on , denoted , as the smallest irreflexive partial order on such that if occurs before in , and .
The -trace (or trace for short) of is the directed graph .
Definition 2 (Equivalence)
Two execution sequences and are equivalent, denoted , if they have the same -trace. We let denote the equivalence class of .
The equivalence relation partitions the set of execution sequences into equivalence classes, paving the way for an optimal DPOR algorithm which explores precisely one execution in each equivalence class.
4 Designing the DPOR Algorithm
In this section, we explain the design of POP, which is optimal in the sense that it explores precisely one execution in each equivalence class defined by Definition 2. We first need some auxiliary definitions
Definition 3 (Compatible sequences and happens-before prefix)
For two execution sequences and ,
-
•
the sequences and are called compatible, denoted , iff there are sequences and s.t. ,
-
•
the sequence is called a happens-before prefix of , denoted , iff there is a sequence s.t. .
We illustrate the definition on the program in Fig. 1. Let be the sequence of events in , and be the sequence . Then since where is the sequence and is some preceding execution. Let next be the sequence , and be the sequence , then , since and ’s first accesses to y are in conflict.
Definition 4 (Schedule)
A sequence of events is called a schedule if all its events happen-before its last one, i.e., where is its last event, and is any other event in . The last event of a schedule is called the head of , sometimes denoted . We sometimes call a read schedule if is a read event. For an execution sequence and event , define the schedule to be the subsequence of such that (i) , and (ii) for each it holds that iff .
4.1 Parsimonious Race Reversals
A central mechanism of many DPOR algorithms is to detect and reverse races. Intuitively, a race is a conflict between two consecutive accesses to a shared variable, where at least one access writes to the variable (i.e., it is a write or a read-modify-write).
Definition 5 (Race)
Let be an execution sequence. Two events and in are racing in if (i) and are performed by different threads, (ii) . (iii) there is no event with .
Intuitively, a race arises when two different threads perform dependent accesses to a shared variable, which are adjacent in the order. If and are racing in , then to reverse the race, is decomposed as with in , thereafter the schedule is formed as the initial fragment of an alternative execution, which extends .
The key idea of parsimonious race reversal is to reverse a race only if such a reversal generates an execution that has not been explored before. To be able to do so, POP remembers whenever an event in a new execution is in a schedule, and whether it is a schedule head. This can be done, e.g., by marking events in schedules, and specifically marking the schedule head. From now on, we consider such markings to be included in the events of executions. They play an important role in selecting races.
Definition 6 (Fresh event)
For an execution , the event is called fresh in after if (i) if is in a schedule, then it is the head of that schedule, and (ii) for each head of a schedule in it is the case that .
Definition 7 (Parsimonious race)
Let be an execution sequence. Two events and in are in a parsimonious race, denoted if (i) and are racing in , (ii) is not in a schedule in , and (iii) is fresh in after , where
Conditions (ii) and (iii) are the additional conditions for a race to be parsimonious. They filter out races, whose reversals would lead to previously explored executions. Let us provide the intuition behind these conditions. (ii) If is in a schedule, then that schedule, call it , was generated by a race in an earlier explored execution . Hence was contained in . Moreover would race with the head of also in ; if appeared after the resulting new schedule had been generated already in ; if appeared before , then we would only undo the previous race reversal. This is illustrated in Fig. 1 by the race on y, between = y and y = 1 in . (iii) If is not fresh, then appeared with the same happens-before predecessors in an earlier explored execution , where it was in a race that would generate the same schedule as in . This is illustrated in Fig. 1 by the race on x, between x = 1 and = x in , which was already considered in .
4.2 Eager Race Reversal Strategy
A DPOR algorithm with parsimonious race reversal could be implemented so that the schedules constructed from races are all collected before they are explored. However, for some programs, the number of alternative schedules from a program point can be exponential. In order not to consume exponential memory, POP explores schedules eagerly: immediately after the creation of a schedule, exploration switches to continuations of that schedule. This policy can be realized by an algorithm that calls a recursive function to initiate exploration of a new schedule. To illustrate this policy, we show a simple realization of it in LABEL:alg:dpor-bj-sleepsets-recursive (Simple), in LABEL:sec:simple-dpor. We establish (in Lemma 1) that the recursion depth of such an algorithm is at most , where is the length longest execution of the analyzed program.
4.3 Parsimonious Sleep Set Characterization
As described in Section 2, a naïve use of sleep sets by POP may in the worst case consume exponential memory. Instead, POP uses a different mechanism for avoiding redundant exploration, in which each explored execution is equipped with
-
•
a sleep set, denoted , which is a set of read schedules that must not be performed during subsequent exploration from ; more precisely, the algorithm must not explore an execution such that for some , and
-
•
a conflict map, denoted , which checks that the subsequent exploration from does not perform a read event, which is in a parsimonious race whose generated schedule conflicts with a previous execution, as described in Section 2. The description in Section 2 has made the simplification to check for races that conflict with a previous execution without checking that such a race also satisfies the freshness condition (condition (iii) in Definition 7). In the following paragraphs, we will first explain the technique in this simplified setting, and thereafter explain how it is extended to fully consider the freshness condition. Note that the simplified technique is sufficient when there are no additional schedules betwen the racing events in the checked-for races.
Let be a sequence of events. A conflict detector for is a mechanism that for each event in an execution checks whether it conflicts with or happens-after an event that conflicts with . It is represented as a pair of form , where is a sequence of events and is a set of threads and accesses. At start of exploration, is initialized as . During exploration, contains the set of threads and accesses that have been performed in conflict with , as well as threads and accesses that have happened after events that have been performed in conflict with , whereas is the subsequence of which could still be performed by events that are not in conflict with . For an event , let denote the set . For a set of threads and accesses, and an event , let denote . For a pair consisting of a sequence of events and set of threads and accesses, and an event , let denote , i.e., that the thread of or the access of is in or that is not compatible with . For a sequence , let denote .
On each explored event , a conflict detector is updated to if , otherwise to It follows that the event is conflicting with the original sequence if .
We can now describe how the sleep set and conflict map are manipulated. When reversing a parsimonious race where is a read on , the execution is decomposed as and the schedule is formed as the initial fragment of an alternative execution, which extends . When that execution is explored, starting from , the sleep set is extended with the schedules of form such that , i.e., for fresh reads on that appear in . The conflict map is extended by adding a new conflict detector for to , initialized as .
The sleep set and conflict map are updated during exploration, as follows: on each explored event , we obtain from and from for each variable as follows:
that is, the sleep set is updated by removing from those sequences that are compatible with and completely discarding the other sequences; for each variable , if is a write on then , otherwise the conflict map is updated by adding the thread and access of to the set for each pair that depends with , and removing from in the others.
It follows that an extension of by an event which reads from should be blocked if either or for some .
Let us now leave the simplified setting and make a full description of the conflict map, for the case that the sequence between the racing events contains schedules. We then have to work a bit harder to check whether encountered read events are also fresh. In this case can be written as , where are the schedules in . Let denote for . It must be checked that the new exploration starting from does not exhibit an event which reads such that the schedule is in conflict with for some and furthermore happens-after for all . For this case, is for each variable extended from being just a conflict detector to being a pair , where is a tuple of conflict detectors and is a tuple of sets of threads and accesses. Here each conflict detector checks for conflicts with and each set collects threads and accesses that happen after , initialized with .
For a pair consisting of a sequence of conflict detectors and a sequence of sets of threads and accesses, let denote that there is an such that and for all . For a sequence , let denote that there is an i such that and .
During exploration is obtained from as described by the function in Algorithm 1, in the following way. For each variable , if is a write on then , otherwise the conflict map is updated as follows: For each pair in , then for each pair in , if depends with then is added to , otherwise is removed from . Furthermore each set in is extended with if .
It follows that an extension of by an event which reads from should be blocked if either or for some .
4.4 The Parsimonious-OPtimal DPOR (POP) Algorithm
The POP algorithm, shown as Algorithm 1, takes an input program, and explores its executions by repeated calls to the procedure . For each prefix of an execution that is under exploration, the algorithm maintains a sleep set and a conflict map , as described in Section 4.3, in order to prevent redundant exploration of read schedules. The algorithm first picks an enabled event , initializes the sleep sets of and , as well as the ranges of the conflict maps for and , to the empty set, whereafter it calls .
Each call to consists of a race reversal phase (Algorithms 1 to 1) and an exploration phase (Algorithms 1 to 1). In the race reversal phase, it considers all parsimonious races between an event in and the last event of (Algorithm 1). For each such race, of form , it decomposes as (Algorithm 1), and forms the schedule that reverses the race as (Algorithm 1). It then intends to call in order to recursively switch the exploration to the newly reversed race, but before that it needs to check that exploring will not be redundant, and also initialize and . To check that exploring will not be redundant, it checks that the schedule does not include a schedule in , and that the head of cannot form a schedule that is characterized by (Algorithm 1). If this check fails, this race will not lead to a call of . If, on the other hand, the check succeeds, is computed by first updating it from (Algorithm 1), and computing from using the function (Algorithm 1). If the last event of is a read event, then and must be further extended: is extended with the schedules of parsimonious write-read races with occurring before in (Algorithm 1), and is extended with the machinery for checking whether the execution will exhibit a read in a parsimonious race on the variable of that conflicts with with removed (Algorithm 1). After these preparations, is called recursively (Algorithm 1).
After the return of all recursive calls initiated in the race reversal phase, enters the exploration phase. There it picks an event that is enabled for execution. Before starting the execution, it must be checked that it is not a read event which is a sequence in the sleep set nor that it completes a schedule that is characterized by . These checks (Algorithm 1) are done in the same way as before a recursive call to . If the checks succeed, the call is prepared by updating (Algorithm 1) and (at Algorithm 1) in an analogous manner.
5 Correctness and Space Complexity
In this section, we state theorems of correctness, optimality, and space complexity of POP. We first consider correctness and optimality.
Theorem 5.1
For a terminating program , the POP algorithm has the properties that (i) for each maximal execution of , it explores some execution with , and (ii) it never explores two different but equivalent maximal executions, and (iii) it is never blocked (at LABEL:algbrl:explore-check) unless it has explored a maximal execution.
We thereafter consider space complexity.
Lemma 1
The number of nested recursive calls to at LABEL:algbrl:race-recursive-call is at most , where is the length of the longest execution of the program.
Note that in this lemma, we do not count the calls at LABEL:algbrl:explore-continue, since they are considered as normal exploration of some execution. Only the calls at LABEL:algbrl:race-recursive-call start the exploration of a new execution.
Theorem 5.2
LABEL:alg:bj-pop needs space which is polynomial in , where is the length of the longest execution of the analyzed program.
6 Implementation and Evaluation
Our implementation was done in a fork of the Nidhugg tool.222The code is publicly available on GitHub in simplified-dpor branch of a Nidhugg’s fork (https://github.com/sarbojit4/nidhugg). We also submitted our implementation as well as the benchmarking infrastructure we have used as an artifact (https://doi.org/10.5281/zenodo.11001033). Nidhugg is a state-of-the-art stateless model checker for C/C++ programs with Pthreads, which works at the level of LLVM Intermediate Representation, typically produced by the Clang compiler. Nidhugg comes with a selection of DPOR algorithms, one of which is Optimal DPOR [abdulla2014optimal] nowadays also enhanced with Partial Loop Purity elimination and support for await statements [Godot@FMCAD-22]. In our Nidhugg fork, we have added the POP algorithm as another selection. Its iterative implementation involved: (i) designing an efficient data structure to simulate recursive calls to , i.e., follow the next schedule to explore and backtrack to the previous execution when no further races to reverse; and (ii) develo** a procedure that filters out races that are not parsimonious races, and implementing it with the necessary data structures to store and maintain the parsimonious sleep sets.
Time (secs) | Memory (MB) | ||||||
Benchmark | Executions | TruSt | Optimal | POP | TruSt | Optimal | POP |
circular-buffer(7) | 3432 | 0.62 | 0.45 | 0.43 | 85 | 84 | 84 |
circular-buffer(8) | 12870 | 2.63 | 1.79 | 1.66 | 85 | 84 | 84 |
circular-buffer(9) | 48620 | 11.04 | 7.21 | 6.67 | 85 | 84 | 84 |
fib-bench(4) | 19605 | 1.08 | 1.93 | 1.82 | 85 | 84 | 84 |
fib-bench(5) | 218243 | 14.59 | 24.66 | 24.10 | 85 | 84 | 84 |
fib-bench(6) | 2364418 | 186.25 | 301.30 | 297.40 | 85 | 84 | 84 |
linuxrwlocks(6) | 99442 | 3.61 | 13.71 | 12.88 | 90 | 91 | 91 |
linuxrwlocks(7) | 829168 | 32.75 | 127.66 | 121.17 | 90 | 91 | 91 |
linuxrwlocks(8) | 6984234 | 311.93 | 1176.13 | 1119.23 | 90 | 91 | 91 |
filesystem(22) | 512 | 0.72 | 0.62 | 0.34 | 86 | 84 | 84 |
filesystem(24) | 2048 | 2.84 | 2.97 | 1.32 | 86 | 187 | 84 |
filesystem(26) | 8192 | 11.88 | 15.71 | 5.66 | 85 | 622 | 84 |
indexer(15) | 4096 | 11.07 | 8.58 | 5.65 | 89 | 116 | 90 |
indexer(16) | 32768 | 90.14 | 80.37 | 46.46 | 89 | 464 | 90 |
indexer(17) | 262144 | 736.78 | 827.02 | 399.87 | 89 | 3030 | 90 |
lastzero(10) | 3328 | 0.07 | 0.34 | 0.27 | 85 | 84 | 84 |
lastzero(15) | 147456 | 3.19 | 24.46 | 15.09 | 85 | 276 | 84 |
lastzero(20) | 6029312 | 152.13 | 1828.92 | 786.19 | 85 | 8883 | 84 |
exp-mem3(7) | 10080 | 0.22 | 0.67 | 0.54 | 86 | 104 | 85 |
exp-mem3(8) | 80640 | 1.96 | 6.15 | 4.61 | 86 | 506 | 85 |
exp-mem3(9) | 725760 | 19.11 | 73.68 | 44.83 | 86 | 4489 | 85 |
dispatcher(4) | 6854 | 1.15 | 1.75 | 1.47 | 90 | 90 | 90 |
dispatcher(5) | 151032 | 34.66 | 55.07 | 42.76 | 89 | 407 | 90 |
dispatcher(6) | 4057388 | 1245.13 | 2333.51 | 1424.57 | 89 | 9097 | 90 |
poke(10) | 135944 | 88.54 | 96.30 | 63.45 | 90 | 791 | 90 |
poke(15) | 728559 | 874.76 | 891.26 | 479.03 | 89 | 5527 | 90 |
poke(20) | 2366924 | 4502.45 | 4356.59 | 2008.92 | 90 | 22383 | 90 |
length-param(2,1024) | 4 | 0.14 | 0.05 | 0.06 | 85 | 84 | 84 |
length-param(2,8196) | 4 | 7.95 | 0.16 | 0.14 | 95 | 101 | 89 |
length-param(2,65536) | 4 | 1413.00 | 1.13 | 0.90 | 389 | 441 | 343 |
In the rest of this section, we evaluate the performance of POP’s implementation and compare it, in terms of time and memory, against the implementation of Optimal DPOR in Nidhugg commit 5805d77 and the graph-based Truly Stateless (TruSt) Optimal DPOR algorithm [TruSt@POPL-22] as implemented in GenMC v0.10.0 using options -sc --disable-instruction-caching. All tools employed LLVM 14.0.6. The evaluation was conducted on a desktop with a Ryzen 7950X CPU running Debian 12.4.
Table 1 contains the results of our evaluation. Its first nine benchmarks are from the DPOR literature, and are all parametric on the number of threads (shown in parentheses). The last benchmark, length-param, is synthetic and is additionally parametric on the length of its executions. Since these DPOR algorithms are optimal, they explore the same number of executions (2nd column) in all ten benchmarks. We will analyze the results in five groups (cf. Table 1).
The first group consists of three programs (circular-buffer from SCTBench [thomsonDB16], fib-bench from SV-Comp [svcomp:19], and the linuxrwlocks from SATCheck [SATCheck@OOPSLA-15]). Here, all algorithms consume memory that stays constant as the size of the program and the number of executions explored increase. We can therefore compare the raw performance of the implementation of these three DPOR algorithms. POP’s implementation is fastest on circular-buffer, while TruSt’s is fastest on the two other programs. However, notice that all three implementations scale similarly.
The second group consists of the two benchmarks (filesystem and indexer) from the “classic” DPOR paper of Flanagan and Godefroid [FG:dpor]. Here, Optimal DPOR shows an increase in memory consumption (measured in MB), while the other two algorithms use constant memory. POP is fastest here by approximately .
The third group, consisting of lastzero [abdulla2014optimal] and exp-mem3,333exp-mem3 is slight variant of the exp-mem program used in the TruSt paper. It uses atomic stores and loads instead of fetch-and-adds (FAAs), because the current implementation of Optimal DPOR (and POP) in Nidhugg employs an optimization which treats independent FAAs as non-conflicting [Godot@FMCAD-22] and explores only one trace on the exp-mem program independently of the benchmark’s parameter. two synthetic benchmarks also used in the TruSt paper [TruSt@POPL-22, Table 1], shows a similar picture in terms of memory consumption: Optimal DPOR’s increases more noticeably here, while the two other algorithms use memory that stays constant. Time-wise, TruSt is 2–5 faster than POP, which in turn is faster than Optimal.
The fourth group, consisting of two concurrent data structure programs (dispatcher and poke) from the Quasi-Optimal POR paper [Quasi-Optimal@CAV-18], shows Optimal’s memory explosion more profoundly, and provides further evidence of the good memory performance of the TruSt and POP algorithms. Time-wise, there is no clear winner here, with TruSt’s implementation being a bit faster on dispatcher, and with POP’s being faster and scaling slightly better than TruSt’s on poke.
Finally, let us examine the algorithms’ performance on length-param(,), a synthetic but simple program in which a number of threads (just two here) issue stores and loads to thread-specific global variables, followed by a store and a load to a variable shared between threads. The total number of executions is just four here, but the executions grow in length. One can clearly see the superior time performance of sequence-based DPOR algorithms, such as Optimal and POP, compared to graph-based algorithms such as TruSt. We can also notice that POP performs slightly better than Optimal in terms of memory.
We mention in passing that we also conducted an experiment in which we used the implementation of POP to check, under the SC memory model, the code of Linux kernel’s Hierarchical Read-Copy Update (Tree RCU) implementation [KoSa:sttt19]. Both Optimal DPOR and POP performed similarly: on our desktop, the experiment finished in about five minutes, exploring a bit over executions, and with very low (90MB) memory requirements. Since GenMC cannot handle RCU’s code, we do not include these results in the table, but we offer it as further evidence about POP’s implementation being able to handle “real code” out there.
Wrap** up our evaluation, we can make the following two general claims:
-
1.
Both POP and TruSt live up to their promise about performing SMC exploration which is optimal (w.r.t. the Mazurkiewicz equivalence) but also with polynomial (in fact, in practice, constant) space consumption.
-
2.
The implementation of the POP algorithm consistently outperforms that of Optimal DPOR in Nidhugg. This is mostly due to increased simplicity.