License: CC BY 4.0
arXiv:2403.02665v1 [cs.DS] 05 Mar 2024

DGAP: Efficient Dynamic Graph Analysis on Persistent Memory

Abdullah Al Raqibul Islam
Computer Science Department, University of North Carolina at Charlotte
Charlotte, NC, USA

Dong Dai
Computer Science Department, University of North Carolina at Charlotte
Charlotte, NC, USA
Abstract.

Dynamic graphs, featuring continuously updated vertices and edges, have grown in importance for numerous real-world applications. To accommodate this, graph frameworks, particularly their internal data structures, must support both persistent graph updates and rapid graph analysis simultaneously, leading to complex designs to orchestrate ‘fast but volatile’ and ‘persistent but slow’ storage devices. Emerging persistent memory technologies, such as Optane DCPMM, offer a promising alternative to simplify the designs by providing data persistence, low latency, and high IOPS together. In light of this, we propose DGAP, a framework for efficient dynamic graph analysis on persistent memory. Unlike traditional dynamic graph frameworks, which combine multiple graph data structures (e.g., edge list or adjacency list) to achieve the required performance, DGAP utilizes a single mutable Compressed Sparse Row (CSR) graph structure with new designs for persistent memory to construct the framework. Specifically, DGAP introduces a per-section edge log to reduce write amplification on persistent memory; a per-thread undo log to enable high-performance, crash-consistent rebalancing operations; and a data placement schema to minimize in-place updates on persistent memory. Our extensive evaluation results demonstrate that DGAP can achieve up to $3.2\times$ better graph update performance and up to $3.77\times$ better graph analysis performance compared to state-of-the-art dynamic graph frameworks for persistent memory, such as XPGraph, LLAMA, and GraphOne.

1. Introduction

The ability to ingest new graph data continuously and analyze the latest graphs efficiently is crucial for many real-world applications today. For instance, cellular network operators need to address traffic hotspots in their networks as they are generated and identified (Iyer et al., 2015). A dynamic graph framework that can both persistently store new graph updates and perform complex graph analysis on the latest graph is essential for supporting such applications. However, constructing such a framework is fundamentally challenging. Existing storage devices like SSDs, hard disks, or DRAM either lack persistence (as in volatile DRAM) or offer low performance on graph analysis (as in SSDs or hard disks). To handle both operations, graph frameworks must manage various storage devices, design unique data structures for each, and find a balance between them, leading to intricate systems. For example, GraphOne persists graph updates on SSD in an Edge List (EL), conducts graph analysis on DRAM using an Adjacency List (AL), and continuously synchronizes data between the two (Kumar and Huang, 2019).

Recently, a new set of non-volatile or persistent memory devices (PMs) has emerged, such as Intel Optane DC Persistent Memory (Optane, 2019). These devices can be accessed in bytes via the memory bus with data persistence guarantees. Compared to DRAM, PMs offer data persistence and greater density (e.g., Optane’s 512 GB/DIMM vs. DRAM’s 64 GB/DIMM). Compared to block-based devices, PMs allow byte-level access using load and store instructions with significantly lower latency (e.g., $\sim$300 ns vs. $\sim$100 ms) and higher IOPS (e.g., $\sim$10M vs. $\sim$500K for random writes) (Suzuki and Swanson, 2015; Qureshi et al., 2009a; Yang et al., 2013; Qureshi et al., 2009b). These characteristics suggest a promising alternative for building dynamic graph frameworks: employ PMs to serve both graph updates and graph analysis for persistence, speed, and capacity. This approach further avoids the cost of data movements and reduces the complexity of coordinating multiple data structures on different storage devices. Although Intel has discontinued Optane PMs for business reasons, millions of these devices remain available, and various new non-volatile memory solutions continue to emerge. We contend that designing high-performance storage systems on persistent memory devices remains both economically practical and beneficial, as evidenced by recent studies (Wang et al., 2022; Song et al., 2023).

However, directly porting existing graph frameworks to PMs can be sub-optimal. Existing dynamic graph frameworks, such as LLAMA (Macko et al., 2015) or GraphOne (Kumar and Huang, 2019), utilize block I/O interfaces, whose software overheads are not acceptable for byte-addressable PMs (Wu et al., 2020). The data structures are not tailored for PMs either, leading to potential performance issues (Islam et al., 2020, 2022b). Moreover, although PMs are persistent devices, writing data persistently is complicated due to the existence of volatile CPU caches. Extra flushing and fencing operations, though necessary, become costly without the right optimizations (Islam et al., 2020, 2022b). Unexpected crashes further necessitate expensive transactions to avoid partial writes, significantly impacting performance (Haria et al., 2020; Wu et al., 2021).

On the other hand, existing PM-specific dynamic graph frameworks, such as NVGRAPH (Lim et al., 2019) and, more recently, XPGraph (Wang et al., 2022), continue to follow the traditional approach of coordinating separate persistence-friendly and analysis-friendly data structures (i.e., edge list or adjacency list) on DRAM or PMs. This approach still leads to overly complicated data synchronization between data structures and creates unnecessary conversions or movements.

In this study, we introduce a novel approach to design a unified graph data structure, serving both graph persistence and analysis directly from persistent memory. To this end, we propose DGAP, a Dynamic Graph Analysis framework specifically designed for Persistent memory. DGAP is built upon a recently proposed mutable Compressed Sparse Row (CSR) graph structure (Wheatman and Xu, 2021; Islam et al., 2022a), which leverages Packed Memory Array (PMA) (Bender and Hu, 2007) for efficient graph updates and analysis. Instead of naively porting mutable CSR to PMs, DGAP introduces a series of new designs to enhance its performance on PMs. Firstly, DGAP introduces a new per-section edge log data structure to mitigate the write amplification issues associated with mutable CSR. Secondly, DGAP integrates new per-thread undo logs to support high-performance crash-consistent rebalancing operations, which are frequent and costly operations in mutable CSR. Thirdly, DGAP strategically caches various mutable CSR components in DRAM according to the workloads. Through these designs, DGAP is able to deliver exceptional performance on both graph updates and graph analysis by maximally utilizing PMs.

We implemented DGAP in around 2,000 lines of C++ code and compared its performance to that of state-of-the-art graph frameworks on PMs, using multiple graph analysis algorithms on different real-world graphs. Our results show that DGAP achieves up to $3.2\times$ improved graph update performance and $3.77\times$ enhanced graph analysis performance compared to leading graph frameworks, such as XPGraph, LLAMA, and GraphOne.

The remainder of this paper is organized as follows. In §2, we discuss the background and motivation of this study: we introduce persistent memory devices, existing graph storage formats including PMA-based mutable CSR, and, most importantly, why directly porting PMA-based mutable CSR to PMs does not work. In §3, we present the key components of DGAP and its operations in detail. We present the extensive experimental results in §4. We compare with related work in §5, and conclude the paper and discuss future work in §6.

2. Background and Motivation

2.1. PMs and Optane DCPMM Overview

2.1.1. Overview

Persistent memory describes storage devices that are accessible in bytes via memory interfaces and can retain the stored data after the power is off (Lee et al., 2010; Raoux et al., 2008; Kültürsay et al., 2013; Akinaga and Shima, 2010). Intel Optane DC Persistent Memory is the first commercially available PM (van Renen et al., 2019; Optane, 2019). Working on Intel Cascade Lake platforms, Optane can scale up to 24TB in a single machine (Corporation, 2019). It can be configured in either Memory mode or App Direct mode (pmem.io Persistent, 2019). In Memory mode, the Optane devices are exposed as DRAM, with the actual DRAM becoming a transparent ‘L4’ cache to accelerate data access. However, this mode does not support data persistence. In App Direct mode, Optane devices are directly exposed to users alongside DRAM. This mode allows users to access both DCPMM and DRAM and offers data persistence capability. In this study, we focus on App Direct mode.

2.1.2. Performance Features

PMs exhibit performance characteristics critical for building graph storage on them. For instance, their writes are slower due to the added persistence cost. Large sequential accesses often perform better than small random accesses due to the internal read/write buffers in these devices. Here, we use Optane DCPMM as an example to further highlight some performance features (van Renen et al., 2019; Izraelevitz et al., 2019; Shu et al., 2018; Yang et al., 2020; Xiang et al., 2022; Islam et al., 2020, 2022b). Firstly, the read/write performance of Optane DCPMM is asymmetric. Write operations, particularly persistent ones, incur significant overheads (e.g., up to $\sim$7-8$\times$ slower than DRAM). In contrast, read latencies are around $\sim$2-3$\times$ slower than DRAM. This underscores the importance of minimizing unnecessary writes. Secondly, since Optane DCPMM uses 256-byte internal write buffers, small random writes will perform much worse than large sequential writes. It is then critical to ensure the writes can be properly grouped (Gwennap, 2019).

2.1.3. Persistence Features

The challenge in achieving persistence on PMs is that not all the components in the memory hierarchy are persistent. Optane DCPMM introduces a concept called Asynchronous DRAM Refresh (ADR), which ensures that, during a power loss, all data in the ADR domain will be written to PMs. But ADR does not include CPU caches. To guarantee data persistence, programmers must explicitly call CLFLUSHOPT and SFENCE instructions to flush the cache line and enforce the order of memory operations (Chen and **, 2015). But even with the cache line flushed and memory fenced, large writes to PMs may still be partially persisted, as the atomic write unit is small (i.e., 8 bytes). Transactions are essential for ensuring data safety during large writes, yet they can significantly affect performance, as recent research suggests (Haria et al., 2020; Wu et al., 2021). Lately, extended ADR (eADR) was introduced in the 3rd generation Intel Xeon Scalable Processors to include CPU caches in the power-fail protected domain (Corporation, 2021). The eADR feature greatly simplifies programming (Zardoshti et al., 2020). But it is not available on all PM platforms. Applications need to recognize the devices and perform correctly and efficiently regardless of which platform is available. DGAP is implemented to work with both ADR and eADR platforms.
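The flush-and-fence pattern described above can be sketched as follows. This is only an illustration under stated assumptions: `persist()` and `store_and_persist()` are hypothetical helpers, the 64-byte cache-line size is assumed, and the sketch uses the older `_mm_clflush` intrinsic in place of CLFLUSHOPT so that it compiles on any x86-64 toolchain without extra flags. On other architectures it falls back to a plain fence and provides no persistence guarantee.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// Hypothetical helper: flush every cache line covering [addr, addr + len),
// then fence so that later stores are ordered after the flushes.
// Returns the number of cache lines touched.
inline std::size_t persist(const void* addr, std::size_t len) {
    const auto p = reinterpret_cast<std::uintptr_t>(addr);
    const std::uintptr_t first = p & ~static_cast<std::uintptr_t>(kCacheLine - 1);
    const std::uintptr_t last = p + len;
    std::size_t lines = 0;
    for (std::uintptr_t line = first; line < last; line += kCacheLine) {
#if defined(__x86_64__) || defined(_M_X64)
        _mm_clflush(reinterpret_cast<const void*>(line));
#endif
        ++lines;
    }
#if defined(__x86_64__) || defined(_M_X64)
    _mm_sfence();
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // ordering only
#endif
    return lines;
}

// An aligned 8-byte store matches the atomic write unit mentioned above,
// so it cannot be torn; it still has to be flushed to become durable on ADR.
inline void store_and_persist(std::uint64_t* slot, std::uint64_t value) {
    *slot = value;
    persist(slot, sizeof *slot);
}
```

Anything larger than 8 bytes can still be observed half-written after a crash, even with this helper, which is why the text turns to transactions and, later, to undo logging.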

2.2. Graph Store and CSR

At the heart of graph frameworks are their storage data structures. There have been a significant number of graph storage data structures, such as edge list (EL), adjacency list (AL), Compressed Sparse Row (CSR), and many others (Sahu et al., 2018; Besta et al., 2021) used in different graph frameworks (Ediger et al., 2012; Shao et al., 2013; Kumar and Huang, 2019; Roy et al., 2013; Zhang et al., 2018; Kyrola et al., 2012; Macko et al., 2015).

Edge List (EL) is a sequential edge log, efficient for edge additions but slow for vertex accesses since it requires scanning the entire edge log. The Adjacency List (AL) and its variations, like blocked adjacency list (Pinar and Heath, 1999), use a per-vertex linked list for storing vertex neighbors. While they perform well at graph insertions and single-vertex operations, they struggle with whole-graph analysis due to memory overheads and cache inefficiencies (Macko et al., 2015; Kumar and Huang, 2019).

Compressed Sparse Row (CSR), on the other hand, is designed for efficient graph analysis. It groups all edges from the same vertex together and stores them sequentially in an edge array, while the vertex array stores each vertex’s starting index. In this way, CSR supports both per-vertex queries and edge iterations efficiently. It delivers extreme graph analysis performance because most of the vertices and edges are accessed sequentially. Its major limitation, however, is that it cannot accommodate dynamic graph updates without rebuilding the entire edge array for each edge insertion. To address this limitation, recent studies have proposed using the Packed Memory Array (PMA) to make the edge array mutable (Sha et al., 2017; Wheatman and Xu, 2018; De Leo and Boncz, 2021; Islam et al., 2022a). Such mutable CSR data structures can offer extreme graph analysis performance while handling graph updates efficiently, making them a perfect candidate for building a PM-based graph framework.
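To make the static CSR layout concrete, the following minimal sketch builds the two arrays from an edge list. The names `CSR`, `build_csr`, and `degree` are illustrative and not taken from any of the cited frameworks.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct CSR {
    std::vector<std::uint64_t> offsets;  // vertex array: offsets[v] = first edge of v
    std::vector<std::uint32_t> edges;    // edge array: destination IDs grouped by source
};

using EdgeList = std::vector<std::pair<std::uint32_t, std::uint32_t>>;

// Build a CSR for n vertices from (src, dst) pairs with a counting pass.
CSR build_csr(std::uint32_t n, const EdgeList& el) {
    CSR g;
    g.offsets.assign(n + 1, 0);
    for (const auto& e : el) ++g.offsets[e.first + 1];  // per-vertex degree
    for (std::uint32_t v = 0; v < n; ++v) g.offsets[v + 1] += g.offsets[v];  // prefix sum
    g.edges.resize(el.size());
    std::vector<std::uint64_t> cursor(g.offsets.begin(), g.offsets.end() - 1);
    for (const auto& e : el) g.edges[cursor[e.first]++] = e.second;
    return g;
}

// The neighbors of v occupy edges[offsets[v] .. offsets[v + 1]).
std::uint64_t degree(const CSR& g, std::uint32_t v) {
    return g.offsets[v + 1] - g.offsets[v];
}
```

Because every vertex's edges are packed back to back, inserting one edge for vertex `v` would shift all edges of the vertices after `v` and rewrite their offsets. That cost is what the PMA gaps are meant to avoid.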

2.3. PMA-based Mutable CSR

The Packed Memory Array (PMA) is fundamentally a sorted array with reserved empty gaps interspersed (Bender and Hu, 2007). These gaps provide room for future insertions without shifting the entire array. To maintain the gap density, PMA employs a binary PMA Tree to track density changes in different sections of the array. For any section located at tree height $i$, PMA assigns the lower and upper bound density thresholds as $\rho_i$ and $\tau_i$. When insertions or deletions push the density of a section out of this range, PMA initiates rebalancing operations to adjust its gap density by redistributing gaps among adjacent sections. The rebalancing happens at a level where the combined density of all affected sections falls within the density range. If the whole array is full, PMA resizes the array by increasing its size. The amortized write overhead for adaptive PMA is $O(\log N)$. More details about PMA can be found in (Bender and Hu, 2007).
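The search for a rebalancing level can be sketched as below. The text does not give concrete values for $\rho_i$ and $\tau_i$, so the sketch assumes the linear interpolation between leaf and root thresholds that is common in the PMA literature, with illustrative constants. The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Thresholds { double lower; double upper; };  // (rho_i, tau_i)

// Assumed constants: leaves tolerate a wide density range, the root a narrow one.
constexpr double kRhoLeaf = 0.10, kRhoRoot = 0.30;
constexpr double kTauRoot = 0.70, kTauLeaf = 0.90;

// Thresholds for a node at `level` (0 = leaf section) in a tree of `height`.
Thresholds thresholds_at(int level, int height) {
    const double f = height == 0 ? 0.0 : static_cast<double>(level) / height;
    return {kRhoLeaf + (kRhoRoot - kRhoLeaf) * f,
            kTauLeaf - (kTauLeaf - kTauRoot) * f};
}

// counts[i] is the number of elements in leaf section i, each of capacity `cap`;
// the number of leaves is assumed to be a power of two. Walk up from `leaf`
// until the enclosing window's density is within its thresholds. Returns that
// level, or -1 if even the root is out of range (the array must be resized).
int rebalance_level(const std::vector<std::size_t>& counts, std::size_t cap,
                    std::size_t leaf) {
    int height = 0;
    while ((std::size_t{1} << height) < counts.size()) ++height;
    for (int level = 0; level <= height; ++level) {
        const std::size_t width = std::size_t{1} << level;
        const std::size_t begin = (leaf / width) * width;
        std::size_t total = 0;
        for (std::size_t i = begin; i < begin + width; ++i) total += counts[i];
        const double density = static_cast<double>(total) / static_cast<double>(width * cap);
        const Thresholds t = thresholds_at(level, height);
        if (density >= t.lower && density <= t.upper) return level;
    }
    return -1;
}
```

For example, with four sections of capacity 8 holding 8, 2, 3, and 3 elements, section 0 is over its leaf threshold, but the pair of sections 0 and 1 has density 10/16, so the gaps are redistributed at level 1.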

PMA-based mutable CSRs incorporate this concept by replacing the original CSR edge array with a packed memory array, as first exemplified by PCSR (Wheatman and Xu, 2018). VCSR (Islam et al., 2022a) further optimizes PCSR by considering the skewed workloads inherent in real-world graphs. It partitions the edge array into varied-size sections and distributes the gaps unevenly based on historical workloads in each section to improve performance.

2.4. Issues of Mutable CSR on PMs

PMA-based mutable CSR has been proven effective in supporting both graph updates and analysis. However, due to the unique features of PMs, a naive implementation performs poorly, as summarized in the following three issues.

2.4.1. Write Amplification Issue

Although mutable CSR avoids shifting the entire edge array for insertion, it still requires shifting a small range of elements if the targeted insertion location is occupied. These additional shifts result in write amplification. Compared to DRAM, write amplification on PMs is more critical due to PMs’ asymmetric read/write performance. Additionally, these nearby shifts often occur within a range smaller than 256 bytes, the size of the Optane DCPMM internal write buffer. This forces the buffers to be flushed before being filled, leading to inefficient buffer utilization. To illustrate the issue, we inserted the real-world graph Orkut (Leskovec and Krevl, 2014) into a mutable CSR implementation and calculated the ratio of actual memory writes vs. the edge size (write amplification) during insertions. Figure 1(a) reports this ratio. We can observe that the write amplification can be as high as $7\times$. It is hence critical to address this issue.

Figure 1. Issues of PMA-based mutable CSRs on PMs. The evaluation platform is described in Sec. §4.1.

2.4.2. Crash Consistency Issue

In addition to nearby shifts, insertions could further trigger PMA rebalancing when a section becomes full. These rebalancing operations move large chunks of sequential elements to new locations. Although efficient in DRAM, these operations are costly on PMs due to the persistence guarantee. It is necessary to use transactions to protect large chunks of writes. However, as demonstrated in Figure 1(b), transactions are extremely expensive on PMs. The time required to insert a graph into DRAM, PMs (without transactions), and PMs-TX (with transactions) differ substantially (Islam et al., 2020, 2022b). Therefore, it is crucial to develop efficient crash recovery for frequent rebalancing operations.

2.4.3. In-place Update Issue

In-place updates in DRAM are efficient, leveraging the cache. Persistent in-place updates on PMs, however, are exactly the opposite. Figure 1(c) illustrates the performance of in-place updates on PMs. We present the latency of writing the same size of data in a sequential (Seq), random (Rnd), and in-place (In-place) manner, respectively. We can observe a $7\times$ difference in latency. The reason is that persistent in-place updates repeatedly flush the same cache line and dramatically slow down the performance due to the blocking of previous flushing operations and possible on-chip wear-leveling protection (Izraelevitz et al., 2019). Crucial components of mutable CSR, such as the vertex degree and the PMA tree, require frequent in-place updates. Conducting these updates directly on PMs would be significantly slow. It is essential to design the data placement strategy to minimize in-place updates on PMs.

3. DGAP Design and Implementation

DGAP, as illustrated in Fig. 2, is designed to address the three issues outlined in Sec. §2.4. Its architecture comprises four primary components: 1 vertex array, 2 edge array, 3 per-section edge log, and 4 per-thread undo log. When interacting with DGAP, users launch multiple writer threads for graph updates and can execute multi-threaded graph analysis tasks on the latest graphs. DGAP ensures that each analysis task accesses only the graph snapshot that was latest when the task started. This guarantees that long-running, multi-iteration graph algorithms can access a consistent graph throughout their runs.

1 Vertex Array

DGAP stores all vertices sequentially in the vertex array. These sequential vertex IDs result from pre-processing by upstream applications, and their range is often known. Consequently, DGAP can pre-allocate the vertex array accordingly. Each vertex ($v$) in the vertex array takes 16 bytes to store three key pieces of metadata: its current degree ($degree_v$, 4 bytes), starting index in the edge array ($start_v$, 8 bytes), and a pointer to its per-section edge log ($el_v$, 4 bytes). The most important design decision about the DGAP vertex array is placing it entirely in DRAM. The main reason behind this design decision is to prevent frequent in-place updates on PMs. For dynamic graphs, the vertex degree ($degree_v$) must be updated each time an edge is inserted. The pointer to the per-section edge log ($el_v$) also changes when edges are added to the edge log. Both operations are frequent enough to significantly impact overall performance if executed as in-place updates on PMs. Storing them in DRAM effectively avoids this issue.
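A 16-byte vertex record with the three fields above could look like the following sketch. The struct name, field order, and the use of `-1` for an empty edge-log pointer are assumptions made for illustration. The 8-byte field is placed first because, on common 64-bit ABIs, putting it between the two 4-byte fields would pad the record to 24 bytes.

```cpp
#include <cstdint>

// Hypothetical DRAM vertex record: 8 + 4 + 4 = 16 bytes, no padding.
struct VertexEntry {
    std::uint64_t start;   // start_v: index of the vertex's first slot in the edge array
    std::uint32_t degree;  // degree_v: current number of edges
    std::int32_t  elog;    // el_v: most recent entry in the section's edge log, -1 if none
};
static_assert(sizeof(VertexEntry) == 16, "vertex record should occupy 16 bytes");

// DRAM needed for the whole vertex array.
constexpr std::uint64_t vertex_array_bytes(std::uint64_t num_vertices) {
    return num_vertices * sizeof(VertexEntry);
}
```

With this layout, `vertex_array_bytes(1'000'000'000)` evaluates to 16,000,000,000 bytes, i.e. roughly 16 GB for one billion vertices.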

Data safety is a critical issue when storing the entire vertex array in DRAM. DGAP introduces a new pivot element for each vertex in the edge array and leverages these elements to reconstruct the entire vertex array after a crash. More details are provided in Sec. §3.1.5. The reconstruction is fast due to the high bandwidth of PMs for sequential accesses. Detailed results are reported in the evaluation section. Another potential concern is DRAM capacity. Theoretically, each DGAP vertex takes 16 bytes, so 16GB DRAM can store 1 billion vertices. Since most graphs have more edges than vertices, we anticipate that the capacity issue will primarily affect the PMs edge array rather than the DRAM vertex array.

Figure 2. Overall architecture of DGAP.

2 Edge Array

DGAP stores all the edges in the edge array on persistent memory. The edge array is a PMA constructed based on the VCSR strategy (Islam et al., 2022a). Following other dynamic graph frameworks (e.g., XPGraph, GraphOne), each DGAP edge takes 4 bytes, as it only stores the destination vertex ID. Storing the source vertex ID is unnecessary, as it is shared by all edges originating from the same vertex. The source vertex ID is instead stored as a pivot element at the beginning of each vertex’s edge list. The pivot element serves as additional metadata in DGAP to reconstruct the DRAM vertex array after crashes. Specifically, the pivot is a special ‘edge’ element with a value of −vertex-id. Since it is negative and therefore illegal as a vertex ID, it can be used to denote the start of a vertex’s edge list during recovery. Further details about DGAP recovery are in Sec. §3.1.5.
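The role of the pivot during recovery can be illustrated with a single sequential scan over the edge array. This sketch makes several assumptions that the text does not state: slots are signed 32-bit integers, vertex IDs start at 1 (so that −vertex-id is always negative), a gap is encoded as 0, and $start_v$ refers to the slot right after the pivot. It ignores the per-section edge logs, and `rebuild_vertex_array` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

constexpr std::int32_t kGap = 0;  // assumed encoding of an empty PMA slot

struct RecoveredVertex {
    std::int32_t  id;
    std::uint64_t start;   // first slot after the pivot
    std::uint32_t degree;  // non-gap slots up to the next pivot
};

// One pass over the edge array: a negative slot opens a new vertex,
// every following non-gap slot is one of that vertex's edges.
std::vector<RecoveredVertex> rebuild_vertex_array(
        const std::vector<std::int32_t>& edge_array) {
    std::vector<RecoveredVertex> out;
    for (std::uint64_t i = 0; i < edge_array.size(); ++i) {
        const std::int32_t slot = edge_array[i];
        if (slot < 0) {
            out.push_back({-slot, i + 1, 0});
        } else if (slot != kGap && !out.empty()) {
            ++out.back().degree;
        }
    }
    return out;
}
```

The scan reads the array strictly front to back, which matches the sequential access pattern that PMs serve well.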

One important design decision regarding the DGAP edge array is the storage order of all edges for a vertex. Traditionally, the edges of a vertex are sorted based on their destination vertex ID (Wheatman and Xu, 2018). However, DGAP stores them according to their insertion order, meaning a new edge will always be stored at the end of the vertex’s edge list. So, an edge ($1 \to 2$) may be stored after edge ($1 \to 6$). This seemingly minor change is critical for DGAP to maintain a consistent snapshot of the latest graph for analysis tasks. This means that for any vertex $v$, if we know its degree at time $t$ ($degree_v^t$), we can easily determine its readable edges for $Task_t$, which should fall within the range $[start_v, start_v + degree_v^t)$. Any edge after that will not be visible to $Task_t$. Hence, creating a snapshot of the latest graph only involves storing the degrees of all vertices at time $t$. At present, we simply cache this degree info in the Degree Cache within each task’s DRAM space, as shown in Fig. 2. This can be done at the beginning of the analysis tasks. The primary issue here is memory cost. Many of the degrees are the same and do not need to be stored in each task. In the future, we plan to implement a Copy-on-Write (CoW) Degree Cache so that all tasks and the main vertex array can share unchanged degrees without wasting memory.
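The snapshot mechanism can be illustrated with a small in-memory model. The types and function names here are hypothetical, the edge array is an ordinary vector standing in for PM, and the sketch assumes a gap is always available at the insertion point.

```cpp
#include <cstdint>
#include <vector>

struct Graph {
    std::vector<std::uint64_t> start;   // start_v per vertex
    std::vector<std::uint32_t> degree;  // live degree_v per vertex
    std::vector<std::uint32_t> edges;   // edge array; slots past the degree are gaps
};

using DegreeCache = std::vector<std::uint32_t>;  // degree_v^t for every v

// Taking a snapshot copies only the degrees, never the edges.
DegreeCache take_snapshot(const Graph& g) { return g.degree; }

// Append in insertion order at start_v + degree_v.
void append_edge(Graph& g, std::uint32_t v, std::uint32_t dst) {
    g.edges[g.start[v] + g.degree[v]] = dst;
    ++g.degree[v];
}

// A task reads only [start_v, start_v + degree_v^t), so later appends stay invisible.
std::vector<std::uint32_t> neighbors_at(const Graph& g, const DegreeCache& snap,
                                        std::uint32_t v) {
    const std::uint32_t* first = g.edges.data() + g.start[v];
    return std::vector<std::uint32_t>(first, first + snap[v]);
}
```

If edge $1 \to 6$ is appended, a snapshot is taken, and edge $1 \to 2$ is appended afterwards, a task holding that snapshot still sees only neighbor 6 for vertex 1. A sorted layout could not offer this, because the later edge would land in front of the earlier one.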

3 Per-section Edge Log

A primary performance challenge in existing PMA-based mutable CSRs on PMs is the write amplifications caused by nearby shifts within each PMA section during insertions. To mitigate this, our principal approach is to temporarily hold these insertions in a persistent log and merge them back in batches later. We introduce the concept of per-section edge logs in DGAP, representing a pre-allocated, continuous, fixed-size space (ELOG_SZ) on PMs dedicated for each PMA section. These logs temporarily store new edge insertions when a nearby shift becomes necessary. In our prototype, ELOG_SZ is set to 2K bytes.

Each element stored in the edge log contains three metadata components and occupies 12 bytes: (i) source vertex ID, (ii) destination vertex ID, and (iii) a back-pointer. This back-pointer is designed to connect all edges originating from the same source vertex, arranging them in reverse order within the edge log. The most recent edge points back to the preceding edge of the same source vertex in the log. The edge log pointer ($el_v$), stored in the vertex array, pinpoints the most recent edge of a vertex in the edge log. A detailed insertion workflow of DGAP is shown in Fig. 3.

When the per-section edge log reaches 90% usage, a merging operation is initiated, integrating the edge log data back into the edge array. Notably, edges within the edge log also contribute to the density of the corresponding edge array section. Therefore, the standard PMA rebalancing operations might be triggered if either the edge array or edge log is approaching full capacity. During DGAP rebalancing, data from both the edge array sections and their respective edge logs are considered.
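The edge log entry format, the back-pointer chain, and the 90% merge trigger can be sketched as follows. Only the 12-byte entry size, the 2K-byte ELOG_SZ, and the 90% threshold come from the text; the type names, the use of entry indices as pointers, and `-1` as the NULL value are assumptions, and a vector stands in for the pre-allocated PM region.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int32_t kNoEntry = -1;  // assumed encoding of a NULL el_v

struct LogEntry {
    std::uint32_t src;
    std::uint32_t dst;
    std::int32_t  prev;  // back-pointer: previous entry with the same src, or kNoEntry
};
static_assert(sizeof(LogEntry) == 12, "log entry should occupy 12 bytes");

constexpr std::size_t kElogBytes = 2048;                              // ELOG_SZ
constexpr std::size_t kElogCapacity = kElogBytes / sizeof(LogEntry);  // 170 entries

struct SectionLog {
    std::vector<LogEntry> entries;

    // Append an edge; `head` is the caller's current el_v for src.
    // Returns the new el_v to store in the vertex array.
    std::int32_t append(std::uint32_t src, std::uint32_t dst, std::int32_t head) {
        entries.push_back({src, dst, head});
        return static_cast<std::int32_t>(entries.size() - 1);
    }

    // Merge once 90% of the log is used.
    bool needs_merge() const { return entries.size() * 10 >= kElogCapacity * 9; }
};

// Follow the chain from el_v; edges come out newest first.
std::vector<std::uint32_t> logged_neighbors(const SectionLog& log, std::int32_t head) {
    std::vector<std::uint32_t> out;
    for (std::int32_t i = head; i != kNoEntry;
         i = log.entries[static_cast<std::size_t>(i)].prev) {
        out.push_back(log.entries[static_cast<std::size_t>(i)].dst);
    }
    return out;
}
```

Under these assumptions a 2K-byte log holds 170 entries, so a merge is triggered once 153 of them are in use. Since the chain yields edges newest first, a reader that needs insertion order has to reverse the result.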

4 Per-thread Undo Log

PMA rebalancing, which redistributes gaps among sections, is critical for mutable CSR. To ensure data safety, it requires transaction mechanisms to avoid partial writes and guarantee crash consistency. While existing PM programming libraries like PMDK (pmem.io Persistent, 2019) support transactions natively, using them directly for recurrent rebalancing operations results in significant overhead due to two major bottlenecks: 1) the high memory allocation cost of frequent journal allocations and 2) performance overheads due to excessive ordering (Haria et al., 2020). In DGAP, we introduce a per-thread undo log specifically to enhance the performance of rebalancing while ensuring crash consistency. During insertion, whenever a Writer Thread triggers rebalancing, before actually moving data, it first persistently backs up the data to be relocated, chunk by chunk, in its own undo log. If a crash happens in the middle, we can recover the data from the undo log. The per-thread undo log is pre-allocated with a fixed size (i.e., ULOG_SZ) for each Writer Thread. In our prototype, ULOG_SZ is set to 2K bytes.
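The chunk-by-chunk backup idea can be sketched with an in-memory simulation. This is not DGAP's actual rebalancing code: the `UndoLog` layout, the `valid` flag protocol, and the `crash_in_chunk` parameter (which simulates a power loss halfway through a chunk) are all assumptions, and `persist_stub` marks the points where a flush and fence would be needed on real PM.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kUlogBytes = 2048;                                  // ULOG_SZ
constexpr std::size_t kUlogSlots = kUlogBytes / sizeof(std::int32_t);     // 512 slots

struct UndoLog {                 // one per writer thread, pre-allocated
    bool valid = false;          // true while a chunk is being overwritten
    std::size_t offset = 0;      // where the backed-up chunk lives in the array
    std::size_t count = 0;       // number of slots backed up
    std::int32_t backup[kUlogSlots] = {};
};

inline void persist_stub(const void*, std::size_t) {}  // placeholder for flush + fence

// Overwrite array[offset, offset + fresh.size()) with `fresh`, one chunk at a time.
// Returns false on invalid arguments or on a simulated crash.
bool overwrite_with_undo(std::vector<std::int32_t>& array, std::size_t offset,
                         const std::vector<std::int32_t>& fresh, UndoLog& log,
                         std::size_t chunk = kUlogSlots,
                         std::ptrdiff_t crash_in_chunk = -1) {
    if (chunk == 0 || chunk > kUlogSlots || offset + fresh.size() > array.size())
        return false;
    std::size_t idx = 0;
    for (std::size_t done = 0; done < fresh.size(); done += chunk, ++idx) {
        const std::size_t n = std::min(chunk, fresh.size() - done);
        // 1. Back up the old contents, persist them, then mark the log valid.
        std::copy_n(array.data() + offset + done, n, log.backup);
        log.offset = offset + done;
        log.count = n;
        persist_stub(&log, sizeof log);
        log.valid = true;
        persist_stub(&log.valid, sizeof log.valid);
        // 2. Overwrite in place (only half of the chunk if a crash is simulated).
        const bool crash = static_cast<std::ptrdiff_t>(idx) == crash_in_chunk;
        std::copy_n(fresh.data() + done, crash ? n / 2 : n, array.data() + offset + done);
        if (crash) return false;  // power loss: the log stays valid
        persist_stub(array.data() + offset + done, n * sizeof(std::int32_t));
        // 3. Retire the log entry.
        log.valid = false;
        persist_stub(&log.valid, sizeof log.valid);
    }
    return true;
}

// After a restart: roll back the chunk that was in flight, if any.
void recover(std::vector<std::int32_t>& array, UndoLog& log) {
    if (!log.valid) return;
    std::copy_n(log.backup, log.count, array.data() + log.offset);
    log.valid = false;
}
```

Because the log is pre-allocated and reused for every chunk, no journal allocation happens on the rebalancing path. The sketch only guarantees that no chunk is left half-written; how completed chunks are handled after a crash is outside what it models.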

3.1. DGAP Graph Operations

This section explains how the DGAP components work together to serve various graph operations.

3.1.1. Initialization

When DGAP starts for the first time, it takes multiple user-specific parameters for system initialization. The number of vertices and edges in the graph are specified by the parameters INIT_VERTICES_SIZE and INIT_EDGES_SIZE. DGAP allocates the initial vertex array in DRAM and the edge array in PMs accordingly. Both parameters are just initial user estimations. The actual numbers of vertices or edges can significantly surpass these values. When this happens, DGAP automatically resizes both the vertex and edge arrays during insertions. DGAP also utilizes the parameters ELOG_SIZE and ULOG_SIZE to pre-allocate the per-section edge logs and per-thread undo logs. Furthermore, DGAP initializes multiple key metadata pieces on PMs for its operation. For instance, it maintains a global flag, NORMAL_SHUTDOWN, on PMs to determine if DGAP had a graceful shutdown in its previous session. Whenever DGAP restarts, this value guides the system initialization process. In addition, DGAP creates and upholds various DRAM indexing metadata, including the PMA tree for density tracking. Locks are allocated based on this PMA tree to ensure concurrent reads/writes in DGAP. More details are discussed in later subsections.

Figure 3. Two insertion cases in DGAP. The dashed blue line points to the starting index of a vertex in the edge array.

3.1.2. Graph Updates

DGAP utilizes the PMA-based mutable CSR structure to enable dynamic graph updates. For edge updates, an edge pair ($v_{src}, v_{dst}$) will be fed to the g.insertE() call. For vertex updates, a vertex ID ($v_{src}$) will be fed to the g.insertV() call. Edge updates include both edge insertions and deletions. Deletions are executed by re-inserting the same edge marked with a tombstone flag. Specifically, we set the first bit of the destination vertex ID to 1, signifying that the edge has been removed from the graph. In the following, we delve into edge insertion operations.
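The tombstone encoding can be illustrated as below. The text does not say which bit "first" denotes, so the sketch assumes the most significant bit of an unsigned 32-bit ID; a real layout must also keep tombstones distinguishable from the pivot encoding. How readers resolve tombstones is not described here either, so `live_neighbors` shows just one possible interpretation, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kTombstone = 0x80000000u;  // assumed position of the flag bit

constexpr std::uint32_t mark_deleted(std::uint32_t dst) { return dst | kTombstone; }
constexpr bool is_deleted(std::uint32_t slot) { return (slot & kTombstone) != 0; }
constexpr std::uint32_t vertex_of(std::uint32_t slot) { return slot & ~kTombstone; }

// Replay one vertex's slots in insertion order: a tombstone cancels the
// earliest still-live edge to the same destination.
std::vector<std::uint32_t> live_neighbors(const std::vector<std::uint32_t>& slots) {
    std::vector<std::uint32_t> live;
    for (const std::uint32_t s : slots) {
        if (is_deleted(s)) {
            const auto it = std::find(live.begin(), live.end(), vertex_of(s));
            if (it != live.end()) live.erase(it);
        } else {
            live.push_back(s);
        }
    }
    return live;
}
```

Treating a deletion as one more append keeps the edge array append-only per vertex, so the snapshot rule based on degrees continues to hold for deletions as well.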

Edge insertion includes two steps: 1) inserting the new edge into the edge array or edge log, and 2) updating the degree and pointer in the vertex array. The DRAM vertex array is updated only after the PMs edge array has been successfully updated and flushed. In this way, even if a crash happens after the PMs updates, the DRAM data structures can be reconstructed afterward. Given that we store all edges of a vertex chronologically, the insertion point for a new edge ($v_{src}, v_{dst}$) in the edge array can be easily determined. It can be calculated directly from the degree of $v_{src}$ and its starting index using the formula $(start_{v_{src}} + degree_{v_{src}})$. If the calculated location is a gap, the new edge can be inserted in an atomic manner. However, if the location is taken by a subsequent vertex, which would require a nearby shift, DGAP appends the edge to the per-section edge log to minimize write amplification.
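The two insertion paths can be sketched for a single section as follows. The model is simplified and rests on assumptions: slots are signed 32-bit values with 0 as the gap marker and negative values as pivots, vertex IDs start at 1, vectors stand in for the PM edge array and edge log, and once a vertex has entries in the log, its later edges also go to the log so that insertion order is preserved. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::int32_t kGap = 0;    // assumed encoding of an empty slot
constexpr std::int32_t kNull = -1;  // assumed encoding of a NULL el_v

struct Vertex { std::uint64_t start; std::uint32_t degree; std::int32_t elog; };
struct LogEntry { std::int32_t src; std::int32_t dst; std::int32_t prev; };

struct Section {
    std::vector<std::int32_t> slots;                    // stands in for the PM edge array
    std::vector<LogEntry> log;                          // stands in for the PM edge log
    std::unordered_map<std::int32_t, Vertex> vertices;  // DRAM vertex array
};

enum class Placement { EdgeArray, EdgeLog };

Placement insert_edge(Section& s, std::int32_t src, std::int32_t dst) {
    Vertex& v = s.vertices.at(src);
    const std::uint64_t pos = v.start + v.degree;  // start_v + degree_v
    const bool in_place =
        v.elog == kNull && pos < s.slots.size() && s.slots[pos] == kGap;
    if (in_place) {
        s.slots[pos] = dst;  // one 4-byte store; flushed and fenced on real PM
    } else {
        s.log.push_back({src, dst, v.elog});  // append, linking to the previous entry
        v.elog = static_cast<std::int32_t>(s.log.size() - 1);
    }
    ++v.degree;  // DRAM metadata changes only after the persistent write
    return in_place ? Placement::EdgeArray : Placement::EdgeLog;
}
```

With made-up slot contents `{-6, 2, 3, -7, 5, -8, 1, 0, -9, 4, 0, 0}`, inserting $8 \to 9$ lands in the gap at index 7, while inserting $6 \to 1$ and then $6 \to 4$ finds vertex 7's pivot at the computed position and therefore goes to the edge log, with the second entry pointing back to the first.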

Fig. 3 illustrates two DGAP insertion scenarios. This figure provides a snapshot of the vertex array, edge array, and the associated per-section edge log. Here, edges for vertices (6, 7, 8, 9) are showcased on the edge array with gaps, while the per-section edge log is empty. Fig. 3(a) first shows a normal insertion case ($8 \to 9$) where the intended edge location is empty. In this case, the edge is inserted into the edge array (marked in red). Fig. 3(b) shows another scenario where the desired locations for a series of edge insertions (e.g., $6 \to 1$, $6 \to 4$) are already taken (by vertex 7 and its edges). In this case, new edges will be stored in the per-section edge log to reduce unnecessary data shifts within the edge array. Multiple edges of the same vertex in the edge log will be connected using the back-pointer, shown as the black arrow from (6, 4) to (6, 1) in Fig. 3(b).

After many edge insertions, the corresponding section of the edge array becomes full. This triggers a PMA rebalancing operation that redistributes the gaps among adjacent sections to ensure all the sections maintain a satisfactory density. While DGAP adopts the same logic as PMA to initiate the rebalancing, it carries out the operation with assistance from the per-thread undo log to guarantee data consistency. Further details about crash-consistent rebalancing are elaborated in Sec 3.1.4.

3.1.3. Graph Analysis

DGAP supports graph analysis by offering high-performance interfaces to iterate through all vertices (i.e., g.v()) and the edges associated with a vertex (i.e., v.e()). Graph analysis tasks might run for extended durations. For instance, the PageRank algorithm executed on the Orkut graph can take over 20 seconds. During this time, the graph may be updated. To ensure a consistent view of the graph, it is necessary to guarantee that future reads from the same task bypass the newly added data. To achieve this, users must first call the g.consistent_view() function prior to iterating through the graph in their analysis tasks. Once this function is invoked, DGAP allocates a Degree Cache for the analysis task and temporarily holds the graph updates. It then copies the degree part of the vertex array to the per-task Degree Cache. This snapshot of degree information aids in pinpointing the appropriate set of edges for reading during task execution.

Once the snapshot is created, DGAP starts serving data-accessing function calls. For each call, DGAP initially reads the required metadata about $v$ from the DRAM vertex array, then accesses the PM edge array based on that. The necessary metadata from the vertex array includes the starting index ($start_v$) and the edge log pointer ($el_v$). The degree information is obtained from the Degree Cache created at the task starting time $t$ ($degree_v^t$). If $el_v$ is NULL, then iterating through $v$'s edges involves simply iterating the corresponding edge array from $start_v$ to ($start_v + degree_v^t$). If $el_v$ is not NULL, some of the edges may also reside in the edge log. In this case, we first scan the edge array. If the edge array does not contain as many edges as needed (based on $degree_v^t$), DGAP proceeds to scan the edge log. The $el_v$ pointer always points to the last edge. From this point, we track all edges in the edge log through their back-pointers. To read only the required number of edges (denoted $rest_v^t$), we allocate a first-in-first-out (FIFO) buffer with a size of $rest_v^t$ to keep only the necessary edges.
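The read path above can be illustrated with a small sketch. It is one possible reading of the description, not DGAP's code: the function name `edges_at_snapshot`, the `(src, dst)` slot layout, and the `(src, dst, back_pointer)` log entries are assumptions. The bounded FIFO is modeled with `collections.deque(maxlen=...)`, which evicts the earliest-pushed items; since the log is walked from newest to oldest, the entries that survive are the oldest $rest_v^t$ ones, i.e. those that already existed when the snapshot was taken.

```python
from collections import deque


def edges_at_snapshot(edge_array, edge_log, src, start, el, degree_t):
    """Return the first `degree_t` destinations of `src` in insertion order."""
    result = []
    pos = start
    # 1) Scan the vertex's contiguous run in the edge array.
    while (len(result) < degree_t and pos < len(edge_array)
           and edge_array[pos] is not None and edge_array[pos][0] == src):
        result.append(edge_array[pos][1])
        pos += 1
    # 2) If edges are still missing, follow the back-pointers in the edge log.
    rest = degree_t - len(result)
    if rest > 0 and el is not None:
        fifo = deque(maxlen=rest)       # keeps only the last `rest` items pushed
        idx = el                        # el points at the newest log entry
        while idx is not None:
            _src, dst, back = edge_log[idx]
            fifo.append(dst)            # newer edges fall out as older ones arrive
            idx = back
        result.extend(reversed(fifo))   # restore chronological order
    return result
```

Edges appended to the log after the snapshot are the first ones pushed into the buffer and therefore the first ones evicted, so the task never observes them.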

3.1.4. Crash Consistent PMA Rebalancing

Thus far, we have discussed how DGAP handles point insertions without initiating rebalancing operations. In PMA, however, rebalancing is crucial when the insertion of new edges makes sections of the edge array overly dense. These rebalancing operations redistribute gaps among neighboring sections to alleviate the density issue. Given that rebalancing involves a considerable amount of data movement, it necessitates crash-consistent transactions. Yet, standard PMDK transactions have proven to be overly expensive. As a solution, DGAP introduces per-thread undo logs to achieve more efficient, crash-consistent rebalancing.

Figure 4. Crash consistent PMA rebalancing in DGAP. In (a), the blue area shows the intended data movement region; the green dashed boxes show the expected state after the data movement. In (b), a crash case is shown after moving data 3.

For every Writer Thread in DGAP, a per-thread undo log is allocated on PMs to support the execution of triggered rebalancing. Fig. 4 illustrates the rebalancing process in detail. Once DGAP determines a valid rebalancing range based on density thresholds, it recalculates the location of each vertex within these sections, assuming the gaps will be redistributed evenly. For instance, in Fig. 4(a), the new location of vertex $v_8$ and its edges is represented by dashed boxes above the edge array. During rebalancing, all vertices and their edges must be moved to their new locations. To prevent permanent data loss in the event of a crash, this relocation process must be safeguarded using a transaction mechanism.

To perform data movements in a crash-consistent manner, DGAP first backs up the data that may be overwritten during data movements in the undo log. It then calls CLWB and SFENCE to ensure that the data is persisted before proceeding with the actual data movement on the edge array. If a crash occurs before the backup of data on the undo log is completed, the data on the original edge array remains unaffected, as no data movement has occurred yet. After the backup, DGAP initiates the process of moving and overwriting data element by element. DGAP iteratively performs these steps until the entire rebalancing range is moved. In each step, it moves a maximum of ULOG_SZ=2K bytes of data.

Figure 4(b) illustrates a crash scenario during rebalancing. In this instance, DGAP has already backed up the moving data in the undo log and is beginning to shift all edges of $v_8$ one element to the right. Suppose that after the edge (8, 3) has been moved, a crash occurs, resulting in an inconsistent edge array due to the presence of two edges (8, 3). However, a consistent backup of this region is available in the persistent undo log. Upon restart, DGAP recognizes the crash by checking its NORMAL_SHUTDOWN flag. It then iterates through all per-thread undo logs and utilizes the backup data to overwrite the inconsistent regions. The idx index, stored at the beginning of the per-thread undo log, is used to determine which part of the edge array should be overwritten for recovery. After restoring the data, DGAP proceeds to reissue the rebalancing operation to complete the interrupted process.
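The back-up, move, and recover cycle can be mimicked in a few lines. The sketch below is a simulation under stated assumptions rather than DGAP's implementation: Python lists stand in for PM regions, a raised `Crash` exception stands in for a power failure, and the `valid` field is a hypothetical marker telling recovery whether the backup is live (the text only states that `idx` is stored at the beginning of the log). Chunking into `ULOG_SZ`-sized steps is omitted.

```python
class Crash(Exception):
    """Simulated power failure."""


def shift_right(edge_array, undo_log, lo, hi, crash_after=None):
    """Shift edge_array[lo:hi] one slot to the right; slot `hi` must be a gap."""
    # 1) Back up every slot that may be overwritten. On real PM this copy
    #    would be followed by CLWB + SFENCE before any data is moved.
    undo_log["idx"] = lo
    undo_log["data"] = list(edge_array[lo:hi + 1])
    undo_log["valid"] = True          # hypothetical flag, not from the paper
    # 2) Move element by element, starting from the right end.
    moved = 0
    for i in range(hi, lo, -1):
        edge_array[i] = edge_array[i - 1]
        moved += 1
        if crash_after is not None and moved == crash_after:
            raise Crash()
    edge_array[lo] = None
    # 3) Movement finished; the backup is no longer needed.
    undo_log["valid"] = False


def recover(edge_array, undo_log):
    """Restore the backed-up region if a movement was interrupted."""
    if undo_log.get("valid"):
        lo, data = undo_log["idx"], undo_log["data"]
        edge_array[lo:lo + len(data)] = data
        undo_log["valid"] = False
        return True
    return False
```

Crashing after the first move reproduces the duplicate (8, 3) of Fig. 4(b); `recover` then restores the original region, after which the shift can be reissued.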

3.1.5. Shutdown and Crash Recovery

DGAP can initiate a graceful shutdown by calling g.shutdown(). During a normal shutdown, DGAP first waits for all ongoing graph analytic tasks to complete. Subsequently, it persists all DRAM components to persistent memory (PM), including the vertex array and PMA-related metadata. While this backup process may require a few seconds, it ensures a quicker subsequent startup. Detailed normal shutdown times are measured and presented in the evaluation section. Before shutting down, DGAP resets the NORMAL_SHUTDOWN flag to indicate a graceful shutdown. After rebooting, DGAP first checks the NORMAL_SHUTDOWN flag to determine whether the previous shutdown was normal or due to a crash. If the flag indicates a normal shutdown, DGAP simply loads the vertex array and PMA-related metadata to DRAM and starts operating. If this is a reboot after a crash, DGAP initiates a data recovery process. Initially, DGAP scans the edge array to reconstruct the vertex array and build PMA metadata, such as the density tree. Following that, DGAP examines all per-thread undo logs and repairs the inconsistencies resulting from crashed rebalancing operations. It then finishes the interrupted rebalancing, starting from the inconsistent region. Next, DGAP checks the per-section edge log to retrieve the metadata for the vertices stored there and updates the vertex array. After all these steps, DGAP can start to operate normally. In the evaluation section, we present the time durations associated with both standard and crash reboots.
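As a rough illustration of the first and last recovery steps (scanning the edge array, then consulting the edge log), the sketch below rebuilds per-vertex metadata from a simplified layout. The function name, the `(src, dst)` slot format, and the dictionary fields are assumptions made for this example; the undo-log step and the PMA density tree are left out, and vertices without any edge are not discovered in this simplified model.

```python
def rebuild_vertex_array(edge_array, edge_log):
    """Reconstruct start index, degree, and edge-log pointer for each vertex."""
    vertices = {}
    # Step 1: scan the edge array; the first slot seen for a vertex is its start.
    for pos, slot in enumerate(edge_array):
        if slot is None:                      # gap
            continue
        src, _dst = slot
        entry = vertices.setdefault(src, {"start": pos, "degree": 0, "el": None})
        entry["degree"] += 1
    # Step 2: scan the edge log; the last entry seen for a vertex is its newest.
    for idx, (src, _dst, _back) in enumerate(edge_log):
        entry = vertices.setdefault(src, {"start": None, "degree": 0, "el": None})
        entry["degree"] += 1
        entry["el"] = idx
    return vertices
```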

Table 1. A list of graph kernels and inputs and outputs used in our evaluations.

| Graph kernel | Kernel Type | Input | Output | Notes |
|---|---|---|---|---|
| PageRank (PR) | Link Analysis | - | $\lvert V \rvert$-sized array of ranks | Fixed number (20) of iterations |
| Breadth-First Search (BFS) | Graph Traversal | Source vertex | $\lvert V \rvert$-sized array of parent IDs | Direction-Optimizing approach (Beamer et al., 2012) |
| Betweenness Centrality (BC) | Shortest Path | Source vertex | $\lvert V \rvert$-sized array of centrality scores | Brandes approx. algorithm (Brandes, 2001; Madduri et al., 2009) |
| Connected Components (CC) | Connectivity | - | $\lvert V \rvert$-sized array of component labels | Shiloach-Vishkin (Shiloach and Vishkin, 1980; Bader et al., 2005) |

Table 2. Graph inputs and their key properties.

| Datasets | Domain | $\lvert V \rvert$ | $\lvert E \rvert$ | $\lvert E \rvert / \lvert V \rvert$ |
|---|---|---|---|---|
| Orkut | social | 3,072,626 | 234,370,166 | 76 |
| LiveJournal | social | 4,847,570 | 85,702,474 | 18 |
| CitPatents | citation | 6,009,554 | 33,037,894 | 6 |
| Twitter | social | 61,578,414 | 2,405,026,390 | 39 |
| Friendster | social | 124,836,179 | 3,612,134,270 | 29 |
| Protein | biology | 8,745,543 | 1,309,240,502 | 149 |

3.1.6. Concurrency Control

DGAP supports multi-threaded graph updates (multiple Writer Threads) and graph analysis (multiple Analysis Tasks) on PMs. To optimize performance, DGAP implements an optimistic read/write lock to enable multiple readers and writers to run concurrently, as long as they do not write to the same section. For each PMA section, DGAP maintains a lock and its linked condition variable, resulting in $|log(v)|$ locks. When inserting an edge ($v_{src}, v_{dst}$), DGAP first needs to acquire the lock for the respective section of $v_{src}$ so that no other threads can insert into the same section. This also prevents concurrent readers. After the insertion, DGAP checks whether the density of $Section_{v_{src}}$ has reached the rebalancing threshold. If rebalancing is needed, the writer thread first sets the condition variable of $Section_{v_{src}}$ to block other writes or rebalancing operations in this section. It then attempts to acquire all the locks of the sections affected by the rebalancing, sequentially. To prevent deadlocks, DGAP follows a strict order (from low to high section IDs) when acquiring locks. After obtaining all the locks, DGAP executes the rebalancing as previously described. Finally, DGAP resets the condition variable and notifies all waiting writes or rebalancing operations to start. Note that DGAP stores all the locks in DRAM instead of PMs to increase performance. If a crash occurs, all the locks are lost. The pending rebalancing operation will be recovered by checking the per-thread undo log. The pending edge writes will be ignored, as they have not yet been returned successfully to users.
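The deadlock-avoidance rule (always acquire section locks from low to high section ID) can be sketched with Python's `threading` module. The `SectionLocks` class and its method names are hypothetical, and the sketch deliberately leaves out the condition variable and the optimistic read path; it only shows why a fixed global acquisition order lets overlapping rebalancing ranges proceed without deadlock.

```python
import threading


class SectionLocks:
    """One lock per PMA section, kept in DRAM (here: ordinary Python objects)."""

    def __init__(self, num_sections):
        self.locks = [threading.Lock() for _ in range(num_sections)]

    def acquire_range(self, section_ids):
        # Sort so that every thread requests locks in the same global order.
        ordered = sorted(set(section_ids))
        for sid in ordered:
            self.locks[sid].acquire()
        return ordered

    def release_range(self, ordered):
        for sid in reversed(ordered):
            self.locks[sid].release()
```

Two writers whose rebalancing ranges overlap (say sections 0 to 3 and 2 to 5) both request section 2 before section 3, so neither can end up holding a lock the other needs while waiting for a lower-numbered one.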

4. Evaluation

We developed DGAP using the PMDK library (pmem.io Persistent, 2019). Its core data structure consists of approximately 2,000 lines of C++ code. The code is publicly available on GitHub (https://github.com/DIR-LAB/DGAP). In this section, we compare DGAP with other graph analysis frameworks on real-world graphs with synthetic graph insertion patterns. The results reported are the averages of five runs.

4.1. Evaluation Setup

Evaluation Platform. We conducted all evaluations on a Dell R740 rack server equipped with a 2nd generation Intel Xeon Scalable Processor (Gold 6254 @ 3.10 GHz) featuring 18 physical cores. The server also included 6 DRAM DIMMs with 32 GB each (for a total of 192 GB) and 6 Optane DC DIMMs with 128 GB each (for a total of 768 GB). We configured Optane DC in App Direct mode. The system ran Ubuntu 20.04 and used the Linux kernel version 4.15.0. Our implementation is based on PMDK 1.12.

Graph Algorithms. To ensure a fair comparison among various graph analysis frameworks, we used the same implementations of four graph algorithms from the GAP Benchmark Suite (GAPBS) (Beamer, 2015). These algorithms are PageRank (PR), Breadth-First Search (BFS), Betweenness Centrality (BC), and Connected Components (CC), detailed in Table 1. GAPBS also offers an optimized Compressed Sparse Row (CSR) implementation, which we modified for persistent memory to serve as one of our evaluation baselines.

Graph Datasets. We used several real-world graphs from SNAP datasets (Leskovec and Krevl, 2014) in our evaluations. Table 2 lists these graphs and their key properties. We generate the insertion order by randomly shuffling all the edges of these datasets. Note that, in all the experiments, we first insert the first 10% of the graph to warm up the system and only then start to benchmark the insertion performance, similar to the warm-up stage in YCSB (Cooper et al., 2010).
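A workload of this shape could be produced as in the sketch below. The function name, the fixed seed, and the integer `warmup_percent` parameter are assumptions for illustration; the paper does not describe its shuffling tool or seed.

```python
import random


def make_workload(edges, warmup_percent=10, seed=0):
    """Shuffle the edges and split them into a warm-up part and a timed part."""
    order = list(edges)
    random.Random(seed).shuffle(order)          # seeded, so runs are repeatable
    cut = len(order) * warmup_percent // 100    # integer arithmetic, no rounding drift
    return order[:cut], order[cut:]
```

Only insertions from the second list would count toward the reported throughput.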

Compared Systems. To showcase the performance of DGAP, we compare it with multiple data structures and state-of-the-art dynamic graph frameworks.

First, we ported two foundational graph data structures to persistent memory to serve as baselines. The Compressed Sparse Row (CSR) on persistent memory is based on GAPBS. CSR serves as a baseline for graph analysis evaluations since 1) it cannot be updated and 2) it offers the optimal graph analysis performance due to its compact memory layout. We also implemented Blocked Adjacency-List (BAL) on persistent memory as another extreme baseline. BAL is known to have poor graph analysis performance due to pointer chasing and excellent edge insertion performance due to efficient appending to a block. We use BAL as a baseline to understand the insertion performance of DGAP.

We further compared DGAP with three state-of-the-art dynamic graph frameworks designed to support graph updates and analysis. LLAMA uses a multi-versioned CSR structure to enable fast graph analysis and graph mutations (Macko et al., 2015). In LLAMA, graph updates are conducted in batches and organized as multiple immutable snapshots. To avoid creating too many snapshots, in our evaluation, we only created a snapshot after inserting 1% of the graph, which ranges from 330K edges to 36M edges, depending on the chosen graph dataset. In total, we created 90 snapshots for each graph (the first 10% warm-up is a single snapshot). Because graph analysis in LLAMA cannot read the latest graph until a snapshot is created, these large snapshots mean its graph analysis tasks may miss as many as 36 million edges, which might not be acceptable in some applications. We ported LLAMA to persistent memory by changing the location of its snapshot files to the PM space, which represents a naive way of moving an existing graph data structure to persistent memory.
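The quoted snapshot sizes follow directly from the edge counts in Table 2. The short calculation below takes 1% of each $|E|$; the dictionary and function names are only for illustration.

```python
# Edge counts |E| as listed in Table 2.
NUM_EDGES = {
    "Orkut": 234_370_166,
    "LiveJournal": 85_702_474,
    "CitPatents": 33_037_894,
    "Twitter": 2_405_026_390,
    "Friendster": 3_612_134_270,
    "Protein": 1_309_240_502,
}


def snapshot_size(num_edges, percent=1):
    """Number of edges batched into one LLAMA snapshot in this setup."""
    return num_edges * percent // 100


sizes = {name: snapshot_size(e) for name, e in NUM_EDGES.items()}
smallest = min(sizes, key=sizes.get)   # dataset with the smallest snapshot
largest = max(sizes, key=sizes.get)    # dataset with the largest snapshot
```

The smallest snapshot (CitPatents) holds about 330K edges and the largest (Friendster) about 36M, matching the range stated above.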

GraphOne is an in-memory graph analysis framework with an extra durability guarantee using external non-volatile devices (Kumar and Huang, 2019). New data is first stored in an in-DRAM edge list in an append-only manner. Background threads incrementally move this data to non-volatile memory for persistence. To port GraphOne to persistent memory, we changed the location of the durable phase to the PM space and required it to flush DRAM data after every $2^{16}$ insertions to reduce the chances of losing data. We do not limit the DRAM usage of GraphOne during graph analysis. Hence, for some graphs, the graph data may be completely cached in DRAM. Due to these settings, we name this baseline GraphOne-FD, indicating GraphOne Flushing-DRAM, in the rest of the paper.

XPGraph is a state-of-the-art PM-based dynamic graph system (Wang et al., 2022). It is based on GraphOne but extends it with new designs for persistent memory. Specifically, XPGraph stores both the edge list and adjacency list in persistent memory to guarantee data persistence and leverages the DRAM as a cache to batch data into the adjacency list. Similar to GraphOne, XPGraph also transfers data to DRAM for graph analysis. In our evaluations, we used the default parameter settings of XPGraph for comparisons. One exception is that we used a lower archiving threshold to record the graph insert performance of XPGraph. Here, the archiving threshold indicates the batch size of edges to move from the edge list to the adjacency list. In theory, XPGraph is supposed to perform both operations (adding new edges to the edge list and moving the edges to the adjacency list) concurrently, which would incur higher overhead due to cache line interference. However, in its current prototype implementation (Wang, 2022), XPGraph performs these operations separately and avoids the overhead. The insert performance of XPGraph thus solely depends on the archiving threshold, as shown in Fig. 5. For a fair comparison, we picked a threshold of $2^{10}$ in our evaluations, which still means XPGraph may delay the real-time analysis by $2^{10}$ edges.

Figure 5. Graph insertion throughput of XPGraph (in MEPS) for varying Archiving Threshold. Higher value is better.

4.2. Graph Insertions Performance

We first compared the graph update performance, particularly the edge insertion performance, of DGAP with that of other systems. Fig. 6 shows the graph insertion throughput in MEPS (Million Edges Per Second) using a single writer thread. The scalability results are reported later. From these results, we can observe that DGAP achieves almost the best performance across all datasets among all the frameworks. It delivers $1.03\times$ to $2.82\times$ better performance than BAL, which is considered extremely efficient in graph insertions as edges are simply appended to the end of each block. However, BAL's inefficient usage of persistent memory (e.g., journaling and transactions for crash consistency) makes it slower in many cases. DGAP also outperforms LLAMA, GraphOne, and XPGraph on persistent memory by up to $6\times$, $2.5\times$, and $2.3\times$, respectively. The costs of asynchronous batch data structure conversions and movements between DRAM and PMs in LLAMA, GraphOne, and XPGraph evidently impact the performance significantly. It is worth noting that, from the results, XPGraph performs better than GraphOne, but not by as large a margin as the original paper reports (Wang et al., 2022). This is because our GraphOne-FD has a large batch write size in DRAM, which offers better performance but is impractical as this data may be lost. Still, the better performance of DGAP clearly showcases the efficiency of the mutable CSR data structure on persistent memory.

Figure 6. Dynamic graph insertion throughput in million edges per second (MEPS). Higher value is better.
Table 3. Graph insertion throughput (MEPS) using different numbers of writer threads ($T_1$, $T_8$, $T_{16}$). Larger throughput is better.

| Graph | $T_1$ DGAP | $T_1$ BAL | $T_1$ LLAMA | $T_1$ GO-FD | $T_1$ XPGrp. | $T_8$ DGAP | $T_8$ BAL | $T_8$ LLAMA | $T_8$ GO-FD | $T_8$ XPGrp. | $T_{16}$ DGAP | $T_{16}$ BAL | $T_{16}$ LLAMA | $T_{16}$ GO-FD | $T_{16}$ XPGrp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Orkut | 2.524614111 | 2.34996497 | 1.840474923 | 1.23074542 | 1.863023076 | 6.486952403 | 5.97307019 | 2.331912553 | 2.53638064 | 4.949091481 | 7.37293903 | 5.261014687 | 2.395361633 | 2.862305266 | 5.436309287 |
| LiveJournal | 2.593619269 | 1.264353402 | 0.9730563027 | 1.227506247 | 1.731583494 | 6.265831328 | 4.786963736 | 1.073308291 | 2.63362727 | 4.917770254 | 7.953504022 | 5.920610006 | 1.086951143 | 2.936533122 | 5.659880334 |
| CitPatents | 2.426446905 | 0.8531981547 | 0.3976955434 | 1.215493667 | 1.481247573 | 6.817161146 | 3.451580363 | 0.41229797 | 2.622529205 | 5.045316673 | 7.233994594 | 4.680063525 | 0.4158325292 | 2.807454827 | 5.753064608 |
| Twitter | 1.858658294 | 2.017550258 | 1.605122506 | 0.7251165802 | 1.986590775 | 5.3489936 | 5.507841176 | 2.129158429 | 1.992767149 | 4.881072383 | 6.821753892 | 5.987000648 | 2.171520389 | 2.433096893 | 5.326770861 |
| Friendster | 1.920984709 | 1.818648647 | 1.23099628 | 0.5675252507 | 1.60123691 | 4.291106517 | 5.626404144 | 1.520849545 | 2.404349414 | 4.40805217 | 6.025289488 | 5.820030886 | 1.533593566 | 3.345911263 | 4.997107637 |
| Protein | 2.193612403 | 2.305273709 | 2.117883855 | 1.01926962 | 1.819069211 | 7.429157925 | 5.822706911 | 3.086205775 | 3.214630692 | 5.082988132 | 8.298061613 | 6.225190198 | 3.183228112 | 4.0782497 | 5.759612263 |

4.2.1. Graph Insertions Scalability

We further evaluated the graph insertion scalability by increasing the number of concurrent writer threads from 1 to 16. Table 3 shows the MEPS throughput of 1, 8, and 16 threads. We can see that DGAP scales with more threads. It delivers up to $4.3\times$ throughput with 16 threads compared with the single-thread case. The concurrency model and write optimizations implemented in DGAP help deliver such a scalable graph insertion performance (6 to 8 million edges/sec), which might be needed in many real-time big data applications.

Across various systems, DGAP consistently ranks as either the best or very close to the best in all scalability cases. BAL occasionally delivers superior performance, primarily due to our implementation of BAL utilizing finer-grained locks for concurrent insertions. Specifically, while DGAP locks writers by edge section, BAL employs vertex-based locking. Consequently, as the number of threads increases, its performance scales more effectively. However, this may not be a realistic representation, as an excessive number of locks are needed. The scalability results of XPGraph are also noteworthy, as it surpasses DGAP in the 16-thread case for three graphs. In fact, these three graphs are all relatively small. We attribute the exceptional performance to XPGraph's design. Specifically, XPGraph includes a circular edge log for temporarily storing new insertions. By default, the circular edge log has a capacity of 8GB, which can entirely accommodate the three smaller graphs: Orkut, LiveJournal, and CitPatents. In this context, archiving is not activated for these graphs, resulting in XPGraph exhibiting exceptional performance. For larger graphs with over a billion edges, DGAP demonstrates 12-21% better performance, as XPGraph is compelled to flush the DRAM caches back to the persistent edge list more frequently.

4.3. Graph Analysis Performance

Figure 7. Time to run PageRank (PR) and Connected Components (CC), normalized to CSR on PMM. Smaller is better.

Graph analysis performance is key to our graph frameworks. In this section, we show the performance of running four classic graph algorithms (listed in Table 1) on different graphs. Among these four algorithms, PageRank (PR) and Connected Components (CC) access all vertices in each iteration, while Breadth-First Search (BFS) and Betweenness Centrality (BC) access parts of the graphs each time based on the calculation. They show different access patterns which may impact the performance of the frameworks, as shown below.

1) PageRank (PR) and Connected Components (CC). Fig. 7 illustrates the relative speed of PageRank compared to CSR using a single thread. Compared with CSR, which is best for graph analysis, DGAP introduces only 37% overhead on average and achieves up to $2.9\times$, $2.9\times$, $1.4\times$, and $3.1\times$ better performance compared to BAL, LLAMA, GraphOne, and XPGraph, respectively. It is particularly interesting to observe that DGAP outperforms GraphOne-FD on most datasets, even though GraphOne-FD is actually running on DRAM-cached data. We believe this is because GraphOne uses an adjacency list as its in-memory data structure, which is less efficient for graph analysis tasks that apply to all vertices and edges of the graph. In contrast, since DGAP is a mutable CSR, it shows much better cache locality when running algorithms such as PageRank. We can observe the same behavior when running a similar algorithm, CC, which iterates over all vertices and edges in each iteration. Specifically, Fig. 7 illustrates the relative speed of CC compared to CSR on all the systems. Again, DGAP shows up to $2.9\times$, $2.7\times$, $1.6\times$, and $2.4\times$ better performance than BAL, LLAMA, GraphOne, and XPGraph, respectively.
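To make the normalization concrete, the sketch below divides DGAP's single-thread PageRank time by CSR's for four of the graphs, using the $T_1$ values reported in Table 4. It assumes the normalized bars of Fig. 7 are derived from the same measurements; the variable names are illustrative, and the four-graph mean computed here is not the six-graph average quoted above.

```python
# Single-thread PageRank times in seconds (CSR, DGAP), taken from Table 4.
PR_T1 = {
    "Orkut": (24.17801, 31.55084),
    "LiveJournal": (9.0714, 12.46027),
    "CitPatents": (5.82708, 8.17053),
    "Twitter": (425.1114, 545.92395),
}

# Normalized time: values above 1.0 mean DGAP is slower than static CSR.
normalized = {graph: dgap / csr for graph, (csr, dgap) in PR_T1.items()}

# Mean relative overhead over these four graphs only.
mean_overhead = sum(normalized.values()) / len(normalized) - 1.0
```

For these four graphs the ratios all fall between roughly 1.28 and 1.41, in line with the moderate overhead described in the text.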

Figure 8. Time to run Breadth-First Search (BFS) and Betweenness Centrality (BC), normalized to CSR on PMM. Smaller is better.
Table 4. The execution time (in seconds) of four algorithms on all systems. $T_1$ denotes the time of one thread and $T_{16}$ denotes that of 16 threads. PR stands for PageRank.

| Graph | PR CSR $T_1$ | PR CSR $T_{16}$ | PR DGAP $T_1$ | PR DGAP $T_{16}$ | PR BAL $T_1$ | PR BAL $T_{16}$ | PR LLAMA $T_1$ | PR LLAMA $T_{16}$ | PR GraphOne $T_1$ | PR GraphOne $T_{16}$ | PR XPGraph $T_1$ | PR XPGraph $T_{16}$ | BFS CSR $T_1$ | BFS CSR $T_{16}$ | BFS DGAP $T_1$ | BFS DGAP $T_{16}$ | BFS BAL $T_1$ | BFS BAL $T_{16}$ | BFS LLAMA $T_1$ | BFS LLAMA $T_{16}$ | BFS GraphOne $T_1$ | BFS GraphOne $T_{16}$ | BFS XPGraph $T_1$ | BFS XPGraph $T_{16}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Orkut | 24.17801 | 1.66891 | 31.55084 | 2.20741 | 53.21021 | 3.56587 | 50.241 | 9.514 | 36.0078 | 2.62825 | 49.86686 | 3.72478 | 0.33473 | 0.03006 | 0.45669 | 0.03858 | 0.74455 | 0.05673 | 1.444 | 0.331 | 0.119332 | 0.012429 | 0.2528406 | 0.02819198 |
| LiveJournal | 9.0714 | 0.70642 | 12.46027 | 0.93678 | 32.11773 | 2.30393 | 32.685 | 5.124 | 17.1449 | 1.23757 | 36.45106 | 3.035008 | 0.34283 | 0.03262 | 0.43279 | 0.03734 | 1.25658 | 0.10219 | 1.934 | 0.504 | 0.197539 | 0.025619 | 0.419764 | 0.04820182 |
| CitPatents | 5.82708 | 0.4867 | 8.17053 | 0.63161 | 23.47029 | 1.72618 | 23.304 | 2.826 | 9.75156 | 0.703277 | 25.21416 | 2.377254 | 0.4688 | 0.04374 | 0.57417 | 0.04716 | 1.83742 | 0.13827 | 3.464 | 0.675 | 0.194491 | 0.025049 | 0.3470958 | 0.05607478 |
| Twitter | 425.1114 | 31.58598 | 545.92395 | 39.3025 | 828.06803 | 56.6687 | 712.729 | 99.829 | 775.83 | 45.0997 | 1032.062 | 77.988 | 7.91179 | 0.70703 | 10.08591 | 0.73936 | 19.71679 | 1.46779 | 32.497 | 6.649 | 3.57858 | 0.334013 | 5.54679 | 0.713467 |
Friendster 873.376 41873.37641873.376\,41873.376 41 65.405 8365.4058365.405\,8365.405 83 1131.836 711131.836711131.836\,711131.836 71 80.838480.838480.838480.8384 1394.05071394.05071394.05071394.0507 97.704 1697.7041697.704\,1697.704 16 1353.5741353.5741353.5741353.574 186.814186.814186.814186.814 1515.381515.381515.381515.38 85.767185.767185.767185.7671 1922.2621922.2621922.2621922.262 142.493142.493142.493142.493 14.767214.767214.767214.7672 1.124 911.124911.124\,911.124 91 16.095 1416.0951416.095\,1416.095 14 1.188 111.188111.188\,111.188 11 16.787 4316.7874316.787\,4316.787 43 1.414 221.414221.414\,221.414 22 50.23150.23150.23150.231 13.5413.5413.5413.54 6.922 926.922926.922\,926.922 92 0.501 6370.5016370.501\,6370.501 637 10.409810.409810.409810.4098 1.071 271.071271.071\,271.071 27
Protein 203.484 26203.48426203.484\,26203.484 26 13.217 4213.2174213.217\,4213.217 42 274.910 33274.91033274.910\,33274.910 33 16.850 4916.8504916.850\,4916.850 49 316.645 91316.64591316.645\,91316.645 91 20.425 2220.4252220.425\,2220.425 22 264.232264.232264.232264.232 34.58634.58634.58634.586 336.888336.888336.888336.888 20.605320.605320.605320.6053 372.1148372.1148372.1148372.1148 27.963727.963727.963727.9637 0.896 420.896420.896\,420.896 42 0.083 150.083150.083\,150.083 15 0.821 430.821430.821\,430.821 43 0.080 030.080030.080\,030.080 03 0.967 260.967260.967\,260.967 26 0.098 960.098960.098\,960.098 96 12.50512.50512.50512.505 1.2661.2661.2661.266 0.495 8980.4958980.495\,8980.495 898 0.044 535 90.04453590.044\,535\,90.044 535 9 0.715 364 40.71536440.715\,364\,40.715 364 4 0.086 937 240.086937240.086\,937\,240.086 937 24
BC CC
CSR DGAP BAL LLAMA GraphOne XPGraph CSR DGAP BAL LLAMA GraphOne XPGraph
Graph T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT T16subscript𝑇16T_{16}italic_T start_POSTSUBSCRIPT 16 end_POSTSUBSCRIPT
Orkut 5.215 415.215415.215\,415.215 41 0.424 250.424250.424\,250.424 25 5.396 685.396685.396\,685.396 68 0.417 320.417320.417\,320.417 32 6.103 346.103346.103\,346.103 34 0.457 670.457670.457\,670.457 67 79.06779.06779.06779.067 5.7145.7145.7145.714 7.976 087.976087.976\,087.976 08 0.581 5360.5815360.581\,5360.581 536 8.008 6888.0086888.008\,6888.008 688 0.805 986 60.80598660.805\,986\,60.805 986 6 2.603 562.603562.603\,562.603 56 0.41920.41920.41920.4192 3.451 883.451883.451\,883.451 88 0.72720.72720.72720.7272 5.712 255.712255.712\,255.712 25 0.884 950.884950.884\,950.884 95 5.945.945.945.94 0.8650.8650.8650.865 4.081 324.081324.081\,324.081 32 0.751 0850.7510850.751\,0850.751 085 4.7664.7664.7664.766 0.710 664 60.71066460.710\,664\,60.710 664 6
LiveJournal 4.373 684.373684.373\,684.373 68 0.331 030.331030.331\,030.331 03 4.231 434.231434.231\,434.231 43 0.318 240.318240.318\,240.318 24 4.906 694.906694.906\,694.906 69 0.360 210.360210.360\,210.360 21 39.71939.71939.71939.719 2.7592.7592.7592.759 5.056 965.056965.056\,965.056 96 0.358 3060.3583060.358\,3060.358 306 6.619 3626.6193626.619\,3626.619 362 0.609 3390.6093390.609\,3390.609 339 0.99260.99260.99260.9926 0.421 970.421970.421\,970.421 97 1.398 991.398991.398\,991.398 99 0.799 120.799120.799\,120.799 12 3.399 293.399293.399\,293.399 29 0.87140.87140.87140.8714 3.7643.7643.7643.764 1.171.171.171.17 2.164 982.164982.164\,982.164 98 0.749 8490.7498490.749\,8490.749 849 3.196 4883.1964883.196\,4883.196 488 1.026 5911.0265911.026\,5911.026 591
CitPatents 3.903 323.903323.903\,323.903 32 0.294 270.294270.294\,270.294 27 3.485 823.485823.485\,823.485 82 0.260 860.260860.260\,860.260 86 3.707 863.707863.707\,863.707 86 0.272 330.272330.272\,330.272 33 24.72324.72324.72324.723 1.6981.6981.6981.698 3.535 343.535343.535\,343.535 34 0.263 1580.2631580.263\,1580.263 158 5.153 8325.1538325.153\,8325.153 832 0.465 085 40.46508540.465\,085\,40.465 085 4 1.669 571.669571.669\,571.669 57 0.484 690.484690.484\,690.484 69 2.342 242.342242.342\,242.342 24 0.487 640.487640.487\,640.487 64 6.683 976.683976.683\,976.683 97 1.426 821.426821.426\,821.426 82 5.2965.2965.2965.296 2.0652.0652.0652.065 3.28413.28413.28413.2841 0.809 2730.8092730.809\,2730.809 273 5.541 8685.5418685.541\,8685.541 868 1.677 7521.6777521.677\,7521.677 752
Twitter 106.098 64106.09864106.098\,64106.098 64 7.832 377.832377.832\,377.832 37 122.387 52122.38752122.387\,52122.387 52 7.860 737.860737.860\,737.860 73 117.088 19117.08819117.088\,19117.088 19 8.413 578.413578.413\,578.413 57 717.385717.385717.385717.385 48.53748.53748.53748.537 141.165141.165141.165141.165 9.126 269.126269.126\,269.126 26 190.2196190.2196190.2196190.2196 15.814115.814115.814115.8141 71.526 7371.5267371.526\,7371.526 73 16.452 0316.4520316.452\,0316.452 03 88.761 0988.7610988.761\,0988.761 09 23.476 0923.4760923.476\,0923.476 09 134.423 66134.42366134.423\,66134.423 66 28.679 9428.6799428.679\,9428.679 94 121.056121.056121.056121.056 25.19925.19925.19925.199 126.663126.663126.663126.663 24.803924.803924.803924.8039 139.904139.904139.904139.904 30.889230.889230.889230.8892
Friendster 203.634 07203.63407203.634\,07203.634 07 14.698 6314.6986314.698\,6314.698 63 209.340 11209.34011209.340\,11209.340 11 14.468 9714.4689714.468\,9714.468 97 216.9206216.9206216.9206216.9206 15.147 5315.1475315.147\,5315.147 53 1568.581568.581568.581568.58 105.367105.367105.367105.367 287.513287.513287.513287.513 17.594917.594917.594917.5949 372.874372.874372.874372.874 28.95428.95428.95428.954 155.400 58155.40058155.400\,58155.400 58 23.720 4623.7204623.720\,4623.720 46 192.705 32192.70532192.705\,32192.705 32 36.409 4936.4094936.409\,4936.409 49 229.481 63229.48163229.481\,63229.481 63 33.450 3333.4503333.450\,3333.450 33 260.483260.483260.483260.483 42.37442.37442.37442.374 269.541269.541269.541269.541 37.794237.794237.794237.7942 284.5334284.5334284.5334284.5334 44.652444.652444.652444.6524
Protein 2.012 672.012672.012\,672.012 67 0.312 970.312970.312\,970.312 97 2.093 212.093212.093\,212.093 21 0.265 850.265850.265\,850.265 85 2.140 012.140012.140\,012.140 01 0.265 460.265460.265\,460.265 46 24.41824.41824.41824.418 1.8611.8611.8611.861 5.432 755.432755.432\,755.432 75 0.448 2260.4482260.448\,2260.448 226 3.879 6883.8796883.879\,6883.879 688 0.472 966 60.47296660.472\,966\,60.472 966 6 66.502 6766.5026766.502\,6766.502 67 4.517 014.517014.517\,014.517 01 84.515 4384.5154384.515\,4384.515 43 6.743 236.743236.743\,236.743 23 102.862 98102.86298102.862\,98102.862 98 6.73856.73856.73856.7385 106.005106.005106.005106.005 11.18611.18611.18611.186 112.012112.012112.012112.012 9.67059.67059.67059.6705 113.2578113.2578113.2578113.2578 11.932 6611.9326611.932\,6611.932 66

2) Breadth-First Search (BFS) and Betweenness Centrality (BC). Fig. 8 shows the relative speed of Breadth-First Search and Betweenness Centrality compared to CSR. For BFS, DGAP outperforms BAL and LLAMA by $2.30\times$ and $3.71\times$ on average, respectively. However, DGAP performs $2.77\times$ and $1.81\times$ worse than GraphOne and XPGraph in this particular workload. This is expected, since BFS accesses the edges of effectively random vertices at each step. The adjacency lists in GraphOne and XPGraph perform very well for such accesses, whereas CSR cannot fully leverage its spatial locality. In addition, since most BFS runs reach only a small part of the graph, GraphOne and XPGraph can successfully cache the graph in DRAM. We observe similar trends for Betweenness Centrality (BC), as Fig. 8 shows. Because BC is more computationally and memory intensive and covers larger parts of the graph during computation, DGAP catches up and delivers performance similar to the DRAM-based GraphOne and XPGraph. Specifically, DGAP outperforms BAL, LLAMA, GraphOne, and XPGraph by up to $1.08\times$, $8.19\times$, $1.21\times$, and $1.85\times$, respectively.
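
The BFS access pattern discussed above can be illustrated with a minimal sketch. It is a textbook queue-based BFS over a plain CSR layout, not the direction-optimizing BFS from the benchmark suite and not DGAP's mutable CSR; the names `offsets`, `neighbors`, and `bfs_csr` are hypothetical. Each dequeued vertex reads one contiguous slice of the edge array, but consecutive vertices in the queue point to unrelated slices, which is why the sequential layout helps less here than in a full scan such as PageRank.

```python
from collections import deque


def bfs_csr(offsets, neighbors, source):
    """Return the BFS depth of every vertex (-1 if unreachable)."""
    num_vertices = len(offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        # The neighbors of u form one contiguous slice of the edge array,
        # but the next u taken from the queue may sit anywhere in it.
        for v in neighbors[offsets[u]:offsets[u + 1]]:
            if depth[v] == -1:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth


# Toy directed graph: 0->1, 0->2, 1->3, 2->3, 3->4; vertex 5 is isolated.
offsets = [0, 2, 3, 4, 5, 5, 5]
neighbors = [1, 2, 3, 3, 4]
depths = bfs_csr(offsets, neighbors, 0)
print(depths)
```

On this toy graph the traversal touches only the slices of reachable vertices, which mirrors the observation that most BFS runs reach a small part of the graph.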

4.3.1. Graph Analysis Scalability

To examine the scalability of DGAP, we further ran the same graph algorithms using 1 to 16 threads and report the execution time (in seconds) in Table 4. Due to space limits, we only report the results for 1 thread and 16 threads in each case. From these results, we make several observations. First, DGAP scales well. It delivers up to $14.3\times$, $13.6\times$, $15.6\times$, and $4.7\times$ speedup using 16 threads when running the PageRank, BFS, BC, and CC algorithms, respectively. Notably, DGAP does not scale well in CC; in fact, none of the systems scales well in this algorithm. After checking the source code, we noticed that the bottleneck comes from inappropriate scheduling keywords on its parallel for loops. If fixed, CC would deliver similar scalability for all frameworks. Since our goal is not to improve the algorithm implementation, we report the results from the original GAP Benchmark Suite implementation (Beamer, 2015). Second, DGAP still delivers the best performance in most graph analysis algorithms. Similar to the single-thread case, DGAP performs worse than GraphOne and XPGraph in the BFS case. As discussed earlier, this is mostly because GraphOne and XPGraph run BFS purely in DRAM and their adjacency list structure fits BFS well.
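
The speedup figures quoted above follow directly from the $T_1$ and $T_{16}$ columns of Table 4. The short sketch below recomputes them for four DGAP entries copied from the table; the dictionary layout and the helper name `speedup` are illustrative choices, and parallel efficiency (speedup divided by the thread count) is an added, standard metric rather than one reported in the evaluation.

```python
# (T1, T16) execution times in seconds for DGAP, copied from Table 4.
dgap_times = {
    ("PR", "Orkut"): (31.55084, 2.20741),
    ("BFS", "Twitter"): (10.08591, 0.73936),
    ("BC", "Twitter"): (122.38752, 7.86073),
    ("CC", "Orkut"): (3.45188, 0.7272),
}

THREADS = 16


def speedup(t1, t16):
    """Speedup of the 16-thread run over the single-thread run."""
    return t1 / t16


speedups = {key: speedup(*times) for key, times in dgap_times.items()}
efficiency = {key: s / THREADS for key, s in speedups.items()}

for (algo, graph), s in speedups.items():
    print(f"{algo:>3} on {graph:<8} speedup {s:5.1f}x, "
          f"efficiency {efficiency[(algo, graph)]:.0%}")
```

These four rows reproduce the quoted $14.3\times$, $13.6\times$, $15.6\times$, and $4.7\times$ values after rounding to one decimal place.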

4.4. DGAP Components Evaluations

Table 5. Insertion performance (in seconds) of different DGAPs.
Datasets DGAP No EL No EL&UL No EL&UL&DP
Orkut 83.55065 374.86 383.52 588.37
LiveJournal 29.739225 136.283 146.092 240.463
CitPatents 12.254175 51.261 58.4692 107.388

DGAP Components Evaluations. In DGAP, we introduce three designs to maximize performance on PMs. We further evaluated their contributions to the final performance. Specifically, we implemented and compared three different versions of DGAP by incrementally excluding its key components: (i) removing the per-section edge logs, denoted ‘No EL’; (ii) further removing the per-thread undo log and replacing it with PMDK transactions, denoted ‘No EL&UL’; and (iii) further removing the data placement in DRAM, denoted ‘No EL&UL&DP’, meaning both the vertex array and the edge array are on PMs. The graph insertion performance results are reported in Table 5. We only report the results for small graphs, as we were not able to finish running all the tests on larger graphs in a reasonable time.

The results show that the per-section edge log contributes the most to the performance improvement. Without it, DGAP performs $4.5\times$ worse because of the write amplification caused by nearby shifts. Specifically, with the per-section edge log, DGAP reduces the write amplification by $6\times$ on the Orkut graph. Additionally, the per-thread undo log contributes another 13% performance improvement by reducing the high memory allocation and excessive ordering costs of transactions. Finally, placing the vertex array in PMs would incur about $2\times$ performance overhead. Placing all the remaining metadata (e.g., PMA tree) in PMs would even double the overhead.
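
For readers who want to relate these statements to Table 5, the sketch below derives the slowdown of each ablated variant relative to full DGAP from the reported insertion times. The variable names are illustrative; the numbers are copied from Table 5 and nothing else is assumed.

```python
# Insertion time in seconds, copied from Table 5.
insert_seconds = {
    "Orkut": {"DGAP": 83.55065, "No EL": 374.86,
              "No EL&UL": 383.52, "No EL&UL&DP": 588.37},
    "LiveJournal": {"DGAP": 29.739225, "No EL": 136.283,
                    "No EL&UL": 146.092, "No EL&UL&DP": 240.463},
    "CitPatents": {"DGAP": 12.254175, "No EL": 51.261,
                   "No EL&UL": 58.4692, "No EL&UL&DP": 107.388},
}

# Slowdown of every variant relative to the full DGAP design.
slowdown = {
    graph: {variant: secs / times["DGAP"] for variant, secs in times.items()}
    for graph, times in insert_seconds.items()
}

for graph, ratios in slowdown.items():
    row = ", ".join(f"{variant}: {ratio:.2f}x"
                    for variant, ratio in ratios.items())
    print(f"{graph}: {row}")
```

On Orkut, for example, the ‘No EL’ ratio comes out close to the $4.5\times$ figure quoted above.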

DGAP Configurations Evaluations. Besides the three system components, DGAP includes a set of configurations that impact its performance. For example, the size of the per-section edge log affects both the PM usage and the rebalancing frequency, and hence the insertion performance. To evaluate it, we measured how this size, ELOG_SZ, impacts graph insertion performance and PM consumption. The results are reported in Fig. 9. Due to space limits, we only show results for the Orkut and LiveJournal graphs; other graphs show similar patterns. We varied ELOG_SZ from 64 bytes to 16 KB. The bar length represents the total space needed to store all the per-section edge logs, which increases proportionally with ELOG_SZ. The labels above each bar report the percentage utilization of these logs during graph insertions. As the edge log grows, the utilization rate drops significantly, from 80.96% to 5.60%, as there might not be enough nearby shifts to fill the logs. The green line shows the insertion performance delivered at each log size. Larger logs clearly reduce the insertion time, but the benefits become much smaller beyond 2048 bytes, which is chosen as the default ELOG_SZ in DGAP.
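
The space versus rebalancing trade-off behind ELOG_SZ can be illustrated with a deliberately simplified toy model. It is not DGAP's insertion path: it assumes that every insertion is appended to the log of a uniformly random section, that each logged edge takes `ENTRY_BYTES` bytes, and that a full log triggers exactly one rebalance that empties it. The section count, insertion count, and entry size are all hypothetical values.

```python
import random

ENTRY_BYTES = 8  # hypothetical size of one logged edge


def simulate(elog_sz, num_sections=64, num_inserts=100_000, seed=0):
    """Return (total log bytes, number of rebalances) for one log size."""
    capacity = elog_sz // ENTRY_BYTES      # entries per section log
    rng = random.Random(seed)              # same insert stream for every size
    fill = [0] * num_sections
    rebalances = 0
    for _ in range(num_inserts):
        section = rng.randrange(num_sections)
        fill[section] += 1
        if fill[section] == capacity:      # log full: merge it and reset
            rebalances += 1
            fill[section] = 0
    return num_sections * elog_sz, rebalances


results = {size: simulate(size) for size in (64, 256, 2048, 16384)}
for size, (total_bytes, rebalances) in results.items():
    print(f"ELOG_SZ={size:>5} B  log space={total_bytes:>8} B  "
          f"rebalances={rebalances}")
```

Even in this crude model, the total log space grows linearly with the log size while the number of rebalances falls, which is the qualitative shape of the trade-off described above.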

Figure 9. Impacts of the size of per-section edge log. (a) Orkut. (b) LiveJournal.

DGAP Recovery Evaluations. Each time DGAP reboots, it reloads the metadata into DRAM before operating. Such a normal start is fast: in our evaluation, DGAP spends 1.16 seconds rebooting even on the largest Friendster graph. After a crash, DGAP needs to do more housekeeping work to recover the system state. These steps include scanning the edge array and logs to repair the inconsistencies caused by the crash. Consequently, the DGAP crash recovery time depends on the graph size. However, sequential access in PMs is fast, and so is DGAP recovery. In our experiments, for the smaller graphs (e.g., Orkut, LiveJournal, and CitPatents), recovery takes less than 1 second. For the larger graphs, it may take more than 4 seconds. Note that these time costs are incurred only when recovering from a crash.
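
As a rough illustration of how a log scan can repair an interrupted update, the sketch below shows a generic undo-logging pattern on an in-memory list. It is not DGAP's recovery code: the `UndoLog` class, its fields, and the `recover` function are hypothetical, and the flushes and fences that a real PM implementation needs to order the log write before the data write are only noted in comments.

```python
class UndoLog:
    """Toy undo log that remembers the old contents of one array range."""

    def __init__(self):
        self.valid = False
        self.start = 0
        self.saved = []

    def begin(self, array, start, length):
        # Save the old contents first; on real PM the saved data would be
        # persisted before the valid flag, and both before any in-place write.
        self.start = start
        self.saved = array[start:start + length]
        self.valid = True

    def commit(self):
        # The update finished, so the saved copy is no longer needed.
        self.valid = False


def recover(array, log):
    """Roll back an interrupted update, if the log still holds one."""
    if log.valid:
        array[log.start:log.start + len(log.saved)] = log.saved
        log.valid = False


# Zeros stand for empty slots in a toy edge array.
edges = [1, 2, 3, 0, 0, 4, 5, 0]
log = UndoLog()

log.begin(edges, 2, 4)      # about to redistribute slots 2..5
edges[2] = 0                # partial in-place writes ...
edges[3] = 3
# ... simulated crash here: commit() is never reached.

recover(edges, log)
print(edges)
```

Because the rollback only copies the saved range back, its cost in this sketch depends on the size of the interrupted update rather than on the whole array; the scan of the edge array mentioned above is a separate step.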

5. Related Works

The works most closely related to ours are NVGRAPH (Lim et al., 2019) and XPGraph (Wang et al., 2022). Both frameworks are designed for persistent memory devices. NVGRAPH proposed a dual-version data structure for NVM and DRAM to achieve high-speed data persistence and graph analysis. However, since NVGRAPH was designed before actual persistent memory devices were released, many of its assumptions have later been shown to be inaccurate (Yang et al., 2020). Consequently, it did not leverage many performance features of PMs. As such, we do not compare DGAP with NVGRAPH, as it wouldn’t be a fair comparison. Similar to DGAP, XPGraph was designed for and evaluated on Intel Optane PMs, and is essentially a PM-based GraphOne. Through extensive evaluations, we demonstrate that DGAP outperforms XPGraph in both graph updates and graph analysis tasks, highlighting the promising performance of mutable CSR data structures. A recent study (Gill et al., 2020) systematically benchmarks graph processing on PMs. However, this study assumes that persistent memory functions as volatile, larger DRAM serving only graph analysis, which is fundamentally different from DGAP.

In addition to PM-based graph analysis, there have been a large number of PM-based indexing data structures, such as B+-Tree (Li et al., 2022; Zhang et al., 2022; Chen et al., 2020b; Liu et al., 2020; Hwang et al., 2018; Chen and Jin, 2015; Oukid et al., 2016; Yang et al., 2015) and Hashtable (Lamar et al., 2021; Benson et al., 2021; Nam et al., 2019; Zuo et al., 2018, 2019; Chen et al., 2020a). Some works (Lee et al., 2019; Venkataraman et al., 2011; Kim et al., 2021; Haria et al., 2020; Friedman et al., 2020; Memaripour et al., 2020; Krishnan et al., 2021; Huang et al., 2021) also proposed general guidelines for porting in-memory data structures to PMs. Many of DGAP’s design choices are aligned with these existing studies, but focus more on graph updates and analysis.

In addition to PM-based graph frameworks, there are a significant number of single-node dynamic graph analysis frameworks. We categorize them into in-memory and out-of-core frameworks. For in-memory dynamic graph frameworks (Islam et al., 2022a; Pandey et al., 2021; Wheatman and Xu, 2018; King et al., 2016; Firmli et al., 2020), the graphs are not persistent and need rebuilding after a crash or reboot. Even with data periodically synchronized to fast non-volatile storage devices, like PMs, existing in-memory graph frameworks still face the challenge of striking a balance between data loss and graph update speed. Our evaluations on BAL, LLAMA, and GraphOne show that naively porting existing in-memory graph frameworks to persistent memory leads to performance issues. DGAP is also rooted in an in-memory data structure (mutable CSR), but contains a series of new designs to maximize the performance. Existing out-of-core dynamic graph frameworks are designed for slow block-based storage devices (Kyrola et al., 2012; Macko et al., 2015). For example, in LLAMA (Macko et al., 2015), newly added edges are first batched up in the delta map and periodically synced to a CSR snapshot. Such batching may not be necessary on persistent memory. In DGAP, by contrast, graph changes are immediately visible to analytic tasks.

6. Conclusion and Future Work

In this study, we present DGAP, a new graph analysis framework built on persistent memory. DGAP leverages the existing DRAM-based mutable Compressed Sparse Row (CSR) graph structure with extensive new designs for persistent memory devices to achieve both efficient graph updates and graph analysis. Our results show that DGAP outperforms state-of-the-art dynamic graph frameworks on PMs, such as LLAMA, GraphOne, and XPGraph, by up to $3.2\times$ in graph updates and $3.77\times$ in graph analysis. Our exploration of DGAP shows that persistent memory is a promising alternative to support efficient dynamic graph analysis. In the future, we plan to further improve DGAP designs, including a Copy-on-Write strategy for the Degree Cache and a fine-grained locking mechanism. We also plan to investigate how to extend DGAP to a distributed environment using RDMA in PMs to support even larger graphs.

Acknowledgments

We sincerely thank the anonymous reviewers for their valuable feedback. This work was supported in part by NSF grants CNS-1852815, CCF-1910727, CCF-1908843, and CNS-2008265.

References

  • Akinaga and Shima (2010) Hiroyuki Akinaga and Hisashi Shima. 2010. Resistive random access memory (ReRAM) based on metal oxides. Proc. IEEE 98, 12 (2010), 2237–2251.
  • Bader et al. (2005) David A Bader, Guojing Cong, and John Feo. 2005. On the architectural requirements for efficient execution of graph algorithms. In 2005 International Conference on Parallel Processing (ICPP’05). IEEE.
  • Beamer (2015) Scott Beamer. 2015. GAP Benchmark Suite. https://github.com/sbeamer/gapbs. Accessed July 30, 2021.
  • Beamer et al. (2012) Scott Beamer, Krste Asanovic, and David Patterson. 2012. Direction-optimizing Breadth-First Search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12).
  • Bender and Hu (2007) Michael A Bender and Haodong Hu. 2007. An adaptive packed-memory array. ACM Transactions on Database Systems (TODS’07) 32 (2007).
  • Benson et al. (2021) Lawrence Benson, Hendrik Makait, and Tilmann Rabl. 2021. Viper: An Efficient Hybrid PMem-DRAM Key-Value Store. Proc. VLDB Endow. 14, 9 (2021).
  • Besta et al. (2021) Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten Hoefler. 2021. Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems. IEEE Transactions on Parallel and Distributed Systems (2021).
  • Brandes (2001) Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. Journal of mathematical sociology 25, 2 (2001).
  • Chen and Jin (2015) Shimin Chen and Qin Jin. 2015. Persistent b+-trees in non-volatile main memory. Proceedings of the VLDB Endowment 8, 7 (2015), 786–797.
  • Chen et al. (2020b) Youmin Chen, Youyou Lu, Kedong Fang, Qing Wang, and Jiwu Shu. 2020b. UTree: A Persistent B+-Tree with Low Tail Latency. Proc. VLDB Endow. 13, 12 (2020).
  • Chen et al. (2020a) Zhangyu Chen, Yu Hua, Bo Ding, and Pengfei Zuo. 2020a. Lock-free Concurrent Level Hashing for Persistent Memory. In 2020 USENIX Annual Technical Conference (USENIX ATC’20).
  • Cooper et al. (2010) Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing (SoCC’10). ACM.
  • Corporation (2019) Intel Corporation. 2019. Key features on Cascade Lake. https://www.intel.com/content/www/us/en/products/platforms/details/cascade-lake.html. Accessed Jan. 22, 2023.
  • Corporation (2021) Intel Corporation. 2021. eADR: New Opportunities for Persistent Memory Applications. https://www.intel.com/content/www/us/en/developer/articles/technical/eadr-new-opportunities-for-persistent-memory-applications.html. Accessed Jan. 22, 2023.
  • De Leo and Boncz (2021) Dean De Leo and Peter Boncz. 2021. Teseo and the Analysis of Structural Dynamic Graphs. Proc. VLDB Endow. 14, 6 (2021).
  • Ediger et al. (2012) David Ediger, Rob McColl, Jason Riedy, and David A. Bader. 2012. STINGER: High performance data structure for streaming graphs. In IEEE Conference on High Performance Extreme Computing (HPEC’12).
  • Firmli et al. (2020) Soukaina Firmli, Vasileios Trigonakis, Jean-Pierre Lozi, Iraklis Psaroudakis, Alexander Weld, Dalila Chiadmi, Sungpack Hong, and Hassan Chafi. 2020. CSR++: A Fast, Scalable, Update-Friendly Graph Data Structure. In 24th International Conference on Principles of Distributed Systems (OPODIS’20).
  • Friedman et al. (2020) Michal Friedman, Naama Ben-David, Yuanhao Wei, Guy E. Blelloch, and Erez Petrank. 2020. NVTraverse: In NVRAM Data Structures, the Destination is More Important than the Journey. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’20).
  • Gill et al. (2020) Gurbinder Gill, Roshan Dathathri, Loc Hoang, Ramesh Peri, and Keshav Pingali. 2020. Single Machine Graph Analytics on Massive Datasets Using Intel Optane DC Persistent Memory. Proc. VLDB Endow. 13, 8 (2020).
  • Gwennap (2019) Linley Gwennap. 2019. First Optane DIMMs Disappoint. The LinleyGroup.
  • Haria et al. (2020) Swapnil Haria, Mark D Hill, and Michael M Swift. 2020. MOD: Minimally ordered durable datastructures for persistent memory. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20).
  • Huang et al. (2021) Hanxian Huang, Zixuan Wang, Juno Kim, Steven Swanson, and Jishen Zhao. 2021. Ayudante: A Deep Reinforcement Learning Approach to Assist Persistent Memory Programming. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’21).
  • Hwang et al. (2018) Deukyeon Hwang, Wook-Hee Kim, Youjip Won, and Beomseok Nam. 2018. Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree. In 16th USENIX Conference on File and Storage Technologies (USENIX FAST’18).
  • Islam et al. (2022a) Abdullah Al Raqibul Islam, Dong Dai, and Dazhao Cheng. 2022a. VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid’22).
  • Islam et al. (2020) Abdullah Al Raqibul Islam, Dong Dai, Anirudh Narayanan, and Christopher York. 2020. A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective. In 36th International Conference on Massive Storage Systems and Technology (MSST’20).
  • Islam et al. (2022b) Abdullah Al Raqibul Islam, Christopher York, and Dong Dai. 2022b. A performance study of optane persistent memory: from storage data structures’ perspective. CCF Transactions on High Performance Computing (24 Sep 2022).
  • Iyer et al. (2015) Anand Iyer, Li Erran Li, and Ion Stoica. 2015. CellIQ: Real-Time Cellular Network Analytics at Scale. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI’15).
  • Izraelevitz et al. (2019) Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R Dulloor, et al. 2019. Basic performance measurements of the intel optane DC persistent memory module. arXiv preprint arXiv:1903.05714 (2019).
  • Kim et al. (2021) Wook-Hee Kim, R. Madhava Krishnan, Xinwei Fu, Sanidhya Kashyap, and Changwoo Min. 2021. PACTree: A High Performance Persistent Range Index Using PAC Guidelines. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP’21).
  • King et al. (2016) James King, Thomas Gilray, Robert M Kirby, and Matthew Might. 2016. Dynamic-CSR: A format for dynamic sparse-matrix updates. In Springer-Verlag, Vol. 9697. 61–80.
  • Krishnan et al. (2021) R. Madhava Krishnan, Wook-Hee Kim, Xinwei Fu, Sumit Kumar Monga, Hee Won Lee, Minsung Jang, Ajit Mathew, and Changwoo Min. 2021. TIPS: Making Volatile Index Structures Persistent with DRAM-NVMM Tiering. In 2021 USENIX Annual Technical Conference (USENIX ATC’21).
  • Kültürsay et al. (2013) Emre Kültürsay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu. 2013. Evaluating STT-RAM as an energy-efficient main memory alternative. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’13).
  • Kumar and Huang (2019) Pradeep Kumar and H Howie Huang. 2019. Graphone: A data store for real-time analytics on evolving graphs. In 17th USENIX Conference on File and Storage Technologies (FAST’19).
  • Kyrola et al. (2012) Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. Graphchi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12).
  • Lamar et al. (2021) Kenneth Lamar, Christina Peterson, Damian Dechev, Roger Pearce, Keita Iwabuchi, and Peter Pirkelbauer. 2021. PMap: A Non-volatile Lock-free Hash Map with Open Addressing. In IEEE 10th Non-Volatile Memory Systems and Applications Symposium (NVMSA’21).
  • Lee et al. (2010) Benjamin C Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger. 2010. Phase-change technology and the future of main memory. IEEE micro 30, 1 (2010), 143–143.
  • Lee et al. (2019) Se Kwon Lee, Jayashree Mohan, Sanidhya Kashyap, Taesoo Kim, and Vijay Chidambaram. 2019. Recipe: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP’19).
  • Leskovec and Krevl (2014) Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
  • Li et al. (2022) Tongliang Li, Haixia Wang, Airan Shao, and Dongsheng Wang. 2022. SSB-Tree: Making Persistent Memory B+- Trees Crash-Consistent and Concurrent by Lazy-Box. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS’22).
  • Lim et al. (2019) Soklong Lim, Zaixin Lu, Bin Ren, and Xuechen Zhang. 2019. Enforcing crash consistency of evolving network analytics in non-volatile main memory systems. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT’19).
  • Liu et al. (2020) Jihang Liu, Shimin Chen, and Lujun Wang. 2020. LB+Trees: Optimizing Persistent Index Performance on 3DXPoint Memory. Proc. VLDB Endow. 13, 7 (2020).
  • Macko et al. (2015) Peter Macko, Virendra J Marathe, Daniel W Margo, and Margo I Seltzer. 2015. Llama: Efficient graph analytics using large multiversioned arrays. In IEEE 31st International Conference on Data Engineering (ICDE’15).
  • Madduri et al. (2009) Kamesh Madduri, David Ediger, Karl Jiang, David A Bader, and Daniel Chavarria-Miranda. 2009. A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS’09).
  • Memaripour et al. (2020) Amirsaman Memaripour, Joseph Izraelevitz, and Steven Swanson. 2020. Pronto: Easy and Fast Persistence for Volatile Data Structures. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20).
  • Nam et al. (2019) Moohyeon Nam, Hokeun Cha, Young ri Choi, Sam H. Noh, and Beomseok Nam. 2019. Write-Optimized Dynamic Hashing for Persistent Memory. In 17th USENIX Conference on File and Storage Technologies (USENIX FAST’19).
  • Optane (2019) Optane. 2019. Intel Optane Persistent Memory. https://www.intel.com/content/www/us/en/products/docs/memory-storage/optane-persistent-memory/optane-dc-persistent-memory-brief.html.
  • Oukid et al. (2016) Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. 2016. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16).
  • Pandey et al. (2021) Prashant Pandey, Brian Wheatman, Helen Xu, and Aydin Buluc. 2021. Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD’21).
  • Pinar and Heath (1999) Ali Pinar and Michael T Heath. 1999. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (SC’99).
  • pmem.io Persistent (2019) pmem.io Persistent. 2019. Persistent Memory Programming. https://pmem.io.
  • Qureshi et al. (2009a) Moinuddin K Qureshi, John Karidis, Michele Franceschini, Vijayalakshmi Srinivasan, Luis Lastras, and Bulent Abali. 2009a. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’09).
  • Qureshi et al. (2009b) Moinuddin K Qureshi, Vijayalakshmi Srinivasan, and Jude A Rivers. 2009b. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th annual international symposium on Computer architecture (ISCA’09).
  • Raoux et al. (2008) Simone Raoux, Geoffrey W Burr, Matthew J Breitwisch, Charles T Rettner, Y-C Chen, Robert M Shelby, Martin Salinga, Daniel Krebs, S-H Chen, H-L Lung, et al. 2008. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development 52, 4.5 (2008), 465–479.
  • Roy et al. (2013) Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-stream: Edge-Centric Graph Processing using Streaming Partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP’13).
  • Sahu et al. (2018) Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. 2018. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. Proc. VLDB Endow. 11, 4 (2018).
  • Sha et al. (2017) Mo Sha, Yuchen Li, Bingsheng He, and Kian-Lee Tan. 2017. Technical report: Accelerating dynamic graph analytics on gpus. arXiv preprint arXiv:1709.05061 (2017).
  • Shao et al. (2013) Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity: A Distributed Graph Engine on a Memory Cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD’13).
  • Shiloach and Vishkin (1980) Yossi Shiloach and Uzi Vishkin. 1980. An O(log n) parallel connectivity algorithm. Technical Report. Computer Science Department, Technion.
  • Shu et al. (2018) Hongping Shu, Hongyu Chen, Hao Liu, Youyou Lu, Qingda Hu, and Jiwu Shu. 2018. Empirical study of transactional management for persistent memory. In IEEE 7th Non-Volatile Memory Systems and Applications Symposium (NVMSA’18).
  • Song et al. (2023) Yongju Song, Wook-Hee Kim, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. 2023. Prism: Optimizing Key-Value Store for Modern Heterogeneous Storage Devices. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS’23).
  • Suzuki and Swanson (2015) Kosuke Suzuki and Steven Swanson. 2015. A survey of trends in non-volatile memory technologies: 2000-2014. In Proceedings of the IEEE International Memory Workshop (IMW’15).
  • van Renen et al. (2019) Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann, and Alfons Kemper. 2019. Persistent memory i/o primitives. In Proceedings of the 15th International Workshop on Data Management on New Hardware (DaMoN’19).
  • Venkataraman et al. (2011) Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell. 2011. Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (USENIX FAST’11).
  • Wang (2022) Rui Wang. 2022. XPGraph. https://github.com/rwang067/XPGraph. Accessed January, 2023.
  • Wang et al. (2022) Rui Wang, Shuibing He, Weixu Zong, Yongkun Li, and Yinlong Xu. 2022. XPGraph: XPline-Friendly Persistent Memory Graph Stores for Large-Scale Evolving Graphs. In Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO’22).
  • Wheatman and Xu (2018) Brian Wheatman and Helen Xu. 2018. Packed Compressed Sparse Row: A Dynamic Graph Representation. In IEEE High Performance Extreme Computing Conference (HPEC’18).
  • Wheatman and Xu (2021) Brian Wheatman and Helen Xu. 2021. A Parallel Packed Memory Array to Store Dynamic Graphs. In Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX’21).
  • Wu et al. (2021) Kai Wu, Jie Ren, Ivy Peng, and Dong Li. 2021. ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory. In 19th USENIX Conference on File and Storage Technologies (USENIX FAST’21).
  • Wu et al. (2020) Yinjun Wu, Kwanghyun Park, Rathijit Sen, Brian Kroth, and Jaeyoung Do. 2020. Lessons Learned from the Early Performance Evaluation of Intel Optane DC Persistent Memory in DBMS. In Proceedings of the 16th International Workshop on Data Management on New Hardware (DaMoN’20).
  • Xiang et al. (2022) Lingfeng Xiang, Xingsheng Zhao, Jia Rao, Song Jiang, and Hong Jiang. 2022. Characterizing the Performance of Intel Optane Persistent Memory: A Close Look at Its on-DIMM Buffering. In Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys’22).
  • Yang et al. (2020) Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve Swanson. 2020. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. In 18th USENIX Conference on File and Storage Technologies (USENIX FAST’20).
  • Yang et al. (2015) Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. 2015. NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems. In 13th USENIX Conference on File and Storage Technologies (USENIX FAST’15).
  • Yang et al. (2013) J Joshua Yang, Dmitri B Strukov, and Duncan R Stewart. 2013. Memristive devices for computing. Nature nanotechnology 8, 1 (2013), 13.
  • Zardoshti et al. (2020) Pantea Zardoshti, Michael Spear, Aida Vosoughi, and Garret Swart. 2020. Understanding and improving persistent transactions on Optane™ DC memory. In 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS’20).
  • Zhang et al. (2022) Bowen Zhang, Shengan Zheng, Zhenlin Qi, and Linpeng Huang. 2022. NBTree: A Lock-Free PM-Friendly Persistent B+-Tree for EADR-Enabled PM Systems. Proc. VLDB Endow. 15, 6 (2022).
  • Zhang et al. (2018) Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe. 2018. GraphIt: A High-Performance Graph DSL. Proc. ACM Program. Lang. 2, OOPSLA (2018).
  • Zuo et al. (2018) Pengfei Zuo, Yu Hua, and Jie Wu. 2018. Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory. In 13th USENIX Symposium on Operating Systems Design and Implementation (USENIX OSDI’18).
  • Zuo et al. (2019) Pengfei Zuo, Yu Hua, and Jie Wu. 2019. Level Hashing: A High-Performance and Flexible-Resizing Persistent Hashing Index Structure. ACM Trans. Storage 15, 2 (2019).