License: arXiv.org perpetual non-exclusive license
arXiv:2309.07984v3 [cs.AR] 17 Jan 2024

Hardware-Software Co-design for
Broad Acceleration on Commercial PIM Architectures

Johnathan Alsop
AMD Research
[email protected]
   Shaizeen Aga
AMD Research
[email protected]
   Mohamed Ibrahim
AMD Research
[email protected]
   Mahzabeen Islam
AMD Research
[email protected]
   Nuwan Jayasena
AMD Research
[email protected]
   Andrew Mccrabb
University of Michigan
[email protected]
Abstract

Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such proposals are reasonable given the growing importance of ML, as memory is a pervasive component, there is a case for a more inclusive PIM design that can accelerate primitives across domains.

In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We begin by outlining a set of characteristics, termed PIM-amenability-test, which aid in assessing whether a given primitive is likely to be accelerated by PIM. Next, we apply this test to the primitives under study to ascertain efficient data-placement and orchestration to map the primitives to the underlying PIM architecture. We observe that, even though the primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations which stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly.

1 Introduction

As applications of both commercial and scientific importance continue to demand more memory bandwidth, memory vendors are reassessing processing in memory (PIM) as a potential solution. With PIM, in/near memory compute units work in tandem with traditional processors to enable higher effective memory bandwidth (potentially an order of magnitude or more) over that available externally (e.g., to CPUs, GPUs, ASICs, etc.). Recently, multiple memory vendors have proposed commercially viable PIM designs (which we will refer to as "commercial PIM" or simply "PIM") [34, 33].

While proposed commercial PIM designs add a necessary and complementary design point in the application acceleration space, the primary driving factor influencing their design is the proliferation of machine learning (ML) workloads. This is a reasonable strategy given the commercial importance of ML. However, in this work, we address a broader question: how capable are these commercial PIM designs of accelerating important primitives across domains? We believe this question is worthwhile for several reasons. First, current PIM designs are largely geared toward limited primitives in ML (e.g., dense matrix-vector computations), while ML itself is comprised of a much broader set of (often memory-limited) computations [53, 39]. Second, memory is a pervasive component regardless of the processor it is coupled with (e.g., CPU, GPU, ASIC, etc.). As such, even with prioritizing ML, a more inclusive PIM system design is likely to provide both holistic and broad acceleration (ML and non-ML parts of an application). Additionally, a PIM system design with inclusivity in mind is more likely to weather fast-paced workload evolution (e.g., as manifested by ML).

In this work, we ascertain the capabilities of commercial PIM designs to accelerate primitives across domains. We begin with an overview of commercial PIM proposals to establish a strawman PIM system representative of them. Next, we develop a PIM-amenability-test which aids a programmer in assessing if a given primitive is likely to be accelerated by commercial PIM and helps guide the programmer toward an efficient offload of computation to PIM.

Next, we focus on a set of primitives across domains: scientific (wave simulation - wavesim-volume and wavesim-flux primitives), machine learning (sparse skinny gemms - ss-gemm), and graph analytics (push primitive). We choose these primitives both for their importance in their source domains and because, together, they provide a broad set of scenarios for us to assess commercial PIM designs. Further, to assess the benefits PIM provides beyond existing state-of-the-art solutions, we start with GPU-accelerated baselines of these primitives. We then apply our proposed PIM-amenability-test to the primitives under study, which in turn helps us ascertain data-placement and compute orchestration to map the primitives to the underlying PIM architecture efficiently. We believe that the above process serves as a good template to study and map new primitives to emerging commercial PIM designs.

We then present a performance modelling methodology and observe that, even though the primitives under study are largely PIM-amenable, existing commercial PIM designs do not realize their performance potential for these primitives. This is true even with careful data-placement and compute orchestration. To understand this performance gap, we further analyze the PIM executions and observe that bottlenecks in baseline accelerator executions are exacerbated when a computation is offloaded to PIM (e.g., DRAM row activation overheads). We also observe that PIM acceleration is sensitive to input-dependent cache locality, and how current compute orchestration for commercial PIM designs opens up unique opportunities that software can exploit (e.g., sparsity-aware orchestration).

Following up on our initial vision of a more inclusive PIM design, we identify hardware augmentations and software optimizations to address the above identified challenges and also harness opportunities present. Specifically, we identify three dimensions for our proposed optimizations: architecture-aware (mitigating row activation overheads via careful scheduling), sparsity-aware (leveraging data sparsity via fine-grain PIM orchestration), and cache-aware (allowing input-dependent cache locality via careful PIM offload). Finally, we perform pertinent limit studies anchoring on PIM architecture design decisions and identify their effect on performance. We show how our proposed optimizations stand to broaden the acceleration reach of commercial PIM designs, achieving speedups of up to 2.68x, 3.17x, and 2.43x in scientific, ML, and graph analytics domains respectively (of an available upper-bound of 4x).

Overall, we believe that while emerging commercial PIM designs add a necessary and complementary design point in the application acceleration space, careful attention to the inclusivity of PIM is important. As memory is an omnipresent component in any system, any acceleration that memory can deliver stands to complement current (and future) processor-side optimizations. Further, while prioritizing ML needs is reasonable in the near term, a more inclusive PIM design stands to holistically accelerate both ML and non-ML computations. Finally, hardware-software co-design is necessary to truly and broadly realize PIM acceleration. In summary, while emerging commercial PIM designs do hold promise, we demonstrate that applying PIM to a broader range of primitives can greatly enhance their utility.

Our work makes the following contributions:

  • To the best of our knowledge, this is the first work to evaluate emerging commercial PIM designs across primitives from a broad spectrum of domains.

  • We develop a PIM-amenability-test, comprised of a set of characteristics and associated heuristics, which aids a programmer in assessing if a given primitive is likely to be accelerated by commercial PIM.

  • We identify data-mapping and careful compute orchestration for primitives under study and show that, despite such efforts, commercial PIM systems in their current form do not realize their performance potential.

  • We identify bottlenecks and opportunities unique to commercial PIM systems and propose hardware augmentations and software optimizations to broaden acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline).

  • Overall, we make a case for an inclusive PIM design which, while prioritizing commercially dominant primitives (ML), incorporates design changes that enable broad acceleration with PIM.

2 Background

In this section, we first discuss and motivate our assumed baseline system. Next, we provide a background on recent commercial PIM prototypes to present a representative strawman PIM design we evaluate in this work. Finally, we conclude with a discussion on the domains and primitives we focus on in this work.

2.1 Baseline System - GPU + HBM

The left side of Figure 1 depicts the baseline system studied in this work: a GPU coupled with HBM memory [1].

GPU: While PIM can be coupled with any processor (CPUs, GPUs), our evaluation assumes a GPU for multiple reasons. First, over the past decade, GPUs have emerged as performant and programmable accelerators for a diverse range of highly parallel compute workloads. As such, a GPU provides a strong baseline against which we demonstrate PIM benefits. Second, there exists a real PIM prototype [34] coupled with GPUs, which gives us a baseline architecture to assess. Finally, as GPU compute throughput is increasing more rapidly than memory bandwidth, many emerging GPU workloads are likely to be memory bandwidth bound.

High Bandwidth Memory (HBM): High bandwidth memory is a specialized form of DRAM that attains high bandwidth and energy efficiency via high density interconnects and 3D stacking. As illustrated in Figure 1, each HBM module is a 3D-stack of DRAM dies and a base logic die connected using low-power through silicon vias (TSVs). HBM can be tightly integrated with a processor (in this case a GPU) die on a common substrate such as a silicon interposer [32] with an order of magnitude more I/O interconnects [34] than conventional DRAM.

Each HBM DRAM die is composed of pseudo-channels (pCHs), which further comprise multiple banks that share the data bus associated with a pCH. The address associated with a baseline read/write request specifies the pCH and bank where the data resides along with the row and column address within the bank. On a read request, the specified row is activated in the bank (row activation), which causes the data in the row to be read out to the row-buffer associated with the bank (row open) after DRAM row activation delay. Subsequently, the column decoder selects a DRAM word from the row buffer based on specified column address (column access). Row activation delay overhead can be mitigated in DRAM by exploiting row locality (subsequent accesses to the open row do not incur activation latency) and bank parallelism (column commands to different banks keep the data bus utilized while one bank is activating a row). Note that the basic sequence of operations for an HBM memory access is similar to that of DDR DRAM.
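To make the cost of row activations concrete, the following minimal sketch counts activations for a stream of accesses. The open-page policy, the `(bank, row)` access tuples, and the function name are illustrative assumptions rather than parameters of any specific HBM device, and the sketch does not model how bank parallelism hides activation latency.

```python
def count_row_activations(accesses):
    """Count row activations for a sequence of (bank, row) accesses.

    Assumes an open-page policy: a bank keeps its last row open until a
    different row in the same bank is requested.
    """
    open_row = {}  # bank -> row currently held in that bank's row buffer
    activations = 0
    for bank, row in accesses:
        if open_row.get(bank) != row:
            activations += 1      # row miss: the bank must activate the row
            open_row[bank] = row
    return activations


# Eight accesses to one bank, alternating between two rows.
alternating = [(0, i % 2) for i in range(8)]
# The same accesses reordered so that each row is visited once.
grouped = sorted(alternating, key=lambda access: access[1])

print(count_row_activations(alternating))  # 8
print(count_row_activations(grouped))      # 2
```

Grouping accesses by row is the row-locality effect described above: the same eight accesses need two activations instead of eight.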

Figure 1: PIM design based on PIM-HBM [34].

2.2 Commercial PIM Designs

Recently, two designs for DRAM-attached commercial PIM have been announced by Samsung and SK Hynix. We discuss the details of each design in this section.

HBM-PIM: Samsung’s proposed PIM architecture [34, 30] places ALUs and associated register files on the periphery of DRAM banks (Figure 1). This design does not disturb the bank and sub-array structures of the memory, improving its viability for commercialization. The PIM ALU is a 256b-wide SIMD datapath that performs sixteen 16b operations in parallel, and is matched to the input/output width of the DRAM cell arrays of the bank. The register files can be used for intermediate results or for staging data from an open DRAM row to reduce the frequency of row activations. Notably, the PIM units do not contain instruction fetch or other "frontend" capabilities, reducing their area costs, and they execute instructions in response to commands issued from the GPU processor. These GPU commands are issued subject to fixed timing constraints, similar to how traditional memory operations are issued. The key benefit of this architecture is memory bandwidth amplification, which is achieved by broadcasting the commands to all banks (or a subset of banks) of a pCH (normal load/store operations only access a single bank at a time). This is possible because the data from each bank goes to the associated PIM unit rather than being transmitted across the shared pCH. The memory functions as a standard DRAM when the PIM capabilities are not used.

Figure 2: Domains and primitives under study.

The authors also describe a prototype implementation named PIM-HBM that is fabricated as an extension to HBM2 and is evaluated in silicon. In the prototype, each PIM unit is shared between a pair of banks, demonstrating the flexibility of the design to provision a PIM unit per bank, per pair of banks, or other grouping of banks to trade off performance vs area/cost considerations. The PIM ALUs of the prototype support a limited but generic set of ALU operations, which can presumably be extended in subsequent implementations.

GDDR-PIM: The SK Hynix design [33, 35, 28] describes a PIM system based on GDDR6 that is specifically targeted for ML inference applications and is tailored to matrix-vector multiplications and non-linear activation functions. Despite the specific application focus, this design takes a very similar approach to the PIM-HBM in that it places compute units on the periphery of DRAM banks and relies on the GPU for instruction triggering in place of native frontend hardware. The datapath is also in a 256b SIMD configuration matched to the bank input/output data width. In the evaluated prototype implementation, a PIM unit is instantiated for each bank of the GDDR6 memory module.

Commercial PIM Performance Space: Table 1 provides key performance metrics for the above commercial PIM designs (per-device compute and data bandwidth). We also include state-of-the-art GPU (AMD Instinct MI250 Accelerator) performance metrics (per-HBM stack) for comparison. As depicted, PIM data bandwidth is considerably higher than memory bandwidth available to the GPU, while GPU compute capability is considerably higher than that of PIM.

PIM Strawman: For our analyses, we distill a PIM design based on the basic characteristics common to the two memory vendor PIM proposals above but lean closer to HBM-PIM for two reasons. First, HBM-PIM is the more flexible of the two in terms of programmability, and our interest is in further broadening the applicability of PIM. Second, as many modern high-performance GPUs use HBM DRAM, HBM-PIM provides a natural comparison point.

Table 1: Commercial PIMs relative to GPU
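| Property           | MI250-GPU | HBM-PIM | GDDR-PIM |
|--------------------|-----------|---------|----------|
| Mem clock (GHz)    | 1.6       | 1.2     | 1.0      |
| FP16 TFLOPS        | 45        | 1.2     | 1        |
| PIM data BW (GB/s) | n/a       | 1229    | 1024     |
| Mem BW (GB/s)      | 400       | 307     | 64       |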
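The ratios implied by Table 1 can be worked out directly. The sketch below uses only the values in the table; the dictionary layout and function names are illustrative, and the derived ratios are simple arithmetic on the table rather than separately reported results.

```python
# Values copied from Table 1 (per HBM stack for the GPU, per device for PIM).
TABLE1 = {
    "MI250-GPU": {"fp16_tflops": 45.0, "pim_bw_gbs": None, "mem_bw_gbs": 400.0},
    "HBM-PIM": {"fp16_tflops": 1.2, "pim_bw_gbs": 1229.0, "mem_bw_gbs": 307.0},
    "GDDR-PIM": {"fp16_tflops": 1.0, "pim_bw_gbs": 1024.0, "mem_bw_gbs": 64.0},
}


def pim_bandwidth_multiplier(design):
    """PIM-internal data bandwidth divided by the same device's external bandwidth."""
    row = TABLE1[design]
    return row["pim_bw_gbs"] / row["mem_bw_gbs"]


def gpu_compute_advantage(design):
    """GPU FP16 throughput divided by the PIM design's FP16 throughput."""
    return TABLE1["MI250-GPU"]["fp16_tflops"] / TABLE1[design]["fp16_tflops"]


for design in ("HBM-PIM", "GDDR-PIM"):
    print(design,
          round(pim_bandwidth_multiplier(design), 2),
          round(gpu_compute_advantage(design), 1))
```

For HBM-PIM the bandwidth multiplier comes out at roughly 4x, while the GPU retains a compute advantage of well over an order of magnitude, which is the trade-off the table is meant to convey.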

2.3 Domains and Primitives

The goal of our work is to assess commercial PIM designs across a broader spectrum of domains than hitherto studied. To this end, we depict the domains and our chosen primitives in Figure 2. We assume 16-bit data types in our work, in line with the support in the HBM-PIM design [34] (see Section 6 for implications for other data types).

2.3.1 Scientific - Wave Simulation

The solution of partial differential equations (PDEs) is critical to many large-scale problems in HPC systems. One such use case, wave simulation, requires solving the wave equation to model the propagation of waves through different media and is used extensively in domains including medical imaging, earthquake modeling, oil and gas exploration, and antenna and radar modeling.

The Discontinuous Galerkin Method (DGM) is a popular algorithm for wave simulation due to its scalability [49]. Like many PDE solvers, DGM discretizes the wave space into a mesh of elements which are distributed among processors in the system (Figure 2a). It then iteratively executes a volume computation, a flux computation, along with communication and support computations to model wave propagation. The volume computation (termed wavesim-volume primitive) performs computations local to each mesh element; the flux computation (termed wavesim-flux primitive) propagates conditions at the boundaries (faces) of each mesh element. These two computations dominate execution time for most simulation tasks and as such we focus on these in our work.

Figure 3: Characteristics of interest for PIM-amenability-test.

2.3.2 Machine Learning - Sparse Skinny GEMMs

Machine learning (ML) continues to become ever more pervasive. At the heart of many ML [44] workloads is General Matrix-Matrix multiplication (GEMM). Unlike prior commercial PIM evaluations which focus on skinny GEMMs (the non-shared dimension is small) which are dense, in this work, we focus on GEMMs where one of the matrix inputs is also sparse (has many zeros, as in Figure 2b). We term these sparse skinny gemms (ss-gemm) and they manifest in many ML inference scenarios (e.g., Deep Learning Recommendation Model (DLRM) [43] with small batch sizes).

2.3.3 Graph Analytics - Push-based Computation

Graph analytics attempts to derive insights by analyzing the connectivity of graphs and data associated with its edges and vertices. Graph analytics is regularly used for navigation, chemical and biological modeling, social network monitoring and analysis, and many more applications.

Many common graph analytics workloads operate by iteratively propagating vertex properties (pull or push) across graph edges to neighboring vertices. In pull-based algorithms, a local vertex is processed by reading properties from each of its neighbor vertices and updating the local vertex based on what was read. In push-based algorithms (Figure 2c), a local vertex is processed by reading its properties and updating neighbor vertices based on what was read (using atomic RMWs to avoid race conditions). Push implementations have been found to offer attractive performance properties for many graph algorithms and inputs [12, 45], and they are widely used in GPU graph frameworks [48, 23] and benchmark suites [41, 20]. As such, we focus on a push-based algorithm primitive (termed push-primitive).
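The difference between the two styles can be sketched on a small graph in compressed sparse row (CSR) form. This is a sequential illustration only: the additive update, the tiny example graph, and the function names are assumptions made for the sketch, and the `+=` in the push loop stands in for what would be an atomic read-modify-write on a GPU.

```python
def push_step(out_offsets, out_neighbors, values):
    """One push iteration: each vertex adds its value to its out-neighbors."""
    updated = [0] * len(values)
    for u in range(len(values)):
        for e in range(out_offsets[u], out_offsets[u + 1]):
            v = out_neighbors[e]
            updated[v] += values[u]  # atomic RMW on parallel hardware
    return updated


def pull_step(in_offsets, in_neighbors, values):
    """One pull iteration: each vertex sums the values of its in-neighbors."""
    updated = [0] * len(values)
    for v in range(len(values)):
        for e in range(in_offsets[v], in_offsets[v + 1]):
            updated[v] += values[in_neighbors[e]]  # only the local vertex is written
    return updated


# Directed edges: 0->1, 0->2, 1->2, 2->0
out_offsets, out_neighbors = [0, 2, 3, 4], [1, 2, 2, 0]
in_offsets, in_neighbors = [0, 1, 2, 4], [2, 0, 0, 1]
values = [1, 2, 3]

print(push_step(out_offsets, out_neighbors, values))  # [3, 1, 3]
print(pull_step(in_offsets, in_neighbors, values))    # [3, 1, 3]
```

Both variants compute the same result here; they differ in which side is read and which side is written, and therefore in where irregular accesses and atomics appear.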

3 PIM-amenability

Given the recency of commercial PIM designs, to effectively and broadly use them, a methodology is needed to assess if a given primitive can benefit via PIM acceleration along with guidance on efficiently mapping the primitive to PIM. In this section, we present PIM-amenability-test which aims to do both, and we describe how we apply it to the primitives under study.

3.1 PIM-amenability Test

We begin this section by discussing a list of characteristics which help evaluate if a given computation is likely to benefit from PIM. We term the evaluation of a given computation against these characteristics as the PIM-amenability-test (depicted in Figure 3). We also discuss how this evaluation can aid the programmer in deducing efficient data-placement and compute orchestration when mapping a primitive to PIM.

Note that our proposed characteristics should be viewed holistically rather than in isolation; that is, manifesting only one of the characteristics does not guarantee PIM acceleration. On the other hand, weakness in a characteristic can in some cases be overcome via optimizations (discussed below). Further, as with any new acceleration solution, while amenability is to be assessed first, efficient computation orchestration is necessary to realize acceleration. We focus in this section on amenability and discuss some of our efficient orchestration learnings in Section 4.1. In the following discussion, we assume only that the programmer starts with a state-of-the-art GPU implementation of the computation.

3.1.1 Memory bandwidth limited

PIM’s primary performance benefit comes from increasing effective memory bandwidth (some forms of PIM may offer improved memory latency as well, but this work focuses on the bandwidth benefits). Therefore, PIM will not improve performance for workloads that do not stress available memory bandwidth. Memory bandwidth sensitivity will depend on both the workload and the target architecture. This property can be tested analytically by calculating the algorithmic op/byte ratio (Figure 3a) and determining whether it falls in the compute-limited or memory-limited range on a roofline model for the target architecture.

PIM-amenable heuristic: Low algorithmic op/byte ratio
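As a worked instance of this test, the sketch below computes the op/byte ratio of an element-wise vector sum on 16-bit data and compares it with the roofline ridge point of the GPU in Table 1. Counting one add against two reads and one write per element is our assumption about how the ratio is tallied; the function names are illustrative.

```python
def op_per_byte(ops, bytes_moved):
    """Algorithmic intensity: operations per byte moved to or from memory."""
    return ops / bytes_moved


def is_memory_limited(intensity, peak_ops_per_s, mem_bw_bytes_per_s):
    """True if the intensity falls left of the roofline ridge point."""
    ridge = peak_ops_per_s / mem_bw_bytes_per_s
    return intensity < ridge


BYTES_PER_ELEM = 2  # 16-bit data types

# Vector sum c[i] = a[i] + b[i]: one add per element, two reads and one write.
vec_sum = op_per_byte(ops=1, bytes_moved=3 * BYTES_PER_ELEM)

# GPU column of Table 1: 45 FP16 TFLOPS and 400 GB/s per HBM stack.
gpu_peak, gpu_bw = 45e12, 400e9

print(round(vec_sum, 2))                             # 0.17
print(gpu_peak / gpu_bw)                             # 112.5
print(is_memory_limited(vec_sum, gpu_peak, gpu_bw))  # True
```

With a ridge point above 100 op/byte, any primitive with an op/byte ratio in the low single digits sits firmly in the memory-limited region for this GPU.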

3.1.2 Memory residency and low on-chip reuse

Even for workloads that are limited by memory bandwidth, residency and on-chip reuse of computation’s inputs/outputs can preclude PIM amenability. Should a computation’s inputs/outputs be resident in on-chip structures or manifest enough reuse when moved from memory to on-chip structures, the computation is less likely to attain acceleration with PIM. For the former, employing PIM would necessitate flushing data to memory, resulting in added data movement overhead. For the latter, with enough reuse, moving data closer to the processor to take advantage of the fast and low energy caches and compute available at the processor is a better solution than using slower compute available in memory.

We can test for both memory residency and low on-chip reuse by determining the proportion of computations that require accessing physical memory versus those that only access on-chip structures (Figure 3b). Comparing this heuristic to a PIM memory bandwidth multiplier (Section 2.2) can be used to test if an application manifests this PIM-amenable characteristic. Application dataflow can indicate if a computation’s inputs/outputs are likely to be resident in on-chip structures. Computations benefiting from cache-aware optimizations (e.g., tiling, kernel fusion) will also manifest good on-chip reuse.

PIM-amenable heuristic: Ratio of memory accesses and on-chip structure accesses > PIM bandwidth multiplier
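One literal reading of this heuristic can be sketched as follows. The access counts are hypothetical profiler-style inputs, the function names are ours, and the multiplier of 4 is the approximate HBM-PIM ratio of PIM data bandwidth to memory bandwidth from Table 1 (1229 / 307); treat this as an illustration of the comparison, not as a calibrated predictor.

```python
import math


def memory_to_onchip_ratio(memory_accesses, onchip_accesses):
    """Ratio of accesses served by DRAM to accesses served on chip."""
    if onchip_accesses == 0:
        return math.inf  # nothing is reused on chip
    return memory_accesses / onchip_accesses


def passes_residency_heuristic(memory_accesses, onchip_accesses, pim_bw_multiplier):
    """True if the memory/on-chip access ratio exceeds the PIM bandwidth multiplier."""
    ratio = memory_to_onchip_ratio(memory_accesses, onchip_accesses)
    return ratio > pim_bw_multiplier


PIM_BW_MULTIPLIER = 4.0  # approx. 1229 / 307 for HBM-PIM in Table 1

# Hypothetical streaming kernel: every access goes to memory.
print(passes_residency_heuristic(1000, 0, PIM_BW_MULTIPLIER))     # True
# Hypothetical tiled kernel: most accesses hit in on-chip structures.
print(passes_residency_heuristic(1000, 9000, PIM_BW_MULTIPLIER))  # False
```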

3.1.3 Operand Locality

As discussed in Section 2.2, compute units in commercial PIM designs are associated with specific memory bank(s). As such, interacting operands in a computation should map to the same bank to effectively harness PIM acceleration (Figure 3c). We term this property operand locality in our discussion. In the absence of operand locality, costly GPU-orchestrated data transfers between banks will be necessary, which will eat into PIM acceleration.

We propose an operand interaction-centric testing of operand locality. Single operand scenarios (e.g., in-place updates) trivially manifest operand locality. Commutative interactions between multiple elements in a single data structure (e.g., reductions) are also trivial, as interactions between elements in the same bank can simply be performed first. For multi-operand, multi-data structure cases, we observe that localized operand interactions, wherein small subsets of operands within multiple data structures interact with each other, are generally PIM-amenable. A good example of this behavior is element-wise computations (e.g., in vector sum, a single element in each data structure interacts with a single element in another data structure) which can achieve operand locality via careful co-alignment at allocation [7]. In some cases, localized operand interaction can be induced via data mapping (e.g., packing a matrix in matrix-vector multiplication). In such cases, the costs of doing so have to be factored in when assessing PIM impact.

PIM-amenable heuristic: Localized operand interaction

3.1.4 Aligned Data Parallelism

The bandwidth boost attained in PIM is possible via execution of the same operation in parallel across multiple banks. As discussed in Section 2.2, as memory operations have associated row and column addresses, this bank-parallel execution can be employed when operands in different banks in a computation are at the same row/column locations (e.g., see accessing operand a across banks vs accessing operand c in Figure 3d). Note that, within a single DRAM word (256-bit single DRAM column), interacting operands therein (e.g., 32-bit operands) also have to align (depicted as SIMD alignment in Figure 3d). In the absence of this, costly shift operations will be necessary (shifter implementations can be costly in DRAM technology due to the limited number of metal layers). We term these properties together as aligned data parallelism. As processors often spread a contiguous physical address chunk across multiple channels/banks, ensuring interacting PIM operands are interleaved similarly at allocation time can help attain this characteristic.

PIM-amenable heuristic: Address-interleaving aware allocations
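A toy address map helps show what interleaving-aware allocation buys. In the sketch below, only the 32-byte (256-bit) DRAM word comes from the text; the bank count, row length, word-granularity interleaving order, and all function names are hypothetical choices for illustration and do not describe a real memory controller.

```python
WORD_BYTES = 32    # 256-bit DRAM word (from the text)
NUM_BANKS = 4      # assumed for illustration
COLS_PER_ROW = 8   # assumed number of DRAM words per row


def decode(addr):
    """Map a byte address to (bank, row, col, lane_offset) under the toy interleaving."""
    word, lane_offset = divmod(addr, WORD_BYTES)
    bank = word % NUM_BANKS                     # consecutive words rotate across banks
    col = (word // NUM_BANKS) % COLS_PER_ROW
    row = word // (NUM_BANKS * COLS_PER_ROW)
    return bank, row, col, lane_offset


# Bytes that cover one full row in every bank.
ROW_SPAN = WORD_BYTES * NUM_BANKS * COLS_PER_ROW


def co_aligned(base_a, base_b, index, elem_bytes=2):
    """True if a[index] and b[index] share bank, column, and SIMD lane."""
    bank_a, _, col_a, lane_a = decode(base_a + index * elem_bytes)
    bank_b, _, col_b, lane_b = decode(base_b + index * elem_bytes)
    return (bank_a, col_a, lane_a) == (bank_b, col_b, lane_b)


def broadcastable(first_word):
    """True if NUM_BANKS consecutive words hit every bank at one (row, col)."""
    locs = [decode((first_word + i) * WORD_BYTES) for i in range(NUM_BANKS)]
    banks = {bank for bank, _, _, _ in locs}
    row_cols = {(row, col) for _, row, col, _ in locs}
    return len(banks) == NUM_BANKS and len(row_cols) == 1


# Bases that are multiples of ROW_SPAN keep interacting elements co-located.
print(all(co_aligned(0, 4 * ROW_SPAN, i) for i in range(2048)))  # True
print(co_aligned(0, 4 * ROW_SPAN + WORD_BYTES, 0))  # False: lands in another bank
print(co_aligned(0, 4 * ROW_SPAN + 2, 0))           # False: SIMD lanes misaligned
print(broadcastable(0), broadcastable(2))           # True False
```

Under this toy map, aligning each array's base to a multiple of `ROW_SPAN` is enough to give element-wise operands the same bank, column, and lane, which is the property a broadcast command relies on.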

3.2 PIM-amenability for Primitives

We evaluate PIM-amenability of the primitives under study (Section 2.3) using our above test. We also discuss how our test gave good guidance on mapping primitives to PIM. In addition to primitives under study, we also study a vector-sum primitive, which has been mapped to commercial PIM by prior works. We do so to both evaluate our test against a known PIM-amenable computation and to provide a comparison point for our studied primitives. Finally, our analysis below considers these computations in isolation. That is, dataflow considerations have to be factored in when other (possibly non-PIM) primitives are involved.

Vector Sum: We depict the vector-sum primitive in Figure 3a. This primitive manifests low op/byte (0.17), no data reuse, and localized operand interaction (a single element in each structure interacts with a single element in other structures). By co-aligning input structures at memory allocation, it can attain aligned data parallelism. As such, this primitive is highly PIM-amenable.

Figure 4: (a) Methodology to offload primitives to PIM. Applying proposed methodology to primitives under study (b, c, d).

Wave Simulation: With low op/byte (0.43-1.72), wave simulation primitives under study (wavesim-volume and wavesim-flux) are likely to be memory bandwidth-limited for most architectures. Further, for problem sizes that do not fit in cache, these primitives manifest low on-chip reuse. While largely relying on localized interaction, these primitives do manifest interactions between neighboring mesh element faces (flux in Figure 2a). As such, careful memory allocation is necessary to maximize mapping of interacting neighbors to the same DRAM bank and further, not all computation can be mapped to PIM (interactions between neighboring elements in different banks). Finally, these primitives operate on large regular grids of elements which can be harnessed to achieve aligned data parallelism. Overall, wave simulation primitives show promise in terms of PIM amenability, albeit care is necessary to attain operand locality.

Sparse Skinny GEMMs: With one of the input matrices being skinny (small $N$ for GEMM $M \times N \times K$), the ss-gemm primitive manifests low op/byte (0.5-2 with $N \leq 4$) and low reuse of data and can benefit from PIM for sufficiently large problem sizes. By streaming the skinny matrix to PIM units and keeping the other matrix stationary in memory, operand locality can be simplified. Further, unless the row-size of the matrix resident in memory is considerably large, aligned data parallelism requires considerable care for ss-gemm and we discuss how this guided our data-mapping (Section 4).

Push-based Computation: The dominant computation for push-primitive is the access and update of neighboring nodes (Figure 2c) which manifests low op/byte (0.25). Sufficiently large problem sizes can stress on-chip structures, albeit on-chip reuse is dependent on connectivity of the graph, leading to selective offloads to PIM. With memory-resident graph (neighbor nodes) and streaming source node data, operand locality is trivial. However, irregularity of accesses to neighbor nodes precludes most aligned data parallelism. As we will discuss, this limits PIM potential for push-primitive (Section 4).

4 Baseline PIM

We discuss in this section a general methodology, guided by our PIM-amenability-test, to offload a given primitive to PIM. Next, we use it to discuss how we offloaded primitives under study to PIM. We then discuss PIM performance and also detail sources of inefficiency.

4.1 PIM Orchestration Considerations

We start with a discussion of some general considerations in offloading computations to PIM. In the context of our system, a computation is offloaded to PIM via launching a pim-kernel. This is analogous to a GPU kernel, except a pim-kernel issues pim-instructions. These instructions cause pim-commands to be enqueued at the memory controller. These pim-commands trigger operations such as computations on operands in DRAM or data movement between DRAM and registers. In order to preserve register dependencies, pim-commands are issued in FIFO order from the memory controller queue. Note that, while we focus on the orchestration of standalone primitives, additional considerations will be necessary (e.g., managing caches) when other computations (non-PIM) are involved that access the same data as PIM primitives (see Section 6).

4.2 Offloading Primitives to PIM

We first discuss our general PIM offload methodology followed by applying it to primitives under study. We also include vector-sum discussion for comparative purposes.

4.2.1 PIM Offload Methodology

When considering the offload of a primitive to PIM, as discussed in Section 3, a programmer can first assess the primitive using our proposed PIM-amenability-test. Assuming PIM amenability exists, we discuss here a methodology which can serve as a general template to offload a primitive to PIM. We depict it in Figure 4a.

As discussed in Section 3.1, ensuring operand locality and aligned data parallelism is critical to attaining PIM acceleration. To that end, identifying operands or data structures (1) and, more crucially, identifying interactions between them (2) are the first steps in offloading to PIM. Subsequently, operands are placed in DRAM banks (3) such that costly inter-bank communication, cross-SIMD operations, and (where possible) inter-row interactions are avoided. Finally, a stream of pim-instructions is deduced (4) to orchestrate the computation over PIM units.

In many cases, compute orchestration follows naturally once data-placement is deduced. However, we also observe that allocation/management of pim-registers to minimize row activation overheads (i.e., staging data from open rows into pim-registers) is often necessary. Note that GPU implementations similarly optimize for effective use of registers.

4.2.2 Vector Sum

Data placement: As vector-sum manifests elementwise interaction between two input and one output array, allocating structures such that elements at a given offset are mapped to the same bank [6] ensures PIM-amenable data placement. Note that this is a common operand interaction scenario for PIM-amenable computations.

Command orchestration: Command orchestration for vector-sum is done via broadcast pim-commands, which sequentially read input values, perform adds, and write output values. Further, effective use of pim-registers to stage data from DRAM to minimize row activation overheads is needed.
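
The placement and orchestration described above can be illustrated with a minimal sketch. The interleaving granularity, bank count, command names (`activate`, `pim-load`, `pim-add`, `pim-store`), and the helper functions are illustrative assumptions, not the command encoding of any commercial PIM design; only the row buffer size and register count come from Table 2.

```python
# Hypothetical address map: consecutive 32 B words interleave across banks.
WORD_BYTES = 32                       # assumed DRAM word per bank access
NUM_BANKS = 16                        # assumed number of interleaved banks
ROW_BYTES = 1024                      # row buffer size (Table 2)
WORDS_PER_ROW = ROW_BYTES // WORD_BYTES


def bank_of(base, index, elem_bytes=2):
    """Bank holding element `index` of an FP16 array that starts at byte `base`."""
    addr = base + index * elem_bytes
    return (addr // WORD_BYTES) % NUM_BANKS


def aligned(base_a, base_b):
    """Two arrays co-locate elementwise if their bases differ by a multiple
    of the bank-interleaving stride."""
    return (base_a - base_b) % (WORD_BYTES * NUM_BANKS) == 0


def vector_sum_commands(words_per_bank, num_regs=16):
    """Broadcast command stream for c = a + b over `words_per_bank` words per bank.
    A chunk of `a` is staged in registers so that each row switch is amortized
    over up to `num_regs` words."""
    cmds = []
    chunk = min(num_regs, WORDS_PER_ROW)
    for start in range(0, words_per_bank, chunk):
        n = min(chunk, words_per_bank - start)
        cmds.append(("activate", "a"))
        cmds += [("pim-load", r) for r in range(n)]    # a -> registers
        cmds.append(("activate", "b"))
        cmds += [("pim-add", r) for r in range(n)]     # registers += b
        cmds.append(("activate", "c"))
        cmds += [("pim-store", r) for r in range(n)]   # registers -> c
    return cmds
```

Under these assumptions, summing 32 words per bank with 16 registers needs 6 row activations, whereas a single staging register would need 96, which is the row activation overhead that register staging is meant to reduce.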

4.2.3 Wave Simulation

Data placement: Wave simulation largely operates over arrays and employs three types of operand interaction: elementwise, reduction, and neighboring mesh element. Of these, while elementwise interaction can be tackled via data placement as was done for vector-sum, reduction and neighbor interaction warrant more care. For the former, blocked data placement is employed (discussed below in the context of ss-gemm primitive which also employs this). For the latter, array (grid) elements are placed such that, to the extent possible, neighboring faces reside in the same bank as depicted in Figure 4b.

Command orchestration: Despite their PIM amenable properties, wavesim primitives exhibit complex interaction patterns between operands which complicate orchestration. Considerable care is necessary to effectively utilize available registers while avoiding memory spills and lowering row activation overheads. While we hand schedule the computation in our analysis, existing compiler methods for register allocation [18, 16] can be adapted for PIM-specific cost awareness.

Figure 5: Workload ss-gemm data mapping.

4.2.4 Sparse Skinny GEMMs

Data placement: To harness broadcast pim-commands and concomitant performance, we place the input dense matrix in a blocked format as depicted in Figure 5, which is tailored to both the dimensions of the matrix and the address interleaving of the system. This layout has multiple attractive properties: skinny matrix values can be broadcast as an immediate PIM operand on the data bus, and accumulation of partial products avoids inter-bank, intra-SIMD, and (to the extent possible) inter-row operations.

Command orchestration: Compute orchestration follows from data placement: an element of the skinny matrix is broadcast to banks and partial results are accumulated in pim-registers before being written to memory.
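
As a rough functional illustration of this orchestration, the sketch below emulates the broadcast-and-accumulate pattern in plain Python. The function name `ss_gemm_pim`, the list-of-rows matrix layout, and the treatment of every dense-matrix row as one SIMD lane are assumptions made for illustration; the blocked layout of Figure 5 is not modeled.

```python
def ss_gemm_pim(dense, skinny):
    """Emulate out = dense (M x K) * skinny (K x N).

    Each counted command broadcasts one skinny-matrix value as an immediate
    operand and updates all M lanes, mimicking a broadcast pim-command.
    Returns the M x N result and the number of multiply commands issued."""
    M, K, N = len(dense), len(skinny), len(skinny[0])
    out = [[0.0] * N for _ in range(M)]
    commands = 0
    for n in range(N):
        acc = [0.0] * M                      # accumulators held in pim-registers
        for k in range(K):
            s = skinny[k][n]                 # value sent on the data bus
            for m in range(M):               # lanes run in lockstep in hardware
                acc[m] += dense[m][k] * s
            commands += 1
        for m in range(M):
            out[m][n] = acc[m]               # write accumulated column to memory
    return out, commands
```

For example, multiplying a 2x2 dense matrix by a 2x2 identity skinny matrix returns the dense matrix unchanged and issues four multiply commands, one per skinny-matrix value.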

4.2.5 Push-based Computation

Data placement: Variation in graph connectivity precludes the use of broadcast commands and co-location of interacting neighbors (source and destination nodes in Figure 4d). Instead, single-bank pim-commands execute in-place destination updates, avoiding operand locality layout constraints.

Command orchestration: Compute orchestration for push-primitive also follows from its data placement. The source node’s value is read from memory, then updates to neighboring nodes are calculated and applied using single-bank pim-commands (namely, a pim-ADD command loads the current value and adds an operand supplied on the data bus, placing the result in a pim-register, and a pim-store command stores this result to memory).
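
A minimal sketch of this command sequence follows. The node-to-bank mapping (`dst % NUM_BANKS`), the `contribution` callback, and the choice to read source values from a snapshot taken before any update are hypothetical simplifications; only the pim-ADD followed by pim-store pairing comes from the description above.

```python
NUM_BANKS = 16  # assumed bank count for the hypothetical node-to-bank mapping


def push_updates(edges, values, contribution):
    """Apply push-style updates in place.

    edges: list of (src, dst) pairs; values: per-node values;
    contribution: function mapping a source value to the pushed operand.
    Returns the updated values and the single-bank command list."""
    mem = list(values)
    cmds = []
    for src, dst in edges:
        operand = contribution(values[src])   # source value read from memory
        bank = dst % NUM_BANKS                # hypothetical placement of dst
        reg = mem[dst] + operand              # pim-ADD: load value, add bus operand
        cmds.append(("pim-ADD", bank, dst))
        mem[dst] = reg                        # pim-store: write register back
        cmds.append(("pim-store", bank, dst))
    return mem, cmds
```

Each edge costs two single-bank commands in this sketch, and the pim-store carries no data on the bus.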

4.3 Performance Analysis

Table 2: Parameters for performance model [2].
Parameter                            Value
#Banks per Channel / 4-high Stack    16 / 512
Bandwidth per Pin                    4.8 Gb/s
GPU Mem. Bandwidth per Stack         614.4 GB/s
Row Buffer Size                      1024 B
DRAM Parameters                      tRP = 15 ns, tCCDL = 3.33 ns, tRAS = 33 ns
PIM Parameters                       #PIM Units per Stack = 256
                                     #PIM Registers per ALU = 16
                                     Peak HBM Bandwidth 614.4 GB/s

We first discuss our performance model assumptions and then analyze the PIM acceleration attained for the primitives under study. Finally, we discuss unique challenges that arise when offloading computation to PIM, along with an opportunity; addressing the former and harnessing the latter can unlock further PIM acceleration.

4.3.1 Performance Models

In our work, we use analytical models to evaluate performance. Our choice is guided by the following reasons. First, commercial PIM designs are still only available as functional prototypes. Second, we aim to study primitives considering realistic problem sizes, where PIM is likely to be beneficial. This renders GPU simulators difficult to use due to long simulation times. Finally, we assume a strong GPU baseline, even considering future software optimizations (see below).

GPU Performance Model: For our GPU baseline, we assume the execution time is primarily a function of memory bandwidth (assumed to be 90% of peak) and data accessed. Further, we assume perfect reuse with two exceptions: inter-timestep reuse is not modeled for wavesim (we assume polynomial degree $p=2$, 729 data points per element, and 65K elements per GPU, which is too large to fit in cache), and cache locality for push-primitive is based on actual cache hit rates measured using rocprof [3] with push-based workloads from graphBIG [41] (specifically, hit rates of 44%, 20%, and 57% are observed for roadnet-usa [20], a synthetic power-law graph with 1M nodes and 10M edges, and a synthetic power-law graph with 10M nodes and 100M edges, respectively).

We believe this to be a fair assumption for memory-limited workloads, which manifest low op/byte ratios as discussed in Section 3.2. Further, we assume HBM3 memory [2] in our analysis (Table 2). We do so both to be forward-looking and to give our baseline GPU the best possible memory bandwidth. As such, this provides a compelling baseline against which to compare PIM acceleration benefits.
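
The bandwidth-bound GPU model can be written down in a few lines. This is a sketch rather than the authors' model code: the function name `gpu_time_ns` is hypothetical, and the 1024-pin interface width per stack is an assumption (it is the standard HBM data width and reproduces the 614.4 GB/s figure in Table 2).

```python
PIN_RATE_GBPS = 4.8       # bandwidth per pin in Gb/s (Table 2)
PINS_PER_STACK = 1024     # assumed data interface width per HBM stack
PEAK_GBS = PIN_RATE_GBPS * PINS_PER_STACK / 8   # peak bandwidth in GB/s


def gpu_time_ns(bytes_accessed, efficiency=0.90, peak_gbs=PEAK_GBS):
    """Execution time of a memory-bound kernel: bytes / achieved bandwidth.
    1 GB/s equals 1 byte/ns, so the result is in nanoseconds."""
    return bytes_accessed / (efficiency * peak_gbs)


# Example: FP16 vector-sum over 2**20 elements reads two arrays and writes one.
vector_sum_bytes = 3 * 2 * 2**20
vector_sum_ns = gpu_time_ns(vector_sum_bytes)
```

Under perfect reuse, `bytes_accessed` is just the footprint of the inputs and outputs; the two exceptions noted above would enter as additional bytes for data that is re-fetched.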

Further, for ss-gemm, we assume an optimized GPU baseline which can exploit row-sparsity to avoid both loading the zero rows and computing on them. We estimate this sparsity by analyzing row-sparsity occurrence for computations in the MLPerf DLRM-based recommendation model [43] using the Terabyte Click Logs testing dataset [19].

PIM Performance Model: For PIM, we first deduce detailed pim-commands (Section 4.2) and subsequently take into account DRAM timings (Table 2) and operations (row activation, etc.) to determine PIM execution time. Multi-bank pim-commands are issued in-order at half the rate of regular reads/writes, as is the case with the HBM-PIM design [34] (this rate is dictated by the $t_{CCDL}$ timing parameter for back-to-back requests to the same bank group, as opposed to the minimum possible time between reads/writes, $t_{CCDS}$). Single-bank pim-commands can be freely reordered and can be issued at the same rate as regular reads/writes. Further, push-primitive updates are also assumed to occur atomically, which can be guaranteed by existing per-address ordering assumptions in the memory controller.
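
A simplified version of such a model is sketched below. It is not the authors' model: the function `pim_time_ns` is hypothetical, the per-row-switch cost of tRP + tRAS is a pessimistic placeholder assumption (the text does not state how activation latency is charged), and the single-bank command interval of tCCDL/2 is inferred from the statement that multi-bank commands issue at half the regular rate.

```python
T_CCDL_NS = 3.33   # Table 2
T_RP_NS = 15.0     # Table 2
T_RAS_NS = 33.0    # Table 2


def pim_time_ns(multi_bank_cmds, row_switches, single_bank_cmds=0,
                row_switch_ns=T_RP_NS + T_RAS_NS):
    """Estimate PIM execution time from a command count.

    multi_bank_cmds: broadcast commands, issued one per tCCDL.
    row_switches: row changes on the critical path (assumed unoverlapped).
    single_bank_cmds: issued at the regular read/write rate (assumed tCCDL / 2).
    row_switch_ns: placeholder cost per row switch; an assumption."""
    compute = multi_bank_cmds * T_CCDL_NS
    single = single_bank_cmds * (T_CCDL_NS / 2)
    activation = row_switches * row_switch_ns
    return compute + single + activation
```

Even this coarse model shows why row activations matter: with the placeholder cost, one row switch takes as long as roughly 14 multi-bank commands.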

Figure 6: Commercial PIM speedup relative to GPU. For ss-gemm, N represents skinny matrix width. For push-primitive, L2-HR represents the hit rate measured at L2 cache for each graph evaluated.

4.3.2 PIM Speedup Analysis

We depict PIM speedups for studied primitives in Figure 6. For our PIM strawman design, the upper bound for performance is about 4x assuming the baseline GPU can utilize 100% of peak memory bandwidth (optimistic). As depicted, while vector-sum attains over 2.6x speedup, primitives under study in this work fare far worse, delivering speedups between 0.23x and 1.66x relative to the GPU.
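
One plausible way to arrive at a bound of about 4x from the parameters in Table 2 is sketched below. The derivation is an assumption on our part rather than one spelled out in the text: it supposes that every PIM unit consumes one 32 B DRAM word (the per-bank access size noted in Section 4.3.3) per multi-bank command, with commands issued every tCCDL.

```python
PIM_UNITS_PER_STACK = 256   # Table 2
WORD_BYTES = 32             # DRAM word accessed per bank by one pim-command
T_CCDL_NS = 3.33            # Table 2
GPU_PEAK_GBS = 614.4        # Table 2


def pim_peak_gbs():
    """Aggregate in-memory bandwidth if all PIM units consume one word per tCCDL.
    Bytes per nanosecond equals GB/s."""
    return PIM_UNITS_PER_STACK * WORD_BYTES / T_CCDL_NS


def speedup_bound(gpu_efficiency=1.0):
    """Upper bound on PIM speedup over a bandwidth-bound GPU baseline."""
    return pim_peak_gbs() / (gpu_efficiency * GPU_PEAK_GBS)
```

Under these assumptions `speedup_bound()` evaluates to roughly 4.0, and `speedup_bound(0.9)` to roughly 4.45 against a GPU that achieves 90% of peak bandwidth.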

For the ss-gemm primitive, except for very skinny matrices ($N=2$), PIM incurs an increasing slowdown (between 25% and 77%) compared to the GPU as $N$ increases. This is expected because, as data reuse improves, moving data to the GPU and exploiting the reuse on chip is more beneficial. Similar trends are observed for push-primitive as cache hit rate improves. Overall, we observe that, even with considerable care to place data appropriately and orchestrate computation efficiently, commercial PIM designs do not attain broad acceleration.

4.3.3 Challenges/Opportunity for PIM Acceleration

We discuss in this section further analysis of our PIM speedups, focusing on unique challenges when computations are offloaded to PIM and also a unique opportunity. We also discuss how the identified challenges are more a function of commercial PIM designs and not specific to primitives under study. That is, addressing these challenges will likely enable broad acceleration with commercial PIM.

Challenge - Row-activations: A key impediment to PIM acceleration for all primitives (including vector-sum) is row activation overheads, that is, latency costs to open DRAM rows. In fact, this accounts for 27% and 50% of total latency for the wavesim-volume and wavesim-flux primitives, respectively. In the baseline GPU, bank-level parallelism, wherein row activation latency in one bank is overlapped with data access from another bank, can be employed. However, this is not possible for multi-bank pim-commands. Further, while pim-registers can be used to stage data from rows to lower row activations, this is difficult to realize for primitives with complex operand interactions and many intermediate results, which also consume registers (e.g., wavesim primitives).

Challenge - Cache reuse: Primitives whose access patterns exhibit high cache locality are a poor fit for PIM, since on-chip reuse exploited by the GPU can outweigh the bandwidth benefit of PIM. In some cases, statically determining reuse potential and/or which accesses will exhibit cache reuse is difficult because locality and reuse potential are dependent on the input data (e.g., for ss-gemm and push-primitive). With a binary PIM-offload policy (all or nothing), while the GPU can harness the benefits of locality, baseline PIM does not, leading to poor performance.

Challenge - Registers/command bandwidth: We also observe in this work that PIM architecture decisions, specifically the number of registers and command bandwidth available to single-bank pim-commands, can limit PIM acceleration. While we discussed register space implications above, command bandwidth is a unique bottleneck for push-primitive which relies on single-bank PIM commands. While the regular GPU access rate is limited by data bandwidth, and multi-bank PIM command rate is limited by ALU resources, single-bank PIM command rate, in the case of push-primitive, is exclusively limited by command bandwidth especially since this primitive relies on commands which do not send operands on the data bus (pim-store, Section 4.2.5).

Opportunity - Sparsity: For our ss-gemm GPU implementation, we assume a GPU baseline that can exploit sparsity. If, similarly, PIM can exploit sparsity, perhaps even finer-grain sparsity than is possible at the GPU, PIM acceleration can be improved further. To this end, we observe that commercial PIM designs orchestrate computation using pim-commands issued subject to fixed timing constraints (similar to how traditional memory operations are issued). These commands are issued at fine granularity (a pim-command accesses at most one 32B DRAM word per bank). This, as we show below, can be used to exploit fine-grain data sparsity with PIM.

5 Optimized PIM

Figure 7: Optimizations for improving commercial PIM performance.

We discuss in this section some optimizations which can help improve acceleration availed by commercial PIM across varying domains and subsequently analyze the impact of these optimizations on PIM acceleration.

5.1 Targeted PIM Optimizations

We discuss here both hardware augmentations and software optimizations that can help address challenges and harness opportunity discussed in Section 4.3.3.

5.1.1 PIM Architecture-aware

We first tackle row activation overhead evident in PIM executions. To this end, we design a PIM architecture-aware optimization, which we term architecture-aware row activation. In the baseline PIM design, row activation for all banks is incurred on the critical path and is followed by compute commands, each of which is issued to a subset (even or odd) of the available banks (see Baseline schedule in Figure 7a), as the PIM unit/ALU is shared by two DRAM banks. Note that keeping PIM compute commands in program order is necessary to honor register dependencies.

Instead of this schedule, we propose to first split all-bank row activations into separate even and odd bank activations. This allows for a decoupled parallel schedule depicted as PIM arch-aware schedule in Figure 7a. This schedule hides activation latency behind useful work by eagerly activating the next row in one subset of banks while compute commands are being performed in the opposite subset. It also does not impact the order of compute commands or serialized activation latencies and dependencies within odd and even banks, which ensures functionally correct execution. Memory controllers can be augmented to generate/issue such architecture-aware row activations. Further, a compiler pass can also aid in the generation of architecture-aware row activations.
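
The difference between the two schedules can be sketched with a small timeline model. This is a simplification under stated assumptions, not the evaluated model: the function names are hypothetical, every row is assumed to need the same number of compute commands per bank subset, activation latency is a single parameter `t_act`, and any limits on how frequently activations may be issued are ignored.

```python
def baseline_ns(rows, cmds_per_subset, t_cmd, t_act):
    """All-bank activation on the critical path, then even and odd compute."""
    return rows * (t_act + 2 * cmds_per_subset * t_cmd)


def arch_aware_ns(rows, cmds_per_subset, t_cmd, t_act):
    """Even/odd activations are split so that opening the next row in one
    subset overlaps with compute commands issued to the other subset."""
    compute = cmds_per_subset * t_cmd
    ready = {"even": t_act, "odd": t_act}   # first rows are opened up front
    now = 0.0
    for _ in range(rows):
        for subset in ("even", "odd"):      # compute stays in program order
            start = max(now, ready[subset]) # wait until this subset's row is open
            now = start + compute
            ready[subset] = now + t_act     # eagerly activate the subset's next row
    return now
```

With 16 commands per subset at 3.33 ns each and a 48 ns activation, two rows take about 309 ns in the baseline and about 261 ns with the split schedule, since the second activation is fully hidden. With only 4 commands per subset, the activation is only partly hidden (about 149 ns versus 136 ns), which is consistent with the observation that few commands per row activation limit the benefit.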

5.1.2 Sparsity-aware

A unique opportunity offered by commercial PIM designs (Section 4.3.3) stems from their reliance on pim-commands issued at fine granularity subject to fixed DRAM timing constraints (similar to how traditional memory operations are issued). This design decision can be harnessed at the software level to perform additional checks at the processor and opportunistically skip issuing commands when certain conditions (e.g., sparsity) are met. Figure 7b illustrates how this works for ss-gemm: before issuing a pim-command to multiply a value of the sparse skinny matrix with a column in the dense matrix, the sparse matrix value is inspected; if this value is zero, the command is skipped, improving performance.
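
The effect of this check on the number of issued commands can be sketched as follows. The function `commands_issued` and its three policy names are hypothetical; the `row` policy only mirrors the row-level granularity attributed to the GPU baseline so that the two granularities can be compared on the same input.

```python
def commands_issued(skinny, mode):
    """Count multiply commands for a K x N skinny matrix (list of rows).

    mode == "dense":   one command per skinny-matrix value.
    mode == "row":     skip rows of the skinny matrix that are entirely zero.
    mode == "element": inspect each value and skip the command if it is zero."""
    K, N = len(skinny), len(skinny[0])
    if mode == "dense":
        return K * N
    if mode == "row":
        return sum(N for row in skinny if any(v != 0 for v in row))
    if mode == "element":
        return sum(1 for row in skinny for v in row if v != 0)
    raise ValueError(f"unknown mode: {mode}")


example = [[0, 0], [1, 0], [0, 2], [3, 4]]
```

For `example`, the dense policy issues 8 commands, the row-level policy 6, and the element-level policy 4.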

Our sparsity-aware PIM optimization has several notable benefits. First, it enables PIM to harness dynamic sparsity (execution-time sparsity), which is typically hard to exploit in GPUs [54]. Further, it allows sparsity to be harnessed without reliance on specialized sparsity formats, which incur data-transformation and metadata overheads. Finally, sparsity-aware PIM can exploit finer-grain sparsity than is possible for the GPU: PIM can easily exploit element-level sparsity, whereas the GPU harnesses row-level sparsity by skipping loads of skinny-matrix rows that are all zeros (note that exploiting dynamic element-level sparsity at the GPU would require creating specialized sparsity-aware data formats at runtime).

5.1.3 Cache-aware

A binary PIM offload decision hurts PIM performance (push-primitive) by obviating input-dependent caching benefits for PIM executions (Section 4.3.3). Instead, we believe that selective offloads to PIM that consider cache reuse are a superior strategy. To this end, we make a case for cache-aware PIM optimization, where existing fine-grain pim-instructions are offloaded to PIM under the guidance of a locality predictor. Note that such techniques have been studied by prior works (e.g., dynamic schemes [9], offline schemes [5]) and can be augmented to work with commercial PIM designs.

5.1.4 Limit Studies

While so far we have discussed targeted optimizations which can broaden commercial PIM acceleration reach, we also believe that PIM architecture decisions, specifically the command bandwidth available to single-bank pim-commands and the number of PIM registers, are important determinants of PIM acceleration. To that end, we also study in this work the effects of varying these on PIM performance. We believe that careful attention to these decisions will be necessary to avail broad PIM acceleration.

5.2 Performance Analysis

Below, we evaluate how the optimizations and techniques discussed above affect PIM performance. Optimizations are employed largely in a targeted manner, focused on the primary bottlenecks of each primitive. Since wavesim primitives exhibit register pressure that exacerbates activation overhead, we study architecture-aware row activation and increased register resources for these primitives. Since ss-gemm exhibits dynamic sparsity, we study sparsity-aware PIM for this primitive. Since push-primitive exhibits input-dependent cache locality and is limited by command bandwidth, we study cache-aware PIM and increased command bandwidth resources for this primitive. While these bottlenecks are specific to primitives under study, we do believe that these are fundamental challenges which will be experienced more widely as PIM is harnessed more widely. Also note that these optimizations are complementary and can be employed in tandem in future inclusive PIM designs.

5.2.1 Wave Simulation

Figure 8: Optimized PIM speedup for wavesim primitives.

Wavesim acceleration improvements with architecture-aware row activation and increased register resources (baseline PIM, in line with commercial PIM designs, assumes sixteen registers) are depicted in Figure 8. For the wavesim-volume primitive, architecture-aware optimization improves PIM speedup from 1.5x to 2.04x. Further, this optimization entirely eliminates row activation overheads for this primitive such that more registers do not improve performance. This is in contrast to the wavesim-flux primitive which exhibits higher register pressure and row activation overheads. At lower register counts (16), architecture-aware activation does not improve performance because there are not enough commands per row activation to hide parallel activation latency or to amortize serial activation latency. However, more resources reduce register pressure, enabling this optimization to better hide activation latency and achieve up to $2.63\times$ speedup over the GPU baseline.

5.2.2 Sparse Skinny GEMMs

Figure 9: Optimized PIM speedup for ss-gemm.

For ss-gemm, we focus on implications of our sparsity-aware PIM optimization (depicted in Figure 9). We observe here that sparsity-aware PIM significantly improves the PIM speedup (more than $3\times$) with expected tapering in benefits with increased reuse at the GPU (increasing $N$). Further, it also allows PIM to manifest acceleration in scenarios where baseline PIM manifested a slowdown (speedup of $1.07\times$ for $N=8$, while baseline PIM suffers from 57% slowdown).

5.2.3 Push-based Computation

Figure 10: Optimized PIM speedup for push-primitive.

As PIM acceleration for push-primitive is most hindered by cache unawareness and lower command bandwidth for single-bank pim-commands, we study the effects of both for this primitive and depict the resulting acceleration in Figure 10. For this study, we model the effects of a locality-based predictor, using a cache model (16-way, 4MB, LRU replacement) which classifies updates to graph nodes in push-primitive as either likely manifesting reuse (performed in cache) or not (performed in PIM). As before, we evaluate these workloads on three graph inputs with varying degrees of locality, each of which is labeled with its observed L2 cache hit rate. Overall, cache-aware PIM prevents performance degradation related to the cache reuse observed in the baseline PIM, leading to an average speedup of $1.20\times$ (max $1.39\times$).
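
One simple reading of such a predictor is a shadow cache model whose hit/miss outcome decides where each update is performed. The sketch below follows that reading; the class `LruCacheModel`, the function `route_updates`, and the 64 B line size are assumptions, and treating a model hit as "likely reuse" is our interpretation rather than the predictor evaluated in the paper.

```python
from collections import OrderedDict


class LruCacheModel:
    """Set-associative cache model with LRU replacement (defaults: 16-way, 4 MB)."""

    def __init__(self, size_bytes=4 * 1024 * 1024, ways=16, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes            # assumed line size
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit; update LRU state in either case."""
        line = addr // self.line_bytes
        entries = self.sets[line % self.num_sets]
        if line in entries:
            entries.move_to_end(line)           # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)         # evict least recently used line
        entries[line] = True
        return False


def route_updates(addrs, cache):
    """Classify each update: predicted reuse goes to cache, the rest to PIM."""
    return ["cache" if cache.access(a) else "pim" for a in addrs]
```

For instance, `route_updates([0, 0, 64, 0], LruCacheModel())` sends the first touch of each line to PIM and the repeated accesses to the cache.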

Further, we also model an optimized GPU baseline wherein the GPU can also leverage the locality predictor to reduce access granularity (i.e., use 32B rather than 64B accesses) for updates that do not benefit from caching. We term this cache-aware GPU, and it achieves up to $1.68\times$ speedup relative to the baseline GPU. Although cache-aware PIM reduces data transferred across the memory interface relative to cache-aware GPU, it does not reduce the command bandwidth demand, and the strictly modeled DRAM latency requirements actually lead to worse performance for cache-aware PIM.

Additional command bandwidth only benefits single-bank PIM commands that do not carry data (i.e., push-primitive pim-store commands). With $4\times$ as much command bandwidth (a 4x multiplier roughly corresponds to how much additional command bandwidth is possible if underutilized data bandwidth could carry command information), PIM performance improves further to exceed cache-aware GPU performance for all inputs and provide up to $2.02\times$ speedup relative to the baseline GPU.

Overall, while the introduction of commercial PIM designs reveals new bottlenecks for many primitives, there clearly is opportunity to address these bottlenecks to support a broader range of primitives.

6 Discussion

This section details additional PIM design considerations that we don’t include in our evaluation, but which are nevertheless important for efficient PIM execution.

Secondary PIM benefits: While we focus on performance benefits, PIM execution offers secondary benefits such as power efficiency and compute utilization. Power benefit arises from the reduction in data movement enabled by PIM. All studied primitives reduce how much data is transferred across the memory interface (by up to 71% for wavesim-volume, 98% for wavesim-flux, 87% for ss-gemm, and 46% for push-primitive), and all but push-primitive reduce how many memory requests are sent from the GPU to memory. In addition, PIM kernels do not use GPU compute resources and use relatively low memory bandwidth to issue PIM commands. PIM therefore also enables co-scheduling of complementary (e.g., compute-intensive) kernels on the GPU.

Data precision: Since current commercial PIM designs are geared towards error-tolerant ML workloads, they do not support higher precision operations than FP16. This can pose challenges for scientific and graph analytics domains, which often assume higher precision data types. Studies like this work can potentially motivate commercial PIM support for higher precision arithmetic, which is one path to resolving this disconnect. However, there have also been multiple recent efforts to enable the use of low-precision arithmetic for HPC and graph workloads [4, 26, 22, 11]. Although these efforts generally aim to reduce the memory footprint or leverage the low-precision matrix engines that exist in ML-optimized GPUs, many of the associated insights can apply to (and are further motivated by) commercial PIM designs as well.

Application-level considerations: Although our PIM-amenability-test (Section 3.1) is designed to be generally applicable, our evaluation of individual PIM primitives does not consider overheads that arise when PIM and non-PIM kernels interact. Commercial PIM systems assume that the target data is present in memory, which can require cache flushes or the use of uncached memory buffers to ensure PIM consistency when communicating between PIM and non-PIM kernels. The associated overheads of such requirements (flush traffic and latency, copies to uncached buffers, and/or reuse prevented by these actions) should be incorporated in the cache reuse and memory residency test when evaluating a primitive in the context of a larger application.

Similarly, kernel fusion is a common optimization wherein multiple computations are fused to avoid round trips to memory. Note that such fusion can be employed for PIM offloads too. If offloading to PIM prevents kernel fusion for GPU, this cost should be factored in assessing PIM-amenability.

Address hashing: Many systems use address hashing to improve memory efficiency by effectively shuffling how addresses are mapped to physical memory units. However, this can conflict with PIM’s need for control over data placement to ensure operand locality and, depending on the hashing function, may even preclude PIM execution. Possible solutions include disabling hashing when PIM is enabled, exposing hashing to software (e.g., page coloring to identify the same hashing group), or limiting address hashing to bits that are irrelevant to PIM mapping (e.g., only hash bank bits which are shared by one PIM unit or row bits only).
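
The conflict, and the last of these remedies, can be illustrated with a toy address map. Everything in the sketch is hypothetical: the bit positions, the XOR-based hash functions, and the pairing of banks 2u and 2u+1 under one PIM unit are assumptions chosen for illustration, not the mapping of any real memory controller.

```python
WORD_SHIFT = 5     # assumed 32 B words: address bits [4:0] are the byte offset
BANK_BITS = 4      # assumed 16 banks: address bits [8:5]
ROW_SHIFT = 14     # hypothetical position of the lowest row bit


def bank_plain(addr):
    """Bank index without hashing."""
    return (addr >> WORD_SHIFT) & ((1 << BANK_BITS) - 1)


def row_of(addr):
    return addr >> ROW_SHIFT


def bank_full_hash(addr):
    """XOR every bank bit with low row bits (spreads rows across all banks)."""
    return bank_plain(addr) ^ (row_of(addr) & ((1 << BANK_BITS) - 1))


def bank_pim_safe_hash(addr):
    """Hash only the lowest bank bit, which selects between the two banks
    assumed to share one PIM unit."""
    return bank_plain(addr) ^ (row_of(addr) & 0x1)


def pim_unit(bank):
    """Banks 2u and 2u + 1 are assumed to share PIM unit u."""
    return bank >> 1
```

In this toy map, two operands at the same offset in rows 0 and 2 land in bank 0 without hashing, are separated onto different PIM units by `bank_full_hash`, and stay on the same PIM unit under `bank_pim_safe_hash`.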

Memory reliability, security: Error detection and correction (EDC) are important features that help ensure the integrity of DRAM access, but incur overhead to verify DRAM loads and set the proper metadata bits for DRAM stores. Similarly, memory encryption/decryption may be needed for privacy-sensitive systems, requiring similar encryption/decryption on the path to and from memory. If EDC or encryption are to be supported with PIM, these functions must be replicated near-bank, requiring additional area overhead. Though they can be pipelined to avoid impacting bandwidth, they also add latency that can appear on the critical path of PIM operations that access memory. However, in many cases this latency can be hidden by interleaving PIM operations to independent parallel data, in the same way that registers are used to exploit row locality.

7 Related Work

Many recent efforts study compute orchestration on different PIM architectures. In this section we summarize some of the most relevant past studies not previously discussed.

In contrast to the near-bank coupling of HBM-PIM and GDDR-PIM, many recent PIM architectures are based on compute units implemented on a "base" logic die 3D-stacked under a set of DRAM dies [14, 15, 17, 51, 42, 10, 13]. However, as HBM already provisions high bandwidth to an external processor via 2.5D-stacking, these approaches do not provide sufficient bandwidth amplification in the context of HBM.

Some prior approaches have proposed application-specific capabilities on DDR DIMMs [31, 29]. However, our focus is on extending in-memory capabilities to a broader set of workloads. Some prior efforts on broader PIM acceleration incorporate full-blown compute cores in DDR DRAM [21]. However, this approach has a higher area cost in DRAM compared to the approaches we consider here due to the presence of instruction fetch and sequencing hardware.

Despite the above-mentioned differences, several of the prior PIM approaches can also be considered complementary to this work as they do not impact DRAM bank architecture and can be deployed in tandem with HBM-PIM-like techniques.

Some PIM architectures integrate compute and memory more tightly than HBM-PIM and GDDR-PIM. These architectures attain extreme parallelism by executing bulk bitwise functions directly on bitline outputs [36, 46] or by leveraging the physical properties of non-volatile memory (NVM) to perform analog operations [38, 37, 50]. However, these systems are more limited in the types of primitives they can accelerate. While ReRAM efficiently implements dot products for weight-stationary inference and bitwise operations can be chained to implement general arithmetic, the accuracy, precision, and programmability of these systems are limited, precluding their use for many compute domains. When such architectures are leveraged for irregular workloads (e.g., GraphR [47]) the focus is energy and area efficiency rather than improved data bandwidth.

Past work has also studied the role of various forms of PIM for accelerating wave simulation [27], graph analytics [8, 52, 55, 40], sparse ML [24], and many other primitives [25]. However, each of these targets a PIM architecture that differs in significant ways (they use domain-specific or loosely-coupled architectures) from the commercial PIM designs studied in this work.

8 Conclusion

To the best of our knowledge, this is the first work to evaluate emerging commercial PIM designs across primitives from a broad set of domains. To this end, we first deduce a PIM-amenability-test, which can be used to assess PIM’s potential for accelerating primitives, and also to guide efficient orchestration and data-placement. Using this test, we observe that commercial PIM designs, which today are rightly geared toward (a narrow set of) ML primitives, do not accelerate the primitives under study even though these primitives exhibit PIM-amenable properties. To address this, we identify bottlenecks unique to these PIM designs along with hardware/software optimizations that overcome these bottlenecks, improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline. Overall, our work shows that, while emerging commercial PIM designs hold promise, for broad acceleration, a more inclusive PIM design will be necessary.

References

  • [1] “Jedec high bandwidth memory (hbm) dram,” https://www.jedec.org/standards-documents/docs/jesd235a, 2013.
  • [2] “Jedec publishes hbm3 update to high bandwidth memory (hbm) standard,” https://www.jedec.org/news/pressreleases/jedec-publishes-hbm3-update-high-bandwidth-memory-hbm-standard, 2022.
  • [3] “rocprofiler developer tool,” https://github.com/ROCm-Developer-Tools/rocprofiler, 2022.
  • [4] A. Abdelfattah, H. Anzt, E. G. Boman, E. Carson, T. Cojean, J. Dongarra, M. Gates, T. Grützmacher, N. J. Higham, S. Li et al., “A survey of numerical methods utilizing mixed precision arithmetic,” arXiv preprint arXiv:2007.06674, 2020.
  • [5] A. Addisie, H. Kassa, O. Matthews, and V. Bertacco, “Heterogeneous memory subsystem for natural graph analytics,” in 2018 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2018, pp. 134–145.
  • [6] S. Aga, N. Jayasena, and M. Ignatowski, “Co-ml: a case for collaborative ml acceleration using near-data processing,” in Proceedings of the International Symposium on Memory Systems, 2019, pp. 506–517.
  • [7] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 481–492.
  • [8] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 105–117.
  • [9] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015, pp. 336–348.
  • [10] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” ACM SIGARCH Computer Architecture News, vol. 43, no. 3S, pp. 336–348, 2015.
  • [11] H. Anzt, G. Flegar, T. Grützmacher, and E. S. Quintana-Ortí, “Toward a modular precision ecosystem for high-performance computing,” The International Journal of High Performance Computing Applications, vol. 33, no. 6, pp. 1069–1078, 2019.
  • [12] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler, “To push or to pull: On reducing communication and synchronization in graph computations,” in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017, pp. 93–104.
  • [13] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan et al., “Google workloads for consumer devices: Mitigating data movement bottlenecks,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018, pp. 316–331.
  • [14] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarungnirun, K. Hsieh, N. Hajinazar, K. T. Malladi, H. Zheng et al., “Conda: Efficient cache coherence support for near-data accelerators,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 629–642.
  • [15] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh, K. T. Malladi, H. Zheng, and O. Mutlu, “Lazypim: An efficient cache coherence mechanism for processing-in-memory,” IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 46–50, 2016.
  • [16] P. Briggs, K. D. Cooper, and L. Torczon, “Improvements to graph coloring register allocation,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 16, no. 3, pp. 428–455, 1994.
  • [17] J. Choe, A. Huang, T. Moreshet, M. Herlihy, and R. I. Bahar, “Concurrent data structures with near-data-processing: An architecture-aware implementation,” in The 31st ACM Symposium on Parallelism in Algorithms and Architectures, 2019, pp. 297–308.
  • [18] F. C. Chow and J. L. Hennessy, “The priority-based coloring approach to register allocation,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 12, no. 4, pp. 501–536, 1990.
  • [19] Criteo, “Criteo terabyte click logs dataset,” https://ailab.criteo.com/criteo-1tb-click-logs-dataset/.
  • [20] T. A. Davis et al., “Suitesparse: A suite of sparse matrix software,” 2015. [Online]. Available: https://people.engr.tamu.edu/davis/suitesparse.html
  • [21] F. Devaux, “The true processing in memory accelerator,” in 2019 IEEE Hot Chips 31 Symposium (HCS).   IEEE Computer Society, 2019, pp. 1–24.
  • [22] J. Domke, E. Vatai, A. Drozd, P. Chen, Y. Oyama, L. Zhang, S. Salaria, D. Mukunoki, A. Podobas, M. Wahib et al., “Matrix engines for high performance computing: A paragon of performance or grasping at straws?” in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS).   IEEE, 2021, pp. 1056–1065.
  • [23] Z. Fu, M. Personick, and B. Thompson, “Mapgraph: A high level api for fast development of high performance graph analytics on gpus,” in Proceedings of workshop on GRAph data management experiences and systems, 2014, pp. 1–6.
  • [24] C. Giannoula, I. Fernandez, J. G. Luna, N. Koziris, G. Goumas, and O. Mutlu, “Sparsep: Towards efficient sparse matrix vector multiplication on real processing-in-memory architectures,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 6, no. 1, pp. 1–49, 2022.
  • [25] J. Gómez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking memory-centric computing systems: Analysis of real processing-in-memory hardware,” in 2021 12th International Green and Sustainable Computing Conference (IGSC).   IEEE, 2021, pp. 1–7.
  • [26] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, “Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers,” in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2018, pp. 603–613.
  • [27] B. Hanindhito, R. Li, D. Gourounas, A. Fathi, K. Govil, D. Trenev, A. Gerstlauer, and L. John, “Wave-pim: Accelerating wave simulation using processing-in-memory,” in 50th International Conference on Parallel Processing, 2021, pp. 1–11.
  • [28] M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, “Newton: A dram-maker’s accelerator-in-memory (aim) architecture for machine learning,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2020, pp. 372–385.
  • [29] L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee et al., “Recnmp: Accelerating personalized recommendation with near-memory processing,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).   IEEE, 2020, pp. 790–803.
  • [30] J. H. Kim, S.-H. Kang, S. Lee, H. Kim, Y. Ro, S. Lee, D. Wang, J. Choi, J. So, Y. Cho et al., “Aquabolt-xl hbm2-pim, lpddr5-pim with in-memory processing, and axdimm with acceleration buffer,” IEEE Micro, vol. 42, no. 3, pp. 20–30, 2022.
  • [31] Y. Kwon, Y. Lee, and M. Rhu, “Tensordimm: A practical near-memory processing architecture for embeddings and tensor operations in deep learning,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 740–753.
  • [32] C.-C. Lee, C. Hung, C. Cheung, P.-F. Yang, C.-L. Kao, D.-L. Chen, M.-K. Shih, C.-L. C. Chien, Y.-H. Hsiao, L.-C. Chen et al., “An overview of the development of a gpu with integrated hbm on silicon interposer,” in 2016 IEEE 66th Electronic Components and Technology Conference (ECTC).   IEEE, 2016, pp. 1439–1444.
  • [33] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho, I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and J. Cho, “A 1ynm 1.25v 8gb, 16gb/s/pin gddr6-based accelerator-in-memory supporting 1tflops mac operation and various activation functions for deep-learning applications,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1–3.
  • [34] S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware architecture and software stack for pim based on commercial dram technology: Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 43–56.
  • [35] W. J. Lee, C. H. Kim, Y. Paik, J. Park, I. Park, and S. W. Kim, “Design of processing-“inside”-memory optimized for dram behaviors,” IEEE Access, vol. 7, pp. 82633–82648, 2019.
  • [36] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A dram-based reconfigurable in-situ accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 288–301.
  • [37] Y. Li, T. Bai, X. Xu, Y. Zhang, B. Wu, H. Cai, B. Pan, and W. Zhao, “A survey of mram-centric computing: From near memory to in memory,” IEEE Transactions on Emerging Topics in Computing, 2022.
  • [38] S. Mittal, “A survey of reram-based architectures for processing-in-memory and neural networks,” Machine learning and knowledge extraction, vol. 1, no. 1, pp. 75–114, 2018.
  • [39] D. G. Murray, J. Šimša, A. Klimovic, and I. Indyk, “Tf.data: A machine learning data processing framework,” Proc. VLDB Endow., vol. 14, no. 12, pp. 2945–2958, Jul. 2021. [Online]. Available: https://doi.org/10.14778/3476311.3476374
  • [40] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “Graphpim: Enabling instruction-level pim offloading in graph computing frameworks,” in 2017 IEEE International symposium on high performance computer architecture (HPCA).   IEEE, 2017, pp. 457–468.
  • [41] L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin, “Graphbig: understanding graph computing in the context of industrial solutions,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12.
  • [42] R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C.-Y. Cher, C. H. Costa, J. Doi, C. Evangelinos et al., “Active memory cube: A processing-in-memory architecture for exascale systems,” IBM Journal of Research and Development, vol. 59, no. 2/3, pp. 17–1, 2015.
  • [43] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini et al., “Deep learning recommendation model for personalization and recommendation systems,” arXiv preprint arXiv:1906.00091, 2019.
  • [44] S. Pati, S. Aga, N. Jayasena, and M. D. Sinclair, “Demystifying bert: System design implications,” in 2022 IEEE International Symposium on Workload Characterization (IISWC), 2022, pp. 296–309.
  • [45] G. Salvador, W. H. Darvin, M. Huzaifa, J. Alsop, M. D. Sinclair, and S. V. Adve, “Specializing coherence, consistency, and push/pull for gpu graph analytics,” in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).   IEEE, 2020, pp. 123–125.
  • [46] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory accelerator for bulk bitwise operations using commodity dram technology,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 273–287.
  • [47] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Graphr: Accelerating graph processing using reram,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2018, pp. 531–543.
  • [48] Y. Wang, Y. Pan, A. Davidson, Y. Wu, C. Yang, L. Wang, M. Osama, C. Yuan, W. Liu, A. T. Riffel et al., “Gunrock: Gpu graph analytics,” ACM Transactions on Parallel Computing (TOPC), vol. 4, no. 1, pp. 1–49, 2017.
  • [49] L. C. Wilcox, G. Stadler, C. Burstedde, and O. Ghattas, “A high-order discontinuous galerkin method for wave propagation through coupled elastic–acoustic media,” Journal of Computational Physics, vol. 229, no. 24, pp. 9373–9396, 2010.
  • [50] Y. Xi, B. Gao, J. Tang, A. Chen, M.-F. Chang, X. S. Hu, J. Van Der Spiegel, H. Qian, and H. Wu, “In-memory learning with analog resistive switching memory: A review and perspective,” Proceedings of the IEEE, vol. 109, no. 1, pp. 14–42, 2020.
  • [51] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “Top-pim: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd international symposium on High-performance parallel and distributed computing, 2014, pp. 85–98.
  • [52] M. Zhang, Y. Zhuo, C. Wang, M. Gao, Y. Wu, K. Chen, C. Kozyrakis, and X. Qian, “Graphp: Reducing communication for pim-based graph processing with efficient data partition,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2018, pp. 544–557.
  • [53] M. Zhao, N. Agarwal, A. Basant, B. Gedik, S. Pan, M. Ozdal, R. Komuravelli, J. Pan, T. Bao, H. Lu, S. Narayanan, J. Langman, K. Wilfong, H. Rastogi, C.-J. Wu, C. Kozyrakis, and P. Pol, “Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22.   New York, NY, USA: Association for Computing Machinery, 2022, pp. 1042–1057. [Online]. Available: https://doi.org/10.1145/3470496.3533044
  • [54] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 359–371.
  • [55] Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, and X. Qian, “Graphq: Scalable pim-based graph processing,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 712–725.