
License: CC BY 4.0
arXiv:2312.02880v2 [cs.AR] 18 Mar 2024

PULSAR: Simultaneous Many-Row Activation
for Reliable and High-Performance Computing
in Off-the-Shelf DRAM Chips

Ismail Emir Yuksel ETH Zurich Yahya Can Tugrul ETH Zurich F. Nisa Bostanci ETH Zurich Abdullah Giray Yaglikci ETH Zurich Ataberk Olgun ETH Zurich Geraldo F. Oliveira ETH Zurich Melina Soysal ETH Zurich Haocong Luo ETH Zurich Juan Gomez Luna ETH Zurich Mohammad Sadrosadati ETH Zurich Onur Mutlu ETH Zurich
Abstract

Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach where bulk-bitwise operations are performed leveraging intrinsic analog properties within the DRAM array and massive parallelism across DRAM columns. Unfortunately, 1) modern off-the-shelf DRAM chips do not officially support PuM operations and 2) existing techniques of performing PuM operations on off-the-shelf DRAM chips suffer from two key limitations. First, these techniques have low success rates, i.e., only a small fraction of DRAM columns can correctly execute PuM operations, because they operate beyond manufacturer-recommended timing constraints, causing these operations to be highly susceptible to noise and process variation. Second, these techniques have limited compute primitives, preventing them from fully leveraging parallelism across DRAM columns and thus hindering their performance benefits.

We propose PULSAR, a new technique to enable high-success-rate and high-performance PuM operations in off-the-shelf DRAM chips. PULSAR leverages our new observation that a carefully-crafted sequence of DRAM commands simultaneously activates up to 32 DRAM rows. PULSAR overcomes the limitations of existing techniques by 1) replicating the input data to improve the success rate and 2) enabling new bulk bitwise operations (e.g., many-input majority, Multi/RowInit, and Bulk-Write) to improve the performance.

Our analysis on 120 off-the-shelf DDR4 chips from two major manufacturers shows that PULSAR achieves 24.18% higher success rate and 121% higher performance over seven arithmetic-logic operations compared to FracDRAM, a state-of-the-art off-the-shelf DRAM-based PuM technique.

1 Introduction

Data movement between the processor and the main memory is a first-order obstacle against improving performance and energy efficiency in modern systems [70, 13]. Many prior works [1, 10, 15, 18, 20, 83, 23, 25, 31, 34, 38, 61, 60, 81, 96, 98, 97, 102, 100, 103, 99, 104, 105, 107, 108, 130, 26] propose Processing-using-Memory (PuM) techniques to alleviate data movement bottlenecks, where computation is performed directly within memory arrays (e.g., DRAM) by leveraging the intrinsic analog operating properties of the memory device. PuM significantly reduces data movement, thereby lowering both energy consumption and execution time (i.e., improving system performance). PuM can be enabled in modern systems via 1) various modifications to DRAM chips [100, 97, 83, 104, 102, 103, 96, 61, 130, 7] or 2) violating the timing constraints without the need of any modifications to DRAM chips [80, 25, 26, 51, 50].

Prior works propose PuM techniques and experimentally demonstrate that three sets of PuM operations can be executed in unmodified off-the-shelf DRAM chips: 1) bitwise logic and arithmetic operations based on three-input majority functions, i.e., MAJ3 [25, 26], 2) bulk-data copy at DRAM row granularity (called RowClone) [25, 98, 78], and 3) generation of security primitives (e.g., in-DRAM true random number generation and physical unclonable functions) [51, 50, 80, 78]. Unfortunately, these operations suffer from two key problems that significantly limit their applicability.

Success Rate. The MAJ3 operation is based on multiple-row activation, which connects a bitline to multiple cells by simultaneously activating multiple DRAM rows. We define the success rate as the percentage of bitlines that reliably and correctly perform the MAJ3 operation. Unfortunately, MAJ3 operations in modern off-the-shelf DRAM chips have a low success rate. This is because multiple-row activation 1) is not officially supported by DRAM manufacturers, as it requires operating beyond manufacturer-recommended timing constraints, and 2) can result in a deviation on the bitline voltage that is smaller than the reliable sensing margin, due to noise and manufacturing process variation. These factors prevent some bitlines from reliably and successfully performing a MAJ3 operation. Prior work attempts to improve the success rate of MAJ3 in old-generation DRAM chips (i.e., DDR3) [26]. However, current off-the-shelf DRAM chips (i.e., DDR4) still suffer from low success rates (e.g., 78.85% on average across the DDR4 chips we test) in MAJ3 operations (§3.1.1). Consequently, MAJ3 operations in modern DRAM chips have poor reliability and frequently produce incorrect results.

Performance. PuM techniques [25, 26] in unmodified off-the-shelf DRAM chips are limited in functionality, which significantly hinders their performance. Although these techniques can perform basic operations such as two-operand bitwise AND/OR (e.g., A • B) and RowClone (COPY A → B) [25, 78, 26], many modern applications (e.g., data analytics [100, 45, 114], databases [121, 123, 30], and graph processing [61, 10, 31]) would benefit from executing more complex operations, such as many-input (i.e., more than two) bitwise AND/OR operations (e.g., A • B • C) and many-row initialization (e.g., COPY A → [B, C]). Due to limited functionality, prior works sequentially execute the basic PuM operations to perform complex PuM operations. However, sequentially executing basic operations leads to high latency and low throughput.

In this paper, we propose PULSAR, a new PuM technique that improves the success rate and performance of PuM operations in unmodified off-the-shelf DRAM chips. (Footnote 1: We name our technique PULSAR, a PuM Technique that Leverages Simultaneous Activation of Many Rows.) We experimentally demonstrate using 120 off-the-shelf DDR4 DRAM chips from two major DRAM manufacturers that a carefully crafted sequence of DRAM commands simultaneously activates many rows (i.e., up to 32). PULSAR leverages this new observation and demonstrates a proof-of-concept where off-the-shelf DRAM chips can be used to execute PuM operations with much higher success rate and performance than the state-of-the-art [26]. PULSAR overcomes the two key problems of the existing techniques [26, 25] by 1) replicating the input data across different DRAM rows to improve the success rate and 2) enabling new PuM operations (e.g., Multi/RowInit, many-input charge-sharing operations, and Bulk-Write) to provide significant performance improvements.

Input Replication. PULSAR replicates (i.e., stores multiple copies of) each majority operation’s input on all simultaneously activated rows. During multiple-row activation, these multiple copies contribute to charge sharing and thus increase the net deviation in bitline voltage. For example, performing a MAJ3 operation by simultaneously activating six rows that contain two copies of each input results in 44.06% higher net deviation in bitline voltage than activating three rows that store only one copy of each input (§5.1). A larger deviation in bitline voltage greatly reduces the effects of electrical noise and process variation on the results of MAJ3 operations. We present the first characterization of the success rate of MAJ3 operations in DDR4 using 120 off-the-shelf DRAM chips. Our results show that PULSAR executes MAJ3 operations with a 97.91% success rate, which is 24.18% higher than that of the state-of-the-art technique [26].
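To make the intuition behind input replication concrete, the following is a minimal sketch of an ideal charge-sharing model. It is not the SPICE model of §5.1: the function name `bitline_deviation`, the supply voltage, and the cell and bitline capacitances are all illustrative assumptions, so the printed gain is not expected to reproduce the 44.06% figure.

```python
def bitline_deviation(cell_values, vdd=1.2, c_cell=1.0, c_bitline=4.0):
    """Deviation of the bitline from VDD/2 after ideal charge sharing.

    cell_values: logic values (0 or 1) of all simultaneously activated cells.
    vdd, c_cell, c_bitline: assumed, illustrative values (arbitrary units).
    """
    n = len(cell_values)
    # Total charge: precharged bitline (VDD/2) plus every activated cell.
    charge = c_bitline * vdd / 2 + c_cell * vdd * sum(cell_values)
    v_final = charge / (c_bitline + n * c_cell)
    return v_final - vdd / 2


inputs = [1, 1, 0]                          # MAJ3(1, 1, 0)
single = bitline_deviation(inputs)          # three rows, one copy per input
replicated = bitline_deviation(inputs * 2)  # six rows, two copies per input
gain = replicated / single - 1

print(f"one copy:   {single * 1000:.1f} mV")
print(f"two copies: {replicated * 1000:.1f} mV")
print(f"relative gain under these assumptions: {gain:.0%}")
```

In this idealized model the gain depends only on the assumed ratio of bitline to cell capacitance; noise, process variation, and sense-amplifier behavior are deliberately left out.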

New PuM Primitives. Activating N rows simultaneously (where N is up to 32) in off-the-shelf DRAM chips enables more complex operations. PULSAR introduces new PuM primitives that perform bulk data operations on multiple (up to N) operands with a single simultaneous activation: Multi/RowInit, many-input charge-sharing operations, and Bulk-Write. The Multi/RowInit primitive copies one row into N rows simultaneously. Many-input charge-sharing operations enable majority operations with up to N inputs (e.g., MAJ5 and MAJ7). The Bulk-Write operation writes to N rows with only one write command. These primitives increase the throughput and reduce the latency of two PuM operations: 1) majority-based computation and 2) cold-boot-attack defense.

Majority-based Computation. To our knowledge, for the first time, we demonstrate a proof-of-concept that off-the-shelf DRAM chips can execute MAJM operations (i.e., MAJ3+ operations) with high reliability. We study the throughput and the latency of majority-based computations in off-the-shelf DRAM chips using arithmetic and logic operations. Our results show that PULSAR improves performance by 121% on average compared to the state-of-the-art technique [26].

Cold-Boot-Attack Defense. We propose a content destruction technique against cold boot attacks that leverages the new PuM primitives that PULSAR introduces. Our results show that PULSAR speeds up content destruction in off-the-shelf DRAM chips by 7.75× compared to the FracDRAM [26]-based content destruction technique.

This paper makes the following key contributions:

  • We demonstrate, through an extensive experimental characterization of 120 modern DRAM chips from two major manufacturers, that modern DRAM chips can simultaneously activate up to 32 DRAM rows.

  • We introduce PULSAR, a new PuM technique that leverages simultaneous activation of up to 32 rows. PULSAR improves the success rate and performance of PuM operations in off-the-shelf DRAM chips. PULSAR demonstrates a proof-of-concept that off-the-shelf DRAM chips are able to execute MAJ3 operations with a 97.91% success rate, which is 24.18% higher than the state-of-the-art [26].

  • To our knowledge, for the first time, PULSAR demonstrates MAJ operations with more than three inputs at a very high success rate (73.93% for MAJ5 on average across the DRAM modules that we test) and core primitives called Multi/RowInit and Bulk-Write that significantly reduce the latency of many-row initialization.

  • We show that PULSAR significantly improves the performance of seven arithmetic and logic operations over the state-of-the-art mechanism [26] by 2.21× and significantly reduces the latency of content destruction against cold boot attacks in off-the-shelf DRAM chips by 7.55×.

2 Background

This section briefly details DRAM organization, operation, timings, and PuM operations in off-the-shelf DRAM chips.

2.1 DRAM Organization

Fig. 1 shows the organization of DRAM-based memory systems. A memory channel connects the processor (CPU) to a DRAM module, where a module consists of multiple DRAM ranks. A rank is formed by a set of DRAM chips operated in lockstep. A DRAM chip has multiple DRAM banks, each of which is composed of many DRAM subarrays. Within a subarray, DRAM cells form a two-dimensional structure interconnected over bitlines and wordlines. The row decoder in a subarray decodes the row address and drives one wordline out of many. A row of DRAM cells on the same wordline is referred to as a DRAM row. The DRAM cells in the same column are connected to the sense amplifier via a bitline. A DRAM cell stores a binary data value in the form of electrical charge on a capacitor (VDD or 0 V), and this data is accessed through an access transistor, which is driven by the wordline to connect the cell capacitor to the bitline.

Figure 1: DRAM Organization.

2.2 DRAM Operation and Timing

Operation. Data stored in a DRAM array is internally accessed at DRAM row granularity. In the closed state, all wordlines in a bank are de-asserted and all bitlines are precharged to VDD/2. To access a row, its data must first be fetched into the sense amplifiers. To do so, the memory controller issues an ACT command to assert the wordline and enable the sense amplifier. When the wordline is asserted, the cell capacitor connects to the bitline and shares its charge, causing a small deviation in the bitline voltage. Afterwards, the sense amplifier is enabled to sense and amplify this small deviation towards VDD or 0 V, depending on the cell data. Once the data is fetched to the sense amplifiers and the cell’s data is restored, the memory controller may issue WR/RD commands to write to/read from the row. To access another row, the bank needs to be in the closed (i.e., precharged) state. To do so, the memory controller issues a PRE command to disable the sense amplifiers, de-assert the wordline, and precharge the bitlines to VDD/2. Once the bank is precharged, the memory controller can access another row.

Timing. To ensure correct operation, the memory controller must obey the DRAM timing parameters specified in the DRAM interface standards (e.g., DDR4 [43]) by the Joint Electron Device Engineering Council (JEDEC). We describe the most relevant timing constraints in the scope of this paper. The memory controller must wait for the latency of sensing the row’s data and fully restoring a DRAM cell’s charge ($t_{RAS}$) before issuing a PRE command after an ACT command. To open another row, the memory controller must wait for the latency of de-asserting a wordline and precharging the bitlines to VDD/2 ($t_{RP}$) before issuing another ACT command.
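The two constraints can be illustrated with a small checker over a command trace for a single bank. This is a sketch under stated assumptions: the `Cmd` type, the `timing_violations` helper, and the numeric values of `T_RAS_NS` and `T_RP_NS` are hypothetical placeholders (real values depend on the module's speed grade), and the 3 ns gaps in the reduced trace are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    name: str       # "ACT" or "PRE"
    time_ns: float  # issue time of the command


# Assumed, illustrative values; not taken from the paper or a datasheet.
T_RAS_NS = 32.0   # minimum ACT -> PRE delay
T_RP_NS = 13.5    # minimum PRE -> ACT delay


def timing_violations(cmds):
    """Return (parameter, observed gap) for each violated constraint.

    Only consecutive commands to the same bank are considered.
    """
    violations = []
    for prev, cur in zip(cmds, cmds[1:]):
        gap = cur.time_ns - prev.time_ns
        if prev.name == "ACT" and cur.name == "PRE" and gap < T_RAS_NS:
            violations.append(("tRAS", gap))
        if prev.name == "PRE" and cur.name == "ACT" and gap < T_RP_NS:
            violations.append(("tRP", gap))
    return violations


nominal = [Cmd("ACT", 0.0), Cmd("PRE", 32.0), Cmd("ACT", 45.5)]
reduced = [Cmd("ACT", 0.0), Cmd("PRE", 3.0), Cmd("ACT", 6.0)]

print(timing_violations(nominal))  # no violations
print(timing_violations(reduced))  # both tRAS and tRP are violated
```

A trace like `reduced` is the kind of sequence a standard-compliant memory controller would never issue, which is why experiments of this type need a testing infrastructure with direct control over command timing.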

2.3 PuM Operations in Off-the-Shelf DRAM

PuM architectures allow computations to be performed within the memory array, in contrast to traditional architectures, where data has to be constantly transferred between memory and processor units. Off-the-shelf DRAM chips are not officially designed to support PuM operations (i.e., operations that are performed inside the memory). Although neither DRAM manufacturers nor JEDEC officially support PuM operations, the design of off-the-shelf DRAM chips does not fully prevent users from activating multiple rows at once by violating the $t_{RAS}$ and $t_{RP}$ timing constraints [25, 80, 26, 132]. By doing so, two fundamental PuM operations can be performed in off-the-shelf DRAM chips: 1) MAJ3 and 2) RowClone.

MAJ3. Prior work introduces the idea of multiple-row activation (i.e., activating more than one row simultaneously) in off-the-shelf DRAM, which enables a charge-sharing operation between multiple cells and leads to a MAJ3 operation across the activated rows (i.e., row groups). The state-of-the-art mechanism [26] performs four-row activation to enable MAJ3 operations in off-the-shelf DRAM chips. However, with an even number of operands, a majority operation cannot be performed when the inputs are in the equilibrium state (i.e., an equal number of ones and zeros). To address this issue, the authors propose the Frac operation [26]. The Frac operation can charge any row to VDD/2, putting the row into a neutral state during multiple-row activation. As a result, they enable MAJ3 by activating four rows at once.

RowClone [98]. Prior work [25] enables consecutive activation of two DRAM rows to copy data in DRAM, RowClone, in off-the-shelf DRAM chips. RowClone enables data movement within DRAM in a DRAM row granularity without incurring the energy and execution time costs of transferring data between the DRAM and the computing units.

2.4 Majority-based Computation

Majority gates can be used to implement 1) logic operations such as AND/OR [31, 4, 96, 100, 97, 104, 25] and XOR operations [5], and 2) full adders [4, 25, 31]. These operations are then used as basic building blocks for the target in-DRAM computation (e.g., addition, multiplication) [4, 6, 61, 25]. However, MAJ gates cannot implement a NOT operation. Therefore, it is not possible to implement building blocks that require the NOT gate (e.g., XOR operation and full adder) with only MAJ gates. Prior work [25] overcomes this limitation in off-the-shelf DRAM chips by storing both the regular and the negated version of a value. The presence of both regular and negated data allows us to perform any arbitrary function, as we can implement functionally-complete logic gates (e.g., NAND, NOR). Fig. 2 shows an example of an AND/OR design (1) and a full-adder design (2) using only majority gates with regular and negated inputs.

Figure 2: Example of AND/OR and full-adder implementation using only MAJ gates with regular and negated data.
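As a software illustration of majority-based logic with regular and negated inputs, the sketch below checks a common textbook construction against truth tables. It is not necessarily the exact circuit of Fig. 2; the helper names `maj`, `and2`, `or2`, and `full_adder` are hypothetical, and the negated inputs are computed on the fly where a DRAM implementation would read the stored negated copies.

```python
from itertools import product


def maj(*bits):
    """Majority of an odd number of bits."""
    assert len(bits) % 2 == 1
    return int(sum(bits) > len(bits) // 2)


def and2(a, b):
    return maj(a, b, 0)   # a constant 0 turns MAJ3 into AND


def or2(a, b):
    return maj(a, b, 1)   # a constant 1 turns MAJ3 into OR


def full_adder(a, b, cin):
    """Full adder from MAJ3 and MAJ5 using regular and negated inputs."""
    cout = maj(a, b, cin)
    # Negated carry, computed from the negated copies of the inputs
    # because MAJ alone cannot implement NOT.
    not_cout = maj(1 - a, 1 - b, 1 - cin)
    s = maj(a, b, cin, not_cout, not_cout)
    return s, cout


for a, b in product((0, 1), repeat=2):
    assert and2(a, b) == (a & b)
    assert or2(a, b) == (a | b)

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin

print("AND/OR and full adder match their truth tables")
```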

Vertical Data Layout. Supporting bit-shift operations is essential for implementing complex computations, such as addition (e.g., carry propagation). Prior works [31, 4, 25, 100] provide this support by employing a vertical layout for the data in DRAM, such that all bits of an operand are placed in a single DRAM column (i.e., in a single bitline). Doing so eliminates the need for adding extra logic in DRAM to implement shifting and applies bulk bitwise operations to entire rows of DRAM, generating results from bitlines in parallel.
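The vertical layout can be sketched in software by modeling each DRAM row as an integer whose bit j belongs to column (bitline) j. The example below is illustrative only: `to_vertical`, `from_vertical`, `maj3_rows`, and `add_vertical` are hypothetical helpers, and the sum bit uses XOR for brevity where an in-DRAM implementation would build it from majority operations on regular and negated data.

```python
def to_vertical(values, width):
    """Row i holds bit i of every operand; column j belongs to values[j]."""
    rows = []
    for i in range(width):
        row = 0
        for col, v in enumerate(values):
            row |= ((v >> i) & 1) << col
        rows.append(row)
    return rows


def from_vertical(rows, n_cols):
    """Inverse of to_vertical: rebuild one integer per column."""
    return [
        sum(((row >> col) & 1) << i for i, row in enumerate(rows))
        for col in range(n_cols)
    ]


def maj3_rows(x, y, z):
    """Bulk bitwise MAJ3 applied to whole rows (all columns in parallel)."""
    return (x & y) | (y & z) | (x & z)


def add_vertical(a_rows, b_rows):
    """Bit-serial addition: one row-wide step per bit position."""
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):
        out.append(a ^ b ^ carry)        # sum bit of every column
        carry = maj3_rows(a, b, carry)   # carry moves to the next row
    out.append(carry)
    return out


a = [3, 5, 250, 17]
b = [1, 10, 7, 200]
sums = from_vertical(add_vertical(to_vertical(a, 8), to_vertical(b, 8)), len(a))
print(sums)
```

Carry propagation needs no shifting here because the next bit position of every operand sits in the next row of the same column.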

3 Motivation

Modern computing systems require moving data back and forth between computing units (e.g., CPU, GPU) and off-chip main memory to perform computation on the data [70, 13]. Unfortunately, this data movement is a major bottleneck that consumes a large fraction of execution time and energy [68, 72, 16, 46, 21, 118, 69, 70, 53, 124, 27, 2, 3, 36, 119]. To address this problem, Processing-using-Memory (PuM) emerges as a promising execution paradigm to alleviate the data movement bottleneck in modern and emerging applications [100, 97, 83, 104, 102, 103, 96, 61, 130, 7]. In PuM, computation takes place inside the memory (e.g., DRAM) by leveraging the intrinsic analog behavior of memory devices, resulting in reduced data movement costs.

DRAM is a prevalent main memory technology that enables PuM in various systems. Prior works demonstrate that PuM operations in off-the-shelf DRAM chips have the potential to greatly improve the performance and energy efficiency of commodity systems [80, 25, 26, 51, 50]. These works enable many fundamental PuM operations in DRAM chips, including but not limited to 1) bitwise arithmetic and logic operations using the three-input majority function (MAJ3) [25, 26], 2) bulk-data copy operations at DRAM row granularity [25] (known as RowClone [98]), and 3) security primitives (e.g., in-DRAM true random number generation (TRNG) [51, 80] and physical unclonable functions (PUF) [50]).

3.1 Limitations of State-of-the-Art

We identify two key limitations of prior work for PuM operations in commodity DRAM chips: 1) low success rate and 2) low throughput and high latency.

3.1.1 Success Rate

We define the success rate of a MAJ operation per row group as the percentage of the bitlines that reliably produce the correct output. To analyze the success rate of the MAJ3 operation in off-the-shelf DRAM chips, we conduct MAJ3 experiments using the state-of-the-art mechanism: FracDRAM [26] on 12 modern off-the-shelf DRAM modules from SK Hynix, following the methodology in §6.1.1.
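The definition above can be written down as a short function. This is a minimal sketch that interprets "reliably" as "correct in every trial", which is an assumption made here for illustration; the actual test procedure is the one described in §6.1.1, and `success_rate` is a hypothetical helper name.

```python
def success_rate(expected, trials):
    """Percentage of bitlines whose output matches `expected` in every trial.

    expected: expected output bit per bitline for one row group.
    trials:   list of observed outputs, one list of bits per trial.
    """
    n_bitlines = len(expected)
    reliable = sum(
        all(trial[i] == expected[i] for trial in trials)
        for i in range(n_bitlines)
    )
    return 100.0 * reliable / n_bitlines


expected = [1, 0, 1, 1]
trials = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],   # bitline 2 fails once, so it does not count as reliable
    [1, 0, 1, 1],
]
print(success_rate(expected, trials))
```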

Fig. 3 shows FracDRAM’s MAJ3-success-rate distribution across different row groups (y-axis) for different DRAM modules (x-axis) in a box-and-whiskers plot. (Footnote 2: A box-and-whiskers plot emphasizes the important metrics of a dataset’s distribution. The box is lower-bounded by the first quartile and upper-bounded by the third quartile. The inter-quartile range (IQR) is the distance between the first and third quartiles, i.e., the box size. Whiskers show the minimum and maximum values.) The red dashed line represents the reported average success rate of MAJ3 in DDR3 modules [26]. We make two key observations based on Fig. 3. First, FracDRAM has a low average success rate of 78.85% across all tested DRAM chips. Second, FracDRAM’s MAJ3 success rate significantly reduces (19.37% on average) across newer generations of DRAM chips from DDR3 to DDR4. Based on this observation, we expect the success rate of MAJ3 operations to reduce even more as DRAM continues to scale down in newer generations (e.g., DDR5).

We conduct SPICE simulations to investigate the reasons behind MAJ3’s low success rate across different DRAM modules following the methodology in §5.1. We analyze the effect of manufacturing process variation on MAJ3’s success rate for all the possible inputs (i.e., (0,0,0) to (1,1,1)). To do so, we conduct a Monte Carlo analysis over $10^{4}$ iterations, where we inject 10%, 20%, 30%, and 40% variation to capacitor and transistor parameters.

Figure 3: Distribution of the MAJ3 success rate of state-of-the-art mechanism across 12 off-the-shelf DDR4 chips.

Fig. 4(a) shows how manufacturing process variation (colors) affects MAJ3’s success rate (y-axis) with different input patterns (x-axis) based on our SPICE simulation results. We make two key observations from Fig. 4(a). First, for the all-1s and all-0s input patterns (i.e., (1,1,1) and (0,0,0)), MAJ3 works with a 100% success rate, as all activated cells in a bitline try to pull the bitline to the same voltage level, resulting in a safe sensing operation. Second, MAJ3 operations that have at least one different value in the input data pattern (i.e., (0,0,1) to (1,1,0)) produce incorrect results at a rate that increases with the process variation percentage, up to 46.58%. This is because some of the activated cells attempt to pull the bitline to one level, whereas the others attempt to pull the bitline to the opposite level. Depending on the cells’ characteristics, this operation produces incorrect results and thus lowers the success rate of MAJ3. To investigate further, we analyze the distribution of the bitline deviation when four rows are simultaneously activated to perform MAJ3 with an input pattern that has two logic-1 values (e.g., MAJ3(1,1,0)).

Figure 4: The success rate of MAJ3 in different input patterns (a) and the distribution of the bitline deviation (b) for various process variations.

Fig. 4(b) presents a box-and-whiskers plot (see footnote 2) that demonstrates the effect of manufacturing process variation (x-axis) on the bitline voltage’s net deviation (y-axis) when four rows are simultaneously activated to perform MAJ3(1,1,0). As a comparison point, we evaluate the deviation on the bitline when a single row that stores 1 (i.e., VDD) is activated (i.e., nominal activation operation) for the corresponding process variation percentages. We make two key observations from Fig. 4(b). First, activating multiple rows to perform MAJ3 with an input pattern that has two logic-1 values reduces the bitline deviation by 41.14% on average, compared to activating a single row. This is because the activated cells store conflicting data (i.e., not all 1s or all 0s), thus trying to pull the bitline to opposite voltage levels. Second, manufacturing process variation significantly affects the distribution of the bitline voltage deviation, i.e., boxes get wider as the process variation increases from left to right in Fig. 4(b). Increased variation can cause a MAJ3 operation to compute an incorrect result. This is because process variation can cause variations in cell capacitance and affect the behavior of transistors and bitlines, as well as the latency of wordline assertion. We conclude that these variations can affect the success rate of the charge-sharing operation and, in turn, the correctness of its results.

3.1.2 Performance

PuM operations in off-the-shelf DRAM chips [25, 26] are limited in functionality to only MAJ3 [25, 26, 100] and RowClone [98, 25] operations. Complex procedures therefore require sequentially performing these operations many times, which hinders their performance benefits. For instance, 1) to perform AND/OR operations with more than two operands, prior works need to perform multiple MAJ3 operations, since MAJ3 can perform only two-operand AND/OR operations, and 2) to initialize N rows, N RowClone operations are needed, as each of them can initialize only one row at a time. These limitations reduce the potential advantages of PuM operations in off-the-shelf DRAM chips.
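The cost of chaining two-operand operations can be illustrated at the logic level. The sketch below compares a k-input AND built from sequential MAJ3 steps with a single wider majority padded with constants; it counts logical operations only and says nothing about DRAM latency. The helper names are hypothetical.

```python
from itertools import product


def maj(*bits):
    """Majority of an odd number of bits."""
    assert len(bits) % 2 == 1
    return int(sum(bits) > len(bits) // 2)


def and_k_single(bits):
    """k-input AND as one MAJ with 2k-1 inputs: pad with k-1 zeros."""
    k = len(bits)
    return maj(*bits, *([0] * (k - 1)))


def or_k_single(bits):
    """k-input OR as one MAJ with 2k-1 inputs: pad with k-1 ones."""
    k = len(bits)
    return maj(*bits, *([1] * (k - 1)))


def and_k_chained(bits):
    """k-input AND from two-operand MAJ3 steps; returns (result, MAJ3 count)."""
    acc, ops = bits[0], 0
    for b in bits[1:]:
        acc = maj(acc, b, 0)
        ops += 1
    return acc, ops


for bits in product((0, 1), repeat=4):
    chained, ops = and_k_chained(bits)
    assert and_k_single(bits) == chained == int(all(bits))
    assert or_k_single(bits) == int(any(bits))
    assert ops == 3   # four operands need three sequential MAJ3 steps

print("4-input AND: 3 chained MAJ3 steps vs. 1 MAJ7 step")
```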

Increasing the number of operands in MAJ operations (e.g., MAJ5) can significantly improve the throughput of many applications, such as data analytics [100, 45, 114], databases [121, 123, 30], and graph processing [61, 10, 31]. To demonstrate the potential benefits of enabling MAJ operations with more than three inputs, we model four different MAJ operations: MAJ3, MAJ5, MAJ7, and MAJ9. All operation models assume equal latency values based on the state-of-the-art MAJ3 operation [26] to show the potential benefit of different majority operations. Note that the actual latency of these operations may be higher than what is assumed in this evaluation.

Fig. 5 shows the performance speedup of operations that are based on MAJ5, MAJ7, and MAJ9 over MAJ3 for 1) three bitwise logic operations: AND, OR, and XOR, and 2) four bit-serial arithmetic operations: addition (ADD), subtraction (SUB), multiplication (MUL), and division (DIV). We implement the arithmetic operations based on full-adder designs using MAJ3 and MAJ5 operations. This is because the full-adder design utilizes up to five-input majority operations [75, 4]. Based on Fig. 5, we observe that increasing the number of operands in majority operations results in significantly higher speedups for both arithmetic and logic microbenchmarks (e.g., MAJ9 achieves a 2.73× speedup over MAJ3 on average across logic microbenchmarks). Therefore, we conclude that extending PuM functionality in off-the-shelf DRAM chips greatly enhances performance for many workloads.

Figure 5: Speedup over the MAJ3 in seven microbenchmarks.

4 Simultaneous Many Row Activation

We find that by carefully crafting a specific sequence of ACT → PRE → ACT (APA) DRAM commands with reduced timings, 2, 4, 8, 16, and 32 rows in the same subarray can be activated simultaneously. We characterize 120 modern off-the-shelf DRAM chips from two major manufacturers using an FPGA-based off-the-shelf DRAM testing infrastructure (§4.1). To explain the potential mechanism behind our observation, we analyze the row decoder circuitry of a DRAM bank in an off-the-shelf DRAM chip. We hypothesize that the hierarchical structure of row decoder design with multiple pre-decoding schemes allows us to simultaneously activate many rows. We present a hypothetical row decoder circuitry that explains activating many rows simultaneously (§4.2).

4.1 Real DRAM Chip Characterization

We demonstrate that multiple (up to 32) row activation works reliably on 120 DRAM chips that come from two major manufacturers. Table 1 provides a list of the DRAM modules along with the chip identifier (Chip ID), manufacturing date (Date), die revision (Die Rev.), chip density (Chip Dens.), and DRAM organization (ranks, banks, and pins).

| Manufacturer | Module | Chip ID | Date (yy-ww) | Die Rev. | Chip Dens. | Ranks | Banks | Pins | SA Size | $N_{RG}$% (2) | $N_{RG}$% (4) | $N_{RG}$% (8) | $N_{RG}$% (16) | $N_{RG}$% (32) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK Hynix (Mfr. H) | H0-6 | H5AN4G8NMFR | Unknown | M | 4Gb | 1 | 16 | x8 | 512-640 | 2.07% | 10.65% | 25.37% | 26.81% | 9.91% |
| SK Hynix (Mfr. H) | H7-11 | H5AN4G8NAFR | Unknown | A | 4Gb | 1 | 16 | x8 | 512 | 2.49% | 12.63% | 30.77% | 35.33% | 1.83% |
| Micron (Mfr. M) | M0-3 | OUE75 D9ZFW | 20-46 | E | 16Gb | 1 | 16 | x16 | 1024 | 1.91% | 12.92% | 32.87% | 20.83% | 0% |
| Micron (Mfr. M) | M4-5 | 1LB75 D9XPG | 21-26 | B | 16Gb | 1 | 16 | x16 | 1024 | 1.47% | 8.11% | 15.27% | 11.06% | 0% |

Table 1: Summary of DDR4 DRAM chips tested. Ranks, Banks, and Pins describe the DRAM organization; SA Size is the subarray size in rows; the $N_{RG}$% columns are grouped by the number of simultaneously activated rows (2 to 32).

Infrastructure. We conduct real DRAM chip experiments on DRAM Bender [94, 77], an FPGA-based DDR4 testing infrastructure that provides precise control of the DDR commands issued to a DRAM module. Fig. 6 shows our experimental setup, which consists of four main components: 1) the Xilinx Alveo U200 FPGA board [128] programmed with DRAM Bender, 2) a host machine that generates the sequence of DRAM commands that we use in our tests, 3) rubber heaters that clamp the DRAM module on both sides to avoid fluctuations in ambient temperature, and 4) a MaxWell FT200 [67] temperature controller that controls the heaters and keeps the DRAM chips at the target temperature.

Figure 6: DDR4 DRAM Bender experimental setup.

Verification Experiment. We 1) initialize the N rows that would be activated simultaneously, which we refer to as an N-row group ($N_{RG}$), with a predetermined data pattern, 2) perform the ACT → PRE → ACT command sequence (APA) with reduced timings on the $N_{RG}$ to simultaneously activate multiple rows, 3) issue a WR command while all rows in the $N_{RG}$ are active, and 4) precharge the bank and individually read each row in the $N_{RG}$ while adhering to the manufacturer-recommended DRAM timing parameters. If the rows are activated with the APA command sequence, the WR command overwrites the predetermined data pattern with the new one. We observe that all rows in the $N_{RG}$ are updated with the newly written data pattern. We observe that up to 32 rows can be activated simultaneously in chips from one major DRAM manufacturer, while up to 16 rows can be activated in chips from another major DRAM manufacturer.
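The four steps of the verification experiment can be sketched against a toy software model of a bank. Everything here is hypothetical: `ToyBank` is not the DRAM Bender API, and its `groups` table simply asserts which rows an APA sequence opens instead of modeling any real circuit behavior.

```python
class ToyBank:
    """Toy bank model. `groups` maps (first_row, second_row) to the set of
    rows that an APA sequence opens; pairs not listed open only second_row."""

    def __init__(self, n_rows, groups):
        self.rows = [0] * n_rows
        self.groups = groups
        self.open_rows = set()

    def write_row(self, row, data):   # nominal-timing row write
        self.rows[row] = data

    def read_row(self, row):          # nominal-timing row read
        return self.rows[row]

    def apa(self, first, second):     # ACT -> PRE -> ACT with reduced timings
        self.open_rows = set(self.groups.get((first, second), {second}))

    def wr(self, data):               # WR reaches every open row
        for r in self.open_rows:
            self.rows[r] = data

    def pre(self):                    # close the bank
        self.open_rows = set()


def rows_activated(bank, candidates, first, second, old=0x00, new=0xFF):
    """Return the candidate rows that were overwritten by a single WR."""
    for r in candidates:
        bank.write_row(r, old)        # 1) initialize with a known pattern
    bank.apa(first, second)           # 2) APA with reduced timings
    bank.wr(new)                      # 3) one WR while the rows are open
    bank.pre()                        # 4) precharge, then read each row
    return {r for r in candidates if bank.read_row(r) == new}


bank = ToyBank(16, {(0, 3): {0, 1, 2, 3}})
print(rows_activated(bank, range(16), 0, 3))   # the four-row group
print(rows_activated(bank, range(16), 5, 6))   # only the second row
```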

Finding All $N_{RG}$ in a Subarray. We successfully reverse engineer the number of subarrays and subarray size (listed in Table 1) using RowClone [25], which is used in many prior works to determine subarray boundaries [95, 78]. To investigate which rows are simultaneously activated in a subarray, we perform the ACT $R_F$ → PRE → ACT $R_S$ command sequence with reduced timing parameters, where $R_F$ is the first activated row and $R_S$ is the second activated row. We test every possible combination of $R_F$ and $R_S$ for this sequence and record the row addresses that are simultaneously activated in a subarray. We present in Table 1 the percentage of the number of rows that can be activated simultaneously out of all two-row address pairs in a subarray ($N_{RG}$%) across different DRAM chips and manufacturers.

4.2 Hypothetical Row Decoder Design

The row decoder circuitry in a DRAM bank decodes the n-bit row address (RA) and asserts one wordline out of $2^n$ wordlines. Modern DRAM chips have multiple tiers of decoding stages (pre-decode and decode stages) to reduce latency, area, and power consumption [8, 120, 115]. We analyze the row decoder circuitry of an off-the-shelf DRAM module, H8, which has $2^{16}$ rows in a bank. We observe that in H8, each subarray consists of $2^9$ rows, and the total number of subarrays in a bank is $2^7$. We present a hypothesis regarding the row decoder circuitry that allows simultaneous activation of many rows and the sequence of operations that occur in the row decoder when ACT and PRE commands are issued.

Row Address Indexing. Based on the characterization results, we hypothesize that the lower-order 9 bits of the RA are used to index the row within a subarray, while the higher-order 7 bits are used to index the corresponding subarray.
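The hypothesized indexing can be written down as a small helper. This is a minimal sketch that assumes the 16-bit RA layout described above (9 low-order bits for the row within a subarray, 7 high-order bits for the subarray); the function and constant names are illustrative and do not come from the paper or from DRAM Bender.

```python
SUBARRAY_ROW_BITS = 9    # hypothesized: RA[0:8] selects the row within a subarray
SUBARRAY_INDEX_BITS = 7  # hypothesized: RA[9:15] selects the subarray within a bank


def split_row_address(ra: int) -> tuple[int, int]:
    """Split a 16-bit row address into (subarray_index, row_in_subarray)."""
    if not 0 <= ra < 1 << (SUBARRAY_ROW_BITS + SUBARRAY_INDEX_BITS):
        raise ValueError("row address out of range for a 2**16-row bank")
    row_in_subarray = ra & ((1 << SUBARRAY_ROW_BITS) - 1)
    subarray_index = ra >> SUBARRAY_ROW_BITS
    return subarray_index, row_in_subarray


def same_subarray(ra_a: int, ra_b: int) -> bool:
    """True if both row addresses map to the same subarray under this layout."""
    return split_row_address(ra_a)[0] == split_row_address(ra_b)[0]
```

Under this layout, rows 0 to 511 fall in subarray 0 and row 512 is the first row of subarray 1.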

Row Decoder Design. Fig. 7 illustrates the potential row decoder circuitry of a DRAM bank in an off-the-shelf DRAM module that consists of two decoding stages: 1) Global Wordline Decoder (GWLD) (❶) and 2) Local Wordline Decoder (LWLD) (❷). When an ACT command is issued, three operations occur in order. First, GWLD decodes the higher-order 7 bits of the RA (RA[9:15]) and drives the corresponding Global Wordline ($GWL$) that is connected to the LWLD of the corresponding DRAM subarray. Second, Stage 1 of LWLD predecodes the lower-order 9 bits of the RA (RA[0:8]) in five tiers of predecoders (Predecoder A/B/C/D/E, ❸) and latches the predecoded address bits ($P_{A0}, P_{A1}, \ldots, P_{E3}$), a total of 18 bits. Third, Stage 2 of LWLD, which consists of 64 sub-decoder trees (❹), decodes the predecoded $P$ signals to assert the corresponding Local Wordline ($LWL$). When a PRE command is issued, the latched predecoded $P$ signals are reset to de-assert the corresponding $LWL$.

Figure 7: Hypothetical Row Decoder Design.

Activating Multiple Rows: A Walk-Through. Reducing the latency between the PRE and the second ACT command (i.e., $t_{RP}$) prevents the reset operation and allows the predecoders to latch the next RA without de-asserting the signals latched for the RA targeted by the first ACT command. Hence, after the second ACT command, depending on the target addresses of the APA sequence, one or two latches of each predecoder in LWLD can be set. By changing the row addresses targeted by the two ACT commands, we can control the number and addresses of the simultaneously activated rows in a subarray.

Fig. 8 demonstrates an example of how the hypothetical row decoder design enables simultaneously activating four rows in the same bank when an APA command sequence targeting Row 0 ($\ldots0000_2$) and Row 7 ($\ldots0111_2$) (i.e., ACT 0 → PRE → ACT 7) is issued. When the first command, ACT 0, is received, the predecoders assert the $P_{A0}$ and $P_{B0}$ signals. The asserted $P_{A0}$ and $P_{B0}$ signals drive $LWL_0$. When ACT 7 is received, the predecoders assert the $P_{A1}$ and $P_{B3}$ signals. Because the ACT 7 command is issued with reduced timings, the signals $P_{A0}$ and $P_{B0}$ are not yet de-asserted. As a result, all of the $P_{A0}$, $P_{A1}$, $P_{B0}$, and $P_{B3}$ signals are set simultaneously, and the decoder tree asserts all of the $LWL_0$, $LWL_1$, $LWL_6$, and $LWL_7$ wordlines, thereby simultaneously activating rows 0, 1, 6, and 7.
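The walk-through above can be reproduced with a small behavioral model of the hypothetical predecoder latches. This is a sketch under stated assumptions: the mapping of address bits to predecoders (A sees RA[0]; B, C, D, and E see RA[1:2], RA[3:4], RA[5:6], and RA[7:8]) is inferred from the examples in this section and is not a confirmed circuit detail, and all names are illustrative.

```python
from itertools import product

# Assumed field layout as (name, shift, width), least significant field first.
PREDECODER_FIELDS = [("A", 0, 1), ("B", 1, 2), ("C", 3, 2), ("D", 5, 2), ("E", 7, 2)]


def predecode(row):
    """Return the output index that each predecoder asserts for `row`."""
    return {name: (row >> shift) & ((1 << width) - 1)
            for name, shift, width in PREDECODER_FIELDS}


def activated_rows(r_first, r_second):
    """Rows (offsets within one subarray) whose local wordline is asserted after
    ACT r_first -> PRE -> ACT r_second when the PRE does not reset the latches."""
    first, second = predecode(r_first), predecode(r_second)
    # Each predecoder keeps the outputs latched by both ACT commands.
    latched = [sorted({first[name], second[name]}) for name, _, _ in PREDECODER_FIELDS]
    rows = []
    # Every combination of latched outputs selects one local wordline.
    for combo in product(*latched):
        row = 0
        for (_, shift, _), value in zip(PREDECODER_FIELDS, combo):
            row |= value << shift
        rows.append(row)
    return sorted(rows)


print(activated_rows(0, 7))  # [0, 1, 6, 7]
```

For the pair (0, 7), only predecoders A and B latch two outputs each, so the model yields the four rows described in the text.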

Figure 8: Example of activating multiple rows in hypothetical row decoder design. The red colors represent asserted signals.

Fig. 9 depicts a higher-level abstraction of the hierarchical row decoder tree in the first subarray when an APA command sequence targets Row 256 and Row 287. Each node represents a signal that is used in the decoding process. The first (root) node is the output of GWLD ($GWL_0$); the other nodes are the predecoded address signals ($P_{A0}, P_{A1}, \ldots, P_{E3}$). Each edge between nodes represents an AND gate that combines the connected nodes. Each box represents one of the predecoders E/D/C/B/A (from root to leaf), i.e., one level of the row decoder tree. When we issue ACT 256, it is decoded through the circuitry and asserts the corresponding predecoded address signals ($P_{E2}$, $P_{D0}$, $P_{C0}$, $P_{B0}$, $P_{A0}$), highlighted in red on the left side of the figure. When ACT 287 is issued with the reduced timings, $P_{C3}$, $P_{B3}$, and $P_{A1}$ are also latched, resulting in activating eight rows in a subarray.

We can formulate our observation as follows: to activate $2^N$ rows, N different predecoders have to latch two signals. For instance, to activate four rows, we issue APA commands targeting rows that make exactly two different predecoders latch two outputs each, as illustrated in Fig. 8. Hence, to activate thirty-two rows, the APA command sequence needs to target rows that make all predecoders latch two outputs (e.g., ACT 127 → PRE → ACT 128). We hypothesize that the upper bound for the number of rows that are simultaneously activated depends on the number of predecoders. The examined module has five predecoders, and thus we activate up to $2^5$ rows.
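The observation can also be stated compactly. In the expression below, $p_i(R)$ is notation introduced here for illustration (it does not appear in the paper) and denotes the output that predecoder $i$ asserts for row $R$; $k$ counts the predecoders that latch two different outputs across the two ACT commands.

```latex
N_{RG}(R_F, R_S) = 2^{k}, \qquad
k = \bigl|\{\, i \in \{A, B, C, D, E\} : p_i(R_F) \neq p_i(R_S) \,\}\bigr|, \qquad
0 \le k \le 5 .
```

For the examples in this section, ACT 0 → PRE → ACT 7 gives $k = 2$ (four rows), ACT 256 → PRE → ACT 287 gives $k = 3$ (eight rows), and ACT 127 → PRE → ACT 128 gives $k = 5$ (thirty-two rows).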

Figure 9: A high-level abstraction of row decoder tree. The red colors represent asserted signals.

5 PULSAR

PULSAR leverages multiple (up to 32) row activation and demonstrates a proof-of-concept to improve success rate and the performance of PuM operations in off-the-shelf DRAM chips by 1) replicating the inputs and 2) introducing new PuM primitives.

5.1 Input Replication

Although modern off-the-shelf DRAM chips do not officially support MAJ3, it is possible to perform a MAJ3 operation in off-the-shelf DRAM chips on four simultaneously activated rows by violating two timing parameters: $t_{RAS}$ and $t_{RP}$ [26]. This mechanism can reduce the deviation on the bitline voltage, depending on the data stored in the activated cells (Fig. 3(b)), making the operation highly susceptible to electrical noise and process variation. Hence, MAJ3 operations based on the state-of-the-art mechanism [26] suffer from a low success rate. To improve the success rate of MAJ3 operations, PULSAR increases the deviation on the bitline voltage towards safe sensing margins. PULSAR achieves this by storing multiple copies of each input across the simultaneously activated rows, i.e., replicating the input operands.

Input replication exploits the majority Boolean algebra rule, where replicating input operands maintains the functionality (e.g., MAJ6(A,B,C,A,B,C) = MAJ3(A,B,C)). Fig. 10 illustrates MAJ3(A, B, C) utilizing 4-, 8-, 16-, and 32-row activation with input replication. The state-of-the-art 4-row activation-based MAJ3, FracDRAM [26], places one row in the neutral state (N) while activating four rows simultaneously. For N-row activation-based MAJ3 with $N > 4$, all inputs are replicated to the maximum extent possible. The remaining rows are then set to the neutral state.
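The replication rule can be checked exhaustively in a few lines. The sketch below assumes that "replicated to the maximum extent possible" means every operand gets the same number of copies and the leftover rows become neutral; the function names are illustrative, and neutral rows are modeled simply as not taking part in the vote.

```python
from itertools import product


def majority(bits):
    """Strict majority of 0/1 inputs (returns 1 only if more than half are 1)."""
    bits = list(bits)
    return int(2 * sum(bits) > len(bits))


def replication_layout(n_rows, m_inputs=3):
    """Return (copies_per_operand, neutral_rows) for an n_rows-row MAJ-m operation."""
    copies = n_rows // m_inputs
    neutral = n_rows - copies * m_inputs
    return copies, neutral


def replicated_majority(inputs, n_rows):
    """Majority over the replicated operands; neutral rows do not vote.
    With an odd number of operands, equal replication can never produce a tie."""
    copies, _neutral = replication_layout(n_rows, len(inputs))
    return majority(list(inputs) * copies)


def check_replication(n_rows, m_inputs=3):
    """True if replication preserves the majority for every input combination."""
    return all(replicated_majority(bits, n_rows) == majority(bits)
               for bits in product((0, 1), repeat=m_inputs))
```

For example, `replication_layout(32)` returns `(10, 2)`, i.e., ten copies of each MAJ3 operand and two neutral rows, and `replication_layout(4)` returns `(1, 1)`, which corresponds to the FracDRAM arrangement.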

Figure 10: MAJ3(A, B, C) utilizing 4-, 8-, 16-, 32-row activation with input replication.

We hypothesize that by leveraging input replication, PULSAR increases the deviation on the bitline voltage towards the safe thresholds and, thus, reduces the effect of process variation. To study our hypothesis, we conduct SPICE simulations and analyze the effect of input replication on the success rate of the sensing operation for MAJ3(1,1,0) operations. We use the reference 55 nm DRAM model from Rambus [91] and scale it based on the ITRS roadmap [40, 117] to model the 22 nm technology node following the PTM transistor models [74]. Fig. 11 shows the effect of process variation on the sensing operation when N rows are activated simultaneously (where $N \in \{1, 4, 8, 16, 32\}$). Fig. 11a depicts the distribution of the deviation on the bitline voltage (y-axis) for different process variation percentages (x-axis). Each $N_{RG} = 1$ box represents the bitline voltage deviation distribution for a single-row activation. Boxes for other $N_{RG}$ values show the bitline voltage deviation distribution for 4-, 8-, 16-, and 32-row activation scenarios. Fig. 11b shows the success rate of MAJ3 operations based on N-row activation, where $N \in \{4, 8, 16, 32\}$.

Figure 11: The effect of input replication on (a) the bitline voltage deviation and (b) the success rate of MAJ3 for various $N_{RG}$ across different process variations using SPICE simulations.

We make three key observations based on Fig. 11. First, increasing the number of rows that are simultaneously activated increases the deviation on the bitline voltage at every process variation percentage. On average, using thirty-two rows to perform MAJ3 (i.e., ten copies of each operand and two neutral rows) yields a 159.05% higher deviation voltage than FracDRAM. Second, activating more than eight rows always results in a higher deviation voltage than single-row activation on average at every process variation percentage. Third, input replication results in a higher success rate under all process variation percentages. Increasing process variation results in a lower success rate for MAJ3 operations with little or no input replication, such as MAJ3 with 4-row activation. The success rate of MAJ3 based on 4-row activation reduces by 46.58% when process variation increases from 0% to 40%. In contrast, the success rate of MAJ3 with 32-row activation reduces by only 0.01%. We conclude that input replication increases the deviation on the bitline voltage towards safer sensing margins and reduces the effect of process variation on the success rate of the MAJ3 operation.
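A first-order charge-conservation argument gives some intuition for the first observation. The expression below is an idealized sketch and not the SPICE model used above: it assumes ideal capacitors, a bitline of capacitance $C_b$ precharged to $V_{DD}/2$, $N$ activated cells of equal capacitance $C_c$, of which $k_1$ store $V_{DD}$, $k_0$ store GND, and the rest are neutral at $V_{DD}/2$. All of these symbols are introduced here for illustration.

```latex
V_{BL} = \frac{C_b \frac{V_{DD}}{2} + C_c \left( k_1 V_{DD} + (N - k_1 - k_0) \frac{V_{DD}}{2} \right)}{C_b + N C_c},
\qquad
\delta = V_{BL} - \frac{V_{DD}}{2} = \frac{(k_1 - k_0)\, C_c}{C_b + N C_c} \cdot \frac{V_{DD}}{2} .
```

For MAJ3(1,1,0) with $c$ copies of each operand, $k_1 - k_0 = c$, so 4-row activation gives $\delta \propto 1 / (C_b + 4 C_c)$ while 32-row activation gives $\delta \propto 10 / (C_b + 32 C_c)$. The numerator grows with the number of copies faster than the denominator whenever $C_b > 0$, which is consistent with the trend in Fig. 11a; the exact percentages depend on the capacitance ratio and on process variation, which this sketch does not capture.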

5.2 New PuM Primitives

PULSAR introduces new PuM primitives enabled by multiple (up to 32) row activation: Multi-RowInit, many-input charge-sharing operations, and Bulk-Write. These new PuM primitives improve the performance of PuM techniques in off-the-shelf DRAM chips. For all the examples that describe the compute primitives, assume that activating an arbitrary row $R_F$, precharging, and activating another arbitrary row $R_S$ (ACT $R_F$ → PRE → ACT $R_S$) activates eight rows simultaneously.

5.2.1 Multi-RowInit

Multi-RowInit copies the content of a row to multiple different rows at once. Fig. 12 demonstrates how the content of $R_F$ is copied to eight rows by issuing the ACT $R_F$ → PRE → ACT $R_S$ command sequence that activates eight rows simultaneously. Initially, the cells in $R_F$ are charged to VDD, while the other rows are at GND (⓿).

Figure 12: Multi-RowInit.

The Multi-RowInit operation consists of four steps. First, PULSAR issues ACT $R_F$ to assert the wordline and to connect $R_F$ to the bitline (❶). Second, PULSAR issues PRE after $t_{RAS}$. By obeying the $t_{RAS}$ parameter, PULSAR ensures that the sense amplifier senses $R_F$ correctly and drives the bitlines to $R_F$’s charge, VDD (❷). Third, PULSAR issues the second ACT while violating $t_{RP}$. This last ACT command interrupts the PRE command. By doing so, PULSAR 1) prevents the bitline from being precharged to VDD/2, 2) keeps $R_F$ and the sense amplifier enabled, and 3) simultaneously enables the remaining seven rows (❸). Finally, since this mechanism keeps enabled the sense amplifier that has already latched the content of $R_F$, all activated rows are fully charged to $R_F$’s data, VDD (❹).

Leveraging the Multi-RowInit primitive, PULSAR can copy one row’s data to $2^n$ rows by simultaneously activating $2^n$ rows, where $n \in [1, 5]$ in off-the-shelf DRAM chips.

5.2.2 Many-Input Charge-Sharing

PULSAR utilizes a many-input charge-sharing mechanism to extend the off-the-shelf-DRAM-based PuM functionality. Fig. 13 depicts the many-input charge-sharing mechanism that performs an eight-input majority operation by issuing the ACT $R_F$ → PRE → ACT $R_S$ command sequence to activate eight rows simultaneously. Initially, $R_F$ is charged to VDD, while the remaining rows activated by the command sequence are at GND (⓿).

Figure 13: Many-input Charge Sharing.

The many-input charge-sharing mechanism consists of four steps. First, PULSAR issues an ACT command to assert the wordline of $R_F$ and thus connects $R_F$ to the bitline (❶). Second, PULSAR issues a PRE command immediately after the first ACT (in < 3 ns). This way, $R_F$ does not have sufficient time to fully share its charge with the bitline (❷). Third, PULSAR sends the last ACT command while greatly violating $t_{RP}$ (< 3 ns), which prevents de-asserting $R_F$ and activates the remaining seven rows, making all eight rows share their charge with the bitline and resulting in an eight-input majority operation. Since the majority of the activated cells hold GND in this example, this leads to a negative deviation on the bitline (❸). Finally, the sense amplifier amplifies the negative deviation and drives the bitline to GND, fully discharging all eight rows (❹).

Leveraging the many-input charge-sharing mechanism, PULSAR extends the functionality of in-DRAM operations by enabling $(2n-1)$-input majority operations, where $n \in [2, 16]$, in off-the-shelf DRAM chips by simultaneously activating up to 32 rows. PULSAR utilizes the prior work’s compute primitive [26] to perform majority operations when an even number of rows are simultaneously activated by neutralizing rows in the charge-sharing process. For instance, in Fig. 13, putting 1) three rows into a neutral state enables MAJ5, and 2) one row into a neutral state enables MAJ7.
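The logical effect of many-input charge sharing with neutral rows can be captured by a purely functional model. The sketch below ignores all analog behavior (noise, timing, process variation) and assumes ideal sensing; `NEUTRAL` and `charge_share` are names introduced here for illustration.

```python
NEUTRAL = None  # a row placed in the neutral (VDD/2) state, e.g., by Frac


def charge_share(rows):
    """Idealized charge sharing over simultaneously activated rows.

    `rows` is a list of equal-length lists, one per activated row, holding
    0, 1, or NEUTRAL for each bitline. Returns the value that the sense
    amplifier restores into every activated cell on each bitline."""
    result = []
    for column in zip(*rows):
        ones = sum(1 for v in column if v == 1)
        zeros = sum(1 for v in column if v == 0)
        if ones == zeros:
            # No net deviation on the bitline: the outcome is undefined in this model.
            raise ValueError("tie on a bitline; neutralize rows to get an odd vote")
        result.append(1 if ones > zeros else 0)
    return result


# The example in the text: R_F holds VDD (1) and the other seven rows hold GND (0).
print(charge_share([[1]] + [[0]] * 7))  # [0]
```

With eight activated rows, neutralizing three rows leaves a five-input vote (MAJ5) and neutralizing one row leaves a seven-input vote (MAJ7), matching the examples above.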

5.2.3 Bulk-Write Mechanism

PULSAR introduces a compute primitive that writes data to multiple rows at once, which we call Bulk-Write. PULSAR performs the Bulk-Write operation in two steps. First, PULSAR issues ACT $R_F$ → PRE → ACT $R_S$ to perform charge-sharing among eight rows using the mechanism from §5.2.2. Second, PULSAR issues a WR command to write data to all activated rows in a single operation. Since the activated rows are connected to the bitline, the WR command drives the bitline to the input data, overwriting the data in all activated rows with the input data from the WR command. Leveraging this mechanism, PULSAR combines multiple write operations into one Bulk-Write operation. The Bulk-Write operation can be extended to write data to up to $2^n$ rows simultaneously, where $n \in [1, 5]$.

6 Use Cases

This section presents our characterization and evaluation of 120 real DRAM chips using the infrastructure described in §4.1. We demonstrate the effectiveness of PULSAR on two fundamental off-the-shelf-DRAM-based PuM use cases (see footnote 3): 1) majority-based computation and 2) cold-boot-attack defense.
Footnote 3: We believe that PULSAR can be leveraged for other PuM operations. We discuss other potential use cases in §7.

We demonstrate that PULSAR 1) significantly increases the success rate of MAJ, and 2) achieves significant performance gain over the state-of-the-art mechanisms.

6.1 Majority Operation

We experimentally characterize many-input majority operations (denoted as MAJM, where $M \in \{3, 5, 7, 9\}$) across different data patterns and different numbers of simultaneously activated rows (denoted as $N_{RG}$, where $N \in \{4, 8, 16, 32\}$) in real DDR4 chips. To our knowledge, our evaluations provide the first comprehensive effort to 1) characterize the success rate of the MAJ3 operation in real DDR4 chips and 2) demonstrate new operations, such as MAJ5 and MAJ7, with high reliability.

We evaluate PULSAR using majority-based arithmetic and logic microbenchmarks. Our results show that introducing new operations leads to significant performance gains in the evaluated microbenchmarks.

6.1.1 Success Rate of Majority Operations

We perform majority operations in off-the-shelf DRAM chips in four steps: we 1) initialize the $N_{RG}$ to perform MAJM with a data pattern, 2) perform the Frac operation (see footnote 4) on one or multiple rows (depending on the values of N and M) to make them neutral during charge-sharing, 3) execute a charge-sharing operation (described in §5.2.2) on the $N_{RG}$, and 4) read back the values in the row buffer.
Footnote 4: For Mfr. M, the Frac operation is not supported. However, we observe that the sense amplifiers of these modules are always biased to one or zero (i.e., not random) depending on the cell polarity (i.e., true or anti). Initializing the neutral rows with all zeros/ones enables the majority operation.

Success Rate. We define a metric to evaluate majority operations, which we call the success rate. Success rate refers to the percentage of bitlines (a total of 65536) that produce the correct output in all trials per $N_{RG}$. If a bitline produces an incorrect result at least once, we refer to this bitline as an unstable bitline that cannot be used to perform majority operations. For example, if an $N_{RG}$ has a 25% success rate, it means a quarter of the bitlines (i.e., 16384 of the bitlines) are stable (i.e., produce correct results all the time) and can be used to perform majority operations.
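The metric can be computed directly from per-trial correctness data. The sketch below assumes the results are available as a list of trials, each holding one Boolean per bitline (True if that bitline produced the expected output in that trial); this data layout and the function names are illustrative assumptions.

```python
def stable_bitlines(correct):
    """Indices of bitlines that are correct in every trial.

    `correct[t][b]` is True if bitline b produced the expected output in trial t."""
    n_bitlines = len(correct[0])
    return [b for b in range(n_bitlines) if all(trial[b] for trial in correct)]


def success_rate(correct):
    """Percentage of bitlines that are stable across all trials."""
    return 100.0 * len(stable_bitlines(correct)) / len(correct[0])


# Two trials over four bitlines: bitline 1 fails once and bitline 2 fails twice.
trials = [
    [True, True, False, True],
    [True, False, False, True],
]
print(success_rate(trials))  # 50.0
```

A single failure in any trial removes a bitline from the stable set, so the metric is stricter than a per-trial average accuracy.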

Data Pattern Dependence. We analyze how the data patterns used in initializing the $N_{RG}$ affect the result of MAJM operations. We initialize rows in the $N_{RG}$ with two different data patterns: 1) all ones/zeros pattern: either all ones or all zeros, and 2) random pattern: random data. We conduct our experiments on 100 randomly selected $N_{RG}$ in a subarray for three randomly selected subarrays in each bank, which results in a total of 4.8K $N_{RG}$ for each tested module. We repeatedly perform the MAJM operation $10^4$ times for random data patterns and $2^M$ times for all ones/zeros patterns (i.e., all truth table inputs for a given M).

Fig. 14 shows a box-and-whiskers plot (see footnote 2) of the MAJ3 success rate of $N_{RG}$ for every N value (x-axis) across different module groups. The state-of-the-art mechanism for MAJ3, FracDRAM [26], is based on $N_{RG} = 4$. We make four key observations from Fig. 14. First, PULSAR achieves a 97.91% (up to 100%) success rate by activating thirty-two rows, a 24.18% higher success rate than FracDRAM on average. Second, the data pattern significantly affects the success rate of the MAJ3 operation. We hypothesize that this occurs due to interference between cells located in close proximity, as demonstrated in prior work [49]. This phenomenon affects the deviation on a bitline during charge-sharing, leading to incorrect results. Third, in all module groups, increasing N results in a higher success rate as it moves the deviation on the bitline closer to the safe margins, as explained in §5.1. Fourth, Mfr. M has a higher success rate than Mfr. H for all $N_{RG}$. We hypothesize that Mfr. M may have more robust sense amplifiers than Mfr. H, which would allow the sense amplifier to correctly amplify the reduced deviation on the bitline voltage. We conclude that input replication greatly increases the success rate of the MAJ3 operation in all tested modules.

Fig. 15 shows a box-and-whiskers plot (see footnote 2) of the MAJ3, MAJ5, MAJ7, and MAJ9 success rate of $N_{RG}$ for every N value (x-axis) across all 21 modules we test using a random data pattern. We omit the majority operations (i.e., MAJ11+ for Mfr. H and MAJ9+ for Mfr. M) that have a <1% success rate on average. We make three key observations from Fig. 15. First, PULSAR can perform the MAJ5 operation with a 73.93% (up to 99.61%) success rate and the MAJ7 operation with a 29.28% (up to 81.92%) success rate. Second, as the number of inputs of the majority operation increases, the success rate decreases. We hypothesize that when we increase the number of inputs, the number of copies of each input decreases, moving the deviation on the bitline closer to unreliable sensing margins. Third, Mfr. M outperforms Mfr. H in every MAJM in terms of success rate, which can be explained by our hypothesis from the observations on Fig. 14. We conclude that by leveraging input replication, PULSAR increases the success rate of the majority operation regardless of the number of input operands in both manufacturers.

Figure 14: MAJ3 success rate of $N_{RG}$ for every N value across different DDR4 modules.
Figure 15: MAJ3, MAJ5, MAJ7, and MAJ9 success rate of $N_{RG}$ for every N value across different DRAM manufacturers.

Spatial Distribution of Success Rate. We study the spatial distribution of the success rate of MAJ3 across every subarray in a DRAM bank of the H0 module. In each subarray, we randomly select 100 $N_{RG}$ for every N and perform the MAJ3 operation using a random data pattern. Fig. 16 depicts how an $N_{RG}$’s average success rate varies across subarrays in a DRAM bank.

We make two key observations from Fig. 16. First, PULSAR significantly increases the success rate of the majority operation in every subarray on average. Second, the overall success rate distribution follows an M-like pattern. The success rate peaks in the first quarter of subarrays and descends in the second quarter of subarrays. This trend repeats itself in the second half of the bank. We hypothesize that this pattern results from the effects of systematic process variation. We conclude that regardless of the spatial location of an $N_{RG}$, PULSAR outperforms FracDRAM by 66.23% in average success rate across all subarrays.

6.1.2 Majority-based Computation

In this section, we study the potential benefits of enabling MAJM operations in off-the-shelf DRAM chips on microbenchmarks. We analyze 1) the performance gain (i.e., speedup in execution time) from using new majority operations and 2) the sensitivity of the performance gain to the number of rows that are simultaneously activated ($N_{RG}$).

The majority operation can be used to implement 1) logic operations such as AND/OR [31, 4, 96, 100, 97, 104, 25] and XOR [5], and 2) full adder operations [4, 25, 31]. These operations are then used as basic building blocks for the target in-DRAM computation (e.g., addition, multiplication) [4, 6, 61, 25].
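As a concrete illustration of these building blocks, the sketch below uses standard majority-logic identities: AND and OR are MAJ3 with a constant input, and a full adder uses MAJ3 for the carry and MAJ5 for the sum. These are well-known constructions and are not necessarily the exact mappings used in the cited frameworks; negation is done in software here, and the function names are illustrative.

```python
from itertools import product


def maj(*bits):
    """Majority of an odd number of 0/1 inputs."""
    return int(2 * sum(bits) > len(bits))


def and_gate(a, b):
    return maj(a, b, 0)  # a constant 0 turns MAJ3 into AND


def or_gate(a, b):
    return maj(a, b, 1)  # a constant 1 turns MAJ3 into OR


def xor_gate(a, b):
    # XOR = (a OR b) AND NOT (a AND b); the negation is done outside MAJ.
    return and_gate(or_gate(a, b), 1 - and_gate(a, b))


def full_adder(a, b, cin):
    cout = maj(a, b, cin)                        # carry-out is MAJ3
    s = maj(a, b, cin, 1 - cout, 1 - cout)       # sum as one MAJ5 with negated carry
    return s, cout


def verify():
    """Check every construction against Python's integer arithmetic."""
    for a, b in product((0, 1), repeat=2):
        assert and_gate(a, b) == (a & b)
        assert or_gate(a, b) == (a | b)
        assert xor_gate(a, b) == (a ^ b)
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        assert 2 * cout + s == a + b + cin
    return True
```

The MAJ5-based sum shows one reason why wider majority operations can help: the sum bit takes a single five-input operation instead of a chain of three-input ones.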

Figure 16: Average MAJ3 success rate across all subarrays in a DRAM bank.

Real DRAM Chip Experiments. We tightly schedule the DRAM commands to perform majority operations and measure the execution time using DRAM Bender. We evaluate the execution time of seven arithmetic & logic microbenchmarks for two vendors (MAJ3, MAJ5, and MAJ7 for Mfr. M and MAJ3, MAJ5, MAJ7, and MAJ9 for Mfr. H). For each majority operation, we choose the $N_{RG}$ that produces the highest throughput across all 21 tested DRAM modules. We perform 32-bit logic (bitwise and, or, and xor reductions) and arithmetic (addition, subtraction, multiplication, and division) computations on 8KB elements. We use 65536-element (DRAM row size) two-input vectors (e.g., A and B) where each element of the vectors is a 32-bit integer. Elements of A and B that have the same index (e.g., A[X] and B[X] in column X) are stored in the same column.

We evaluate PULSAR by employing the framework described in prior work [25], which is based on bit-serial computation and stores the negated values of the operands, computed beforehand in the CPU, in the same subarray as the original operands. Bit-serial computation uses a vertical layout in which operands are aligned along bitlines. It applies bulk bitwise operations to entire DRAM rows and generates results on all bitlines in parallel. This approach enables PuM to perform operations efficiently [31, 4, 25, 100]. We refer the reader to the prior work [25] for the details of the framework.
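As an illustration of the vertical layout, the sketch below models bit-serial addition in Python. It is a simplified software model under stated assumptions, not the framework of [25]: each "row" is a Python list holding one bit position of every element, `maj_rows` models a row-wide majority across all columns, and NOT is computed directly in software, whereas the framework stores precomputed negated operands. All function names are hypothetical.

```python
def maj_rows(*rows):
    """Column-wise majority across an odd number of rows of 0/1 values."""
    k = len(rows)
    return [int(sum(col) > k // 2) for col in zip(*rows)]

def not_row(row):
    """Column-wise NOT (done in software here; an assumption of this sketch)."""
    return [1 - b for b in row]

def to_vertical(values, width):
    """rows[i][c] holds bit i (LSB first) of values[c]."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]

def from_vertical(rows):
    """Inverse of to_vertical: rebuild one integer per column."""
    ncols = len(rows[0])
    return [sum(rows[i][c] << i for i in range(len(rows))) for c in range(ncols)]

def bit_serial_add(a_rows, b_rows):
    """Add two vertically laid-out vectors, one bit position (row) per step."""
    carry = [0] * len(a_rows[0])
    out = []
    for a, b in zip(a_rows, b_rows):
        new_carry = maj_rows(a, b, carry)           # MAJ3 gives the carry-out
        n = not_row(new_carry)
        out.append(maj_rows(a, b, carry, n, n))     # MAJ5 gives the sum bit
        carry = new_carry
    return out

WIDTH = 32
A = [3, 10, 2**31, 123456]
B = [5, 20, 2**31, 654321]
result = from_vertical(bit_serial_add(to_vertical(A, WIDTH), to_vertical(B, WIDTH)))
assert result == [(x + y) % 2**WIDTH for x, y in zip(A, B)]
```

The number of steps depends on the bit width (32 here), not on the number of elements, because every step processes all columns at once.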

Fig. 17 shows the performance of the majority operations of two manufacturers in seven microbenchmarks, normalized to the state-of-the-art mechanism, FracDRAM [26] (i.e., MAJ3 with four rows), which is shown as the blue dashed line. We make three key observations. First, PULSAR outperforms FracDRAM in all microbenchmarks. On average, PULSAR provides a 2.21× (1.46×) performance improvement over FracDRAM in Mfr. M (Mfr. H). Second, increasing the number of operands in the MAJ operation provides higher performance. MAJ7 provides 1.62× (1.31×) the performance improvement provided by MAJ5 in Mfr. M (Mfr. H). Third, in Mfr. H, MAJ9 degrades performance by 2.14×. This is because MAJ9 has a poor success rate (at most 35.35%, shown in Fig. 15), which requires repeatedly performing the MAJ9 operation, resulting in higher latency. We conclude that PULSAR achieves 2.21× better performance than FracDRAM by enabling new PuM primitives.

Figure 17: Speedup over the state-of-the-art (MAJ3) in seven arithmetic & logic microbenchmarks.

Sensitivity to $N_{RG}$. We study the effect of the number of rows that are activated simultaneously ($N_{RG}$) on majority-based computation performance. Increasing $N_{RG}$ increases the success rate due to input replication. However, it can also increase the initialization cost since more rows must be initialized. We evaluate the execution time of seven microbenchmarks based on majority operations. To further analyze the potential benefits and limitations, we study the impact of initialization latency and success rate on the performance of the microbenchmarks for various numbers of rows that are used to realize the majority operation. We study four different scenarios: 1) RealExp: real experiment results, i.e., using empirical latency and success rate values, 2) RealInit: 100% success rate with empirical latency, 3) RealSR: empirical success rate with no initialization latency, and 4) Ideal: 100% success rate with no initialization latency.
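To make the four scenarios concrete, the toy Python model below shows how they differ in which of the two cost factors they keep. It is a sketch under stated assumptions, not the evaluation methodology of this work: it assumes that a failed majority operation is retried together with its initialization until it succeeds, so the expected number of attempts is the inverse of the success rate. The function names and the latency and success rate values are hypothetical placeholders in arbitrary units, not measured results.

```python
def expected_time(t_init, t_maj, success_rate):
    """Expected time per correct result, assuming (init + MAJ) is retried until success."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return (t_init + t_maj) / success_rate

def scenarios(t_init, t_maj, success_rate):
    """Expected time under the four scenarios described in the text."""
    return {
        "RealExp":  expected_time(t_init, t_maj, success_rate),  # empirical latency and success rate
        "RealInit": expected_time(t_init, t_maj, 1.0),           # 100% success, empirical latency
        "RealSR":   expected_time(0.0, t_maj, success_rate),     # empirical success, no init latency
        "Ideal":    expected_time(0.0, t_maj, 1.0),              # 100% success, no init latency
    }

# Hypothetical inputs (arbitrary time units), chosen only to exercise the model.
times = scenarios(t_init=4.0, t_maj=1.0, success_rate=0.5)
for name, t in times.items():
    print(f"{name}: {t}")
```

Under this model, removing initialization latency helps most when the success rate is already high, while raising the success rate helps most when it is low.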

Fig. 18 shows the average speedup of using 8, 16, and 32 rows to perform MAJM over FracDRAM across all microbenchmarks. We make two key observations from Fig. 18 for Mfr. M. First, increasing the success rate results in only negligible performance improvement because the empirical success rate is already high (100% for MAJ5 and 99.95% for MAJ7). Second, for all MAJM operations, since the success rate is high, increasing the number of rows only increases the initialization latency overhead and thus degrades performance. We make two key observations for Mfr. H. First, providing a 100% success rate increases performance by 2.55× on average, as Mfr. H has lower empirical success rates (99.61% for MAJ5, 81.92% for MAJ7, and 35.35% for MAJ9). Second, increasing the number of rows can improve performance as it enables a better success rate. We conclude that for both Mfr. H and Mfr. M, reducing the initialization latency improves the performance of MAJM operations that have a high success rate and can even achieve maximum performance.

Figure 18: Performance sensitivity to $N_{RG}$ of Mfr. M (top) and Mfr. H (bottom) modules. All bars represent the average speedup over FracDRAM across seven microbenchmarks.

6.2 Content Destruction for Cold Boot Attack

A cold boot attack is a physical attack on DRAM that involves hot-swapping a DRAM chip and reading out the contents of the DRAM chip [9, 29, 32, 35, 58, 63, 73, 106, 116, 134]. Cold boot attacks are possible because the data stored in DRAM is not immediately lost when the chip is powered off. This is due to the capacitive nature of DRAM cells, which can hold their data for up to several seconds [9, 48, 64, 65, 88] or minutes [32]. This effect can be exacerbated at low temperatures, resulting in DRAM cells retaining their content even longer.

A practical and secure way to mitigate cold boot attacks is to destroy the DRAM content rapidly during power-off/on [82, 28]. PULSAR can quickly write a predetermined value (e.g., all-zeros) to many rows with Bulk-Write and copy this value to many other rows using Multi-RowInit. This way, PULSAR can be used to rapidly destroy the DRAM content.

Evaluation. We evaluate PULSAR-based content destruction with varying numbers of rows that are simultaneously activated, from 2 to 32. PULSAR-based content destruction with N-row activation can leverage up to N-row activation (e.g., 16-row activation can use 2-, 4-, 8-, and 16-row activation but cannot use 32-row activation). PULSAR-based content destruction chooses the $N_{RG}$ values with a greedy algorithm to destroy the contents of all rows in a bank by issuing the fewest APA command sequences. We compare PULSAR-based content destruction to 1) RowClone [25]-based and 2) FracDRAM [26]-based content destruction. The RowClone-based content destruction is implemented as a two-step process. First, it issues a WR command to write predetermined data to an arbitrary row. Second, it performs RowClone to overwrite the content of the DRAM rows, making the original content inaccessible. The FracDRAM-based content destruction repeatedly sends the Frac operation to every row to put the rows into a neutral state, making them store VDD/2. We schedule the DRAM commands to perform all content destruction operations (i.e., Bulk-Write, Multi-RowInit, RowClone, and Frac) and measure the execution time to overwrite all the data in a bank of an off-the-shelf DRAM module (H7).
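The text does not specify the greedy algorithm in detail. The Python sketch below shows one plausible reading under simplifying assumptions: group sizes are powers of two, any set of rows can be grouped together (real chips constrain which rows can be activated simultaneously), and each group is overwritten by one command sequence. The function name `greedy_group_sizes` and the row counts used are hypothetical and chosen only for illustration.

```python
def greedy_group_sizes(num_rows, max_group):
    """Cover num_rows rows with power-of-two group sizes <= max_group, largest first.

    A group of size 1 stands for an ordinary single-row overwrite.
    """
    if max_group < 1 or max_group & (max_group - 1) != 0:
        raise ValueError("max_group must be a power of two")
    sizes = []
    remaining = num_rows
    size = max_group
    while remaining > 0 and size >= 1:
        # Use as many groups of the current size as fit, then halve the size.
        count, remaining = divmod(remaining, size)
        sizes.extend([size] * count)
        size //= 2
    return sizes

# Hypothetical example: 1000 rows, comparing different maximum group sizes.
for n in (2, 4, 8, 16, 32):
    groups = greedy_group_sizes(1000, n)
    assert sum(groups) == 1000
    print(f"max {n}-row activation: {len(groups)} operations")
```

In this simplified model, doubling the maximum group size roughly halves the number of operations, which matches the intuition that activating more rows simultaneously reduces the total number of command sequences.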

Fig. 19 shows the speedup in execution time for content destruction normalized to the RowClone-based content destruction’s execution time.

Figure 19: Speedup over the RowClone-based content destruction in a DRAM bank.

We make two key observations based on Fig. 19. First, PULSAR-based content destruction with 4-, 8-, 16-, and 32-row activation outperforms both RowClone-based and FracDRAM-based content destruction, by up to 20.87× and 7.55×, respectively. Second, increasing the number of simultaneously activated rows increases the speedup of the PULSAR-based technique because increasing the number of operands in Multi-RowInit and Bulk-Write decreases the total number of operations. We conclude that PULSAR-based content destruction outperforms both techniques and destroys DRAM content significantly faster than the state-of-the-art techniques.

7 Discussion

This section discusses 1) how PULSAR can be leveraged in addition to the use cases in §6, and 2) PULSAR’s limitations.

Extending PULSAR’s Use Cases. We demonstrate the effectiveness of PULSAR in two use cases. However, PULSAR can be leveraged to improve other PuM systems. This section discusses two additional use cases. First, an end-to-end framework that uses a DRAM chip as a PuM substrate can leverage PULSAR. For example, SIMDRAM [31] automatically creates desired complex operations (e.g., addition and multiplication) by employing majority-inverter graphs to accelerate a broad range of workloads, including graph processing, databases, neural networks, and genome analysis. Unfortunately, SIMDRAM uses only MAJ3 due to the low success rate of majority operations with more inputs (e.g., MAJ5). SIMDRAM can be extended by leveraging PULSAR’s input replication technique to reliably execute MAJ operations with more inputs (e.g., MAJ5, MAJ7, and MAJ9) and thus achieve higher performance. However, an end-to-end system needs to address several key challenges, such as (1) programming interface, (2) compiler support, and (3) end-to-end system integration. Designing an end-to-end system for PULSAR is a direction that future work can explore. Second, PULSAR can be used to generate physical unclonable functions (PUFs) and true random numbers (TRNs). Prior works experimentally demonstrate that it is possible to generate high-throughput PUFs [50] and TRNs [51, 80] in off-the-shelf DRAM chips by violating timing constraints. Unfortunately, the throughput of these works is bound by the latency of initializing multiple DRAM rows before each PUF and TRN generation. These works can use PULSAR’s Multi-RowInit and Bulk-Write primitives to reduce their initialization latency. We leave the exploration of these use cases for future work.

Limitations. We identify limitations of PULSAR under four categories. First, even though off-the-shelf DRAM chips allow simultaneously activating multiple DRAM rows, they do not provide the user with the flexibility of choosing which rows to activate. Second, all of the tested DRAM chips that successfully perform multiple-row activation are from Micron (Mfr. M) and SK Hynix (Mfr. H). We also conduct experiments on 64 DRAM chips from another major manufacturer, Samsung. Unfortunately, we do not observe a successful multiple-row activation in any of the tested Samsung chips. We hypothesize that these DRAM chips have internal circuitry that ignores the PRE command or the second ACT command when the timing parameters ($t_{RP}$ and $t_{RAS}$) are greatly violated, which agrees with the hypotheses of prior work [132]. Unlike prior works [26, 25, 80, 132], PULSAR achieves a high success rate on DRAM chips from Mfr. M, which requires a deep understanding of the hierarchical row decoder to choose the set of DRAM rows to target for the two ACT commands. Third, PULSAR is capable of performing many-input majority operations (theoretically, up to MAJ31). However, PULSAR cannot reliably perform majority operations with more than nine inputs (i.e., MAJ9+) due to very low success rates (§6.1.1). Fourth, PULSAR potentially has an effect on transient errors in DRAM chips. In our experiments, described in §6.1.1, we check for bitflips in the whole DRAM bank. We do not observe any errors in rows outside of the row group across any of the tested DRAM chips. We believe that investigating all potential effects of PULSAR on any type of transient error requires rigorous analysis and extensive exploration, which warrants its own study.

PULSAR is not an execution model that is immediately usable. We demonstrate a proof-of-concept of performing multi-row activation in real off-the-shelf DRAM chips and its potential benefits in improving the success rate and the performance of previously proposed PuM operations. Our work contributes towards a better understanding of the capability of real off-the-shelf DRAM chips. We conclude that none of these limitations fundamentally prevents a system designer from using off-the-shelf DRAM chips to perform PuM operations and thus benefiting from PULSAR’s high reliability and performance. We hope and expect that DRAM manufacturers will adopt our approach and officially support simultaneous many-row activation in future DRAM chips, alleviating all of these limitations.

8 Related Work

To our knowledge, this is the first work that demonstrates a proof-of-concept that off-the-shelf DDR4 DRAM chips are capable of simultaneously activating up to 32 rows. PULSAR leverages this new observation and improves the success rate and the performance of PuM operations compared to the state-of-the-art PuM technique [26].

Multiple Row Activation in Off-the-shelf DRAM. Many prior works propose various forms of PuM operations in off-the-shelf DRAM devices using multiple row activation [80, 25, 26, 132]. ComputeDRAM [25] presents a DRAM command sequence (APA) that enables triple row activation, resulting in a bitwise AND/OR function, by violating timing parameters between consecutive DRAM commands. FracDRAM [26] stores fractional values in off-the-shelf DDR3 devices by employing a DRAM command sequence (ACT → PRE) with reduced timing parameters. By leveraging the fractional values stored in DRAM, FracDRAM provides an improved success rate in the MAJ3 operation and implements a physical unclonable function in DRAM. FracDRAM observes that up to 16 rows can be simultaneously activated in off-the-shelf DDR3 chips. However, FracDRAM does not provide any characterization or hypothesis of the reason behind this observation. PULSAR introduces many-row activation (up to 32 rows) by carefully choosing the target rows that are activated. PULSAR proposes a hypothetical row decoder design that explains how many rows can be activated simultaneously. PULSAR improves the success rate of existing MAJ3 operations and improves the performance of PuM applications by introducing new PuM primitives based on many-row activation.

Other works enable different functionalities using simultaneous many-row activation. QUAC-TRNG [80] introduces quadruple row activation and exploits this phenomenon to generate true random numbers in off-the-shelf DRAM chips. QUAC-TRNG proposes a hypothetical row decoder design that enables quadruple row activation. HiRA [132] introduces a hidden row activation mechanism by simultaneously opening two rows, leveraging the hidden row activation to implement a refresh-based RowHammer mitigation mechanism. PULSAR can be used to potentially extend these mechanisms as these proposals leverage many-row activation.

Other Off-the-shelf-DRAM-based PuM. Prior works design off-the-shelf-DRAM-based mechanisms to implement TRNG and PUF. DRAM-based TRNGs generate true random numbers by violating timing parameters [51, 110], using retention-based failures [47, 109] and using startup values [19, 112]. DRAM-based PUFs generate device-specific signatures using retention-based failures [47, 109, 131], by violating timing parameters [50], by exploiting write access latencies [33], and using startup values [111]. These operations can leverage the functionality of PULSAR to reduce their initialization latency, thereby increasing their throughput.

Modified-DRAM-based PuM. Prior works propose modifications to DRAM to perform PuM operations [61, 60, 59, 66, 81, 86, 84, 87, 85, 89, 90, 93, 96, 98, 97, 101, 102, 100, 103, 99, 104, 105, 107, 108, 113, 122, 125, 129, 130, 133, 135, 136, 137, 138]. RowClone [98] is a low-cost DRAM architecture that can perform bulk data movement operations inside DRAM chips. Ambit [100] modifies the DRAM circuitry to perform bitwise MAJ3 (and thus bitwise AND/OR) by activating three rows simultaneously and bitwise NOT operations in DRAM. Unfortunately, these mechanisms require changes to DRAM chips and are not applicable to off-the-shelf DRAM chips.

9 Conclusion

We introduce PULSAR, a proof-of-concept technique that enables high-success-rate and high-performance PuM operations in off-the-shelf DRAM chips. PULSAR leverages the key observation that by issuing a carefully crafted sequence of DRAM commands, up to 32 rows can be activated simultaneously. PULSAR improves 1) the success rate through input data replication and 2) performance by enabling new PuM primitives. Our experimental results, conducted on 120 off-the-shelf DDR4 DRAM chips from two major manufacturers, demonstrate the effectiveness of PULSAR on two use cases. PULSAR achieves significant improvement over the state-of-the-art in terms of both success rate and performance. Our results show that, compared to the state-of-the-art mechanism, PULSAR has a 24.18% higher success rate and improves the performance in majority-based microbenchmarks by 2.21× on average.

Acknowledgements

We thank the SAFARI Research Group members for providing a stimulating intellectual environment. We acknowledge the generous gifts from our industrial partners, including Google, Huawei, Intel, and Microsoft. This work is supported in part by the Semiconductor Research Corporation (SRC), the ETH Future Computing Laboratory (EFCL), and the AI Chip Center for Emerging Smart Systems (ACCESS).

References

  • [1] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute Caches,” in HPCA, 2017.
  • [2] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing,” in ISCA, 2015.
  • [3] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture,” in ISCA, 2015.
  • [4] M. F. Ali, A. Jaiswal, and K. Roy, “In-Memory Low-Cost Bit-Serial Addition Using Commodity DRAM Technology,” in TCAS I, 2019.
  • [5] E. Alkaldy, K. Navi, and F. Sharifi, “A novel design approach for multi-input xor gate using multi-input majority function,” Arabian Journal for Science and Engineering, vol. 39, pp. 7923–7932, 2014.
  • [6] S. Angizi and D. Fan, “GraphiDe: A Graph Processing Accelerator Leveraging In-DRAM-Computing,” in GLSVLSI, 2019.
  • [7] S. Angizi and D. Fan, “ReDRAM: A Reconfigurable Processing-in-DRAM Platform for Accelerating Bulk Bit-Wise Operations,” in ICCAD, 2019.
  • [8] F. Bai, S. Wang, X. Jia, Y. Guo, B. Yu, H. Wang, C. Lai, Q. Ren, and H. Sun, “A low-cost reduced-latency dram architecture with dynamic reconfiguration of row decoder,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 1, pp. 128–141, 2022.
  • [9] J. Bauer, M. Gruhn, and F. C. Freiling, “Lest we forget: Cold-boot attacks on scrambled ddr3 memory,” Digital Investigation, vol. 16, pp. S65–S74, 2016, DFRWS 2016 Europe. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1742287616300032
  • [10] M. Besta, R. Kanakagiri, G. Kwasniewski, R. Ausavarungnirun, J. Beránek, K. Kanellopoulos, K. Janda, Z. Vonarburg-Shmaria, L. Gianinazzi, I. Stefan et al., “SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems,” in MICRO, 2021.
  • [11] M. Besta, Z. Vonarburg-Shmaria, Y. Schaffner, L. Schwarz, G. Kwasniewski, L. Gianinazzi, J. Beranek, K. Janda, T. Holenstein, S. Leisinger et al., “Graphminesuite: Enabling high-performance and programmable graph mining algorithms with set algebra,” arXiv preprint arXiv:2103.03653, 2021.
  • [12] M. N. Bojnordi and E. Ipek, “Pardis: A programmable memory controller for the ddrx interfacing standards,” in ISCA, 2012.
  • [13] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks,” in ASPLOS, 2018.
  • [14] C.-Y. Chan and Y. E. Ioannidis, “Bitmap index design and evaluation,” in Proceedings of the 1998 ACM SIGMOD international conference on Management of data, 1998, pp. 355–366.
  • [15] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” in ISCA, 2016.
  • [16] J. Dean and L. A. Barroso, “The Tail at Scale,” CACM, 2013.
  • [17] L. Deng, “The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web],” IEEE Signal Processing Magazine, 2012.
  • [18] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, “DrAcc: A DRAM Based Accelerator for Accurate CNN Inference,” in DAC, 2018.
  • [19] C. Eckert, F. Tehranipoor, and J. A. Chandy, “Drng: Dram-based random number generation using its startup value behavior,” in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS).   IEEE, 2017, pp. 1260–1263.
  • [20] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, “Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks,” in ISCA, 2018.
  • [21] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware,” in ASPLOS, 2012.
  • [22] N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo, “Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency,” Intel Corp., 2008, white paper.
  • [23] D. Fujiki, S. Mahlke, and R. Das, “Duality Cache for Data Parallel Acceleration,” in ISCA, 2019.
  • [24] C. Gao, X. Xin, Y. Lu, Y. Zhang, J. Yang, and J. Shu, “Parabit: processing parallel bitwise operations in nand flash memory based ssds,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 59–70.
  • [25] F. Gao, G. Tziantzioulis, and D. Wentzlaff, “ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs,” in MICRO, 2019.
  • [26] F. Gao, G. Tziantzioulis, and D. Wentzlaff, “Fracdram: Fractional values in off-the-shelf dram,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 885–899.
  • [27] S. Ghose, T. Li, N. Hajinazar, D. S. Cali, and O. Mutlu, “Demystifying Complex Workload–DRAM Interactions: An Experimental Study,” in SIGMETRICS, 2020.
  • [28] T. C. Group, “Tcg platform reset attack mitigation specification,” TCG, 2008.
  • [29] M. Gruhn and T. Müller, “On the practicability of cold boot attacks,” in 2013 International Conference on Availability, Reliability and Security, 2013, pp. 390–397.
  • [30] Z. Guz, M. Awasthi, V. Balakrishnan, M. Ghosh, A. Shayesteh, T. Suri, and S. Semiconductor, “Real-time analytics as the killer application for processing-in-memory,” Near Data Processing (WoNDP), pp. 10–2, 2014.
  • [31] N. Hajinazar, G. F. Oliveira, S. Gregorio, J. D. Ferreira, N. M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu, “SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM,” in ASPLOS, 2021.
  • [32] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten, “Lest we remember: Cold-boot attacks on encryption keys,” Commun. ACM, vol. 52, no. 5, p. 91–98, may 2009. [Online]. Available: https://doi.org/10.1145/1506409.1506429
  • [33] M. S. Hashemian, B. Singh, F. Wolff, D. Weyer, S. Clay, and C. Papachristou, “A robust authentication methodology using physically unclonable functions in dram arrays,” in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).   IEEE, 2015, pp. 647–652.
  • [34] Z. He, L. Yang, S. Angizi, A. S. Rakin, and D. Fan, “Sparse BD-Net: A Multiplication-Less DNN with Sparse Binarized Depth-Wise Separable Convolution,” JETC, vol. 16, no. 2, pp. 1–24, 2020.
  • [35] C. Hilgers, H. Macht, T. Müller, and M. Spreitzenbarth, “Post-mortem memory analysis of cold-booted android devices,” in 2014 Eighth International Conference on IT Security Incident Management & IT Forensics, 2014, pp. 62–75.
  • [36] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar, O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems,” in ISCA, 2016.
  • [37] T. Hussain, A. Haider, and E. Ayguadé, “Pmss: A programmable memory system and scheduler for complex memory patterns,” Journal of Parallel and Distributed Computing, 2014.
  • [38] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision,” in ISCA, 2019.
  • [39] Intel Corp., “Intel i7-6700k skylake,” 2015, ark.intel.com/products/88195.
  • [40] International Technology Roadmap for Semiconductors, “ITRS Reports,” http://www.itrs2.net/itrs-reports.html, 2015.
  • [41] JEDEC, JESD209-4B: Low Power Double Data Rate 4 (LPDDR4) Standard, 2017.
  • [42] JEDEC, JESD209-5A: LPDDR5 SDRAM Standard, 2020.
  • [43] JEDEC, JESD79-4C: DDR4 SDRAM Standard, 2020.
  • [44] JEDEC, JESD79-5: DDR5 SDRAM Standard, 2020.
  • [45] S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, and S. Xu, “Bluedbm: An appliance for big data analytics,” ACM SIGARCH Computer Architecture News, vol. 43, no. 3S, pp. 1–13, 2015.
  • [46] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
  • [47] C. Keller, F. Gurkaynak, H. Kaeslin, and N. Felber, “Dynamic Memory-based Physically Unclonable Function for the Generation of Unique Identifiers and True Random Numbers,” in ISCAS, 2014.
  • [48] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.
  • [49] S. Khan, D. Lee, and O. Mutlu, “PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM,” in DSN, 2016.
  • [50] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability Tradeoff in Modern Commodity DRAM Devices,” in HPCA, 2018.
  • [51] J. S. Kim, M. Patel, H. Hassan, L. Orosa, and O. Mutlu, “D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput,” in HPCA, 2019.
  • [52] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO, 2010.
  • [53] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
  • [54] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
  • [55] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2016.
  • [56] A. Krizhevsky, “Convolutional Deep Belief Networks on CIFAR-10,” https://www.cs.toronto.edu/~kriz/conv-cifar10-aug2010.pdf, 2010.
  • [57] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” in MICRO, 2009.
  • [58] H. T. Lee, H. Kim, Y.-J. Baek, and J. H. Cheon, “Correcting errors in private keys obtained from cold boot attacks,” in Information Security and Cryptology - ICISC 2011, H. Kim, Ed.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.
  • [59] S. Li, A. O. Glova, X. Hu, P. Gu, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “SCOPE: A Stochastic Computing Engine for DRAM-Based In-Situ Accelerator,” in MICRO, 2018.
  • [60] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “DRISA: A DRAM-Based Reconfigurable In-Situ Accelerator,” in MICRO, 2017.
  • [61] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories,” in DAC, 2016.
  • [62] Y. Li and J. M. Patel, “BitWeaving: Fast Scans for Main Memory Data Processing,” in SIGMOD, 2013.
  • [63] S. Lindenlauf, H. Höfken, and M. Schuba, “Cold boot attacks on ddr2 and ddr3 sdram,” in 2015 10th International Conference on Availability, Reliability and Security, 2015, pp. 287–292.
  • [64] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” in ISCA, 2013.
  • [65] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
  • [66] T. A. Manning, “Apparatuses and Methods for Comparing Data Patterns in Memory,” 2018, US Patent 9,934,856.
  • [67] Maxwell, “FT20X User Manual,” https://www.maxwell-fa.com/upload/files/base/8/m/311.pdf.
  • [68] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in IMW, 2013.
  • [69] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, “Enabling Practical Processing in and Near Memory For Data-Intensive Computing,” in DAC, 2019.
  • [70] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, “Processing Data Where It Makes Sense: Enabling In-Memory Computation,” in Microprocessors and Microsystems, 2019.
  • [71] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
  • [72] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory Systems,” SUPERFRI, 2014.
  • [73] T. Müller, A. Dewald, and F. Freiling, “Aesse: a cold-boot resistant implementation of aes,” 04 2010, pp. 42–47.
  • [74] Nanoscale Integration and Modeling (NIMO) Group, ASU, “Predictive Technology Model (PTM),” http://ptm.asu.edu/, 2012.
  • [75] K. Navi, M. H. Moaiyeri, R. F. Mirzaee, O. Hashemipour, and B. M. Nezhad, “Two new low-power full adders based on majority-not gates,” Microelectronics journal, vol. 40, no. 1, pp. 126–130, 2009.
  • [76] NVidia, “Nvidia titan x,” 2016, nvidia.com/en-us/geforce/products/10series/titan-xp/.
  • [77] A. Olgun, H. Hassan, A. G. Yağlıkcı, Y. C. Tuğrul, L. Orosa, H. Luo, M. Patel, E. Oğuz, and O. Mutlu, “DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips,” arXiv:2211.05838 [cs.AR], 2022.
  • [78] A. Olgun, J. G. Luna, K. Kanellopoulos, B. Salami, H. Hassan, O. Ergin, and O. Mutlu, “Pidram: A holistic end-to-end fpga-based framework for processing-in-dram,” ACM Transactions on Architecture and Code Optimization, vol. 20, no. 1, pp. 1–31, 2022.
  • [79] A. Olgun, J. G. Luna, K. Kanellopoulos, B. Salami, H. Hassan, O. Ergin, and O. Mutlu, “PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM,” TACO, 2022.
  • [80] A. Olgun, M. Patel, A. G. Yağlıkçı, H. Luo, J. S. Kim, N. Bostanci, N. Vijaykumar, O. Ergin, and O. Mutlu, “QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips,” in ISCA, 2021.
  • [81] G. F. Oliveira, J. Gómez-Luna, S. Ghose, A. Boroumand, and O. Mutlu, “Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud,” IEEE Micro, 2022.
  • [82] L. Orosa, Y. Wang, M. Sadrosadati, J. S. Kim, M. Patel, I. Puddu, H. Luo, K. Razavi, J. Gómez-Luna, H. Hassan, N. Mansouri-Ghiasi, S. Ghose, and O. Mutlu, “Codic: A low-cost substrate for enabling custom in-dram functionalities and optimizations,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 484–497.
  • [83] J. Park, R. Azizi, G. F. Oliveira, M. Sadrosadati, R. Nadig, D. Novo, J. Gómez-Luna, M. Kim, and O. Mutlu, “Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory,” in MICRO, 2022.
  • [84] F. Parveen, S. Angizi, Z. He, and D. Fan, “Low Power In-Memory Computing Based on Dual-Mode SOT-MRAM,” in ISLPED, 2017.
  • [85] F. Parveen, S. Angizi, Z. He, and D. Fan, “IMCS2: Novel Device-to-Architecture Co-Design For Low-Power In-Memory Computing Platform using Coterminous Spin Switch,” in IEEE Trans. Magn., 2018.
  • [86] F. Parveen, Z. He, S. Angizi, and D. Fan, “Hybrid Polymorphic Logic Gate with 5-Terminal Magnetic Domain Wall Motion Device,” in ISVLSI, 2017.
  • [87] F. Parveen, Z. He, S. Angizi, and D. Fan, “HielM: Highly Flexible In-Memory Computing using STT MRAM,” in ASP-DAC, 2018.
  • [88] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
  • [89] A. S. Rakin, S. Angizi, Z. He, and D. Fan, “PIM-TGAN: A Processing-in-Memory Accelerator for Ternary Generative Adversarial Networks,” in ICCD, 2018.
  • [90] A. K. Ramanathan, G. S. Kalsi, S. Srinivasa, T. M. Chandran, K. R. Pillai, O. J. Omer, V. Narayanan, and S. Subramoney, “Look-Up Table Based Energy Efficient Processing in Cache Support for Neural Network Acceleration,” in MICRO, 2020.
  • [91] Rambus Inc., “Rambus Power Model,” https://www.rambus.com/energy/.
  • [92] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in European Conference on Computer Vision.   Springer, 2016, pp. 525–542.
  • [93] S. H. S. Rezaei, M. Modarressi, R. Ausavarungnirun, M. Sadrosadati, O. Mutlu, and M. Daneshtalab, “NoM: Network-on-Memory for Inter-Bank Data Transfer in Highly-Banked Memories,” in CAL, 2020.
  • [94] SAFARI Research Group, “DRAM Bender — GitHub Repository,” https://github.com/CMU-SAFARI/DRAM-Bender, 2022.
  • [95] SAFARI Research Group, “PiDRAM Source Code,” https://github.com/CMU-SAFARI/PiDRAM, 2022.
  • [96] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM,” arXiv:1611.09988 [cs.AR], 2016.
  • [97] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Fast Bulk Bitwise AND and OR in DRAM,” in CAL, 2015.
  • [98] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. Mowry, “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization,” in MICRO, 2013.
  • [99] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “RowClone: Accelerating Data Movement and Initialization Using DRAM,” arXiv:1805.03502 [cs.AR], 2018.
  • [100] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
  • [101] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
  • [102] V. Seshadri and O. Mutlu, “The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR,” arXiv:1610.09603 [cs.AR], 2016.
  • [103] V. Seshadri and O. Mutlu, “Simple Operations in Memory to Reduce Data Movement,” in Adv. Comput., 2017.
  • [104] V. Seshadri and O. Mutlu, “In-DRAM Bulk Bitwise Execution Engine,” arXiv:1905.09822, 2019.
  • [105] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in ISCA, 2016.
  • [106] P. Simmons, “Security through amnesia: A software-based solution to the cold boot attack on disk encryption,” in Proceedings of the 27th Annual Computer Security Applications Conference, ser. ACSAC ’11.   New York, NY, USA: Association for Computing Machinery, 2011, p. 73–82. [Online]. Available: https://doi.org/10.1145/2076732.2076743
  • [107] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning,” in HPCA, 2017.
  • [108] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “GraphR: Accelerating Graph Processing Using ReRAM,” in HPCA, 2018.
  • [109] S. Sutar, A. Raha, and V. Raghunathan, “D-PUF: An Intrinsically Reconfigurable DRAM PUF for Device Authentication in Embedded Systems,” in CASES, 2016.
  • [110] B. B. Talukder, J. Kerns, B. Ray, T. Morris, and M. T. Rahman, “Exploiting DRAM Latency Variations for Generating True Random Numbers,” in 2019 IEEE International Conference on Consumer Electronics (ICCE).   IEEE, 2019, pp. 1–6.
  • [111] F. Tehranipoor, N. Karimian, W. Yan, and J. A. Chandy, “DRAM-Based Intrinsic Physically Unclonable Functions for System-Level Security and Authentication,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 3, pp. 1085–1097, 2016.
  • [112] F. Tehranipoor, W. Yan, and J. A. Chandy, “Robust Hardware True Random Number Generators Using DRAM Remanence Effects,” in 2016 IEEE International Symposium on Hardware Oriented Security and Trust (HOST).   IEEE, 2016, pp. 79–84.
  • [113] Y. Tian, T. Wang, Q. Zhang, and Q. Xu, “ApproxLUT: A Novel Approximate Lookup Table-Based Accelerator,” in ICCAD, 2017.
  • [114] M. Torabzadehkashi, S. Rezaei, A. Heydarigorji, H. Bobarshad, V. Alves, and N. Bagherzadeh, “Catalina: in-storage processing acceleration for scalable big data analytics,” in 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP).   IEEE, 2019, pp. 430–437.
  • [115] M. A. Turi and J. G. Delgado-Frias, “High-performance low-power selective precharge schemes for address decoders,” IEEE transactions on circuits and systems II: express briefs, vol. 55, no. 9, pp. 917–921, 2008.
  • [116] R. Villanueva-Polanco, “Cold boot attacks on bliss,” in Progress in Cryptology – LATINCRYPT 2019: 6th International Conference on Cryptology and Information Security in Latin America, Santiago de Chile, Chile, October 2–4, 2019, Proceedings.   Berlin, Heidelberg: Springer-Verlag, 2019, p. 40–61. [Online]. Available: https://doi.org/10.1007/978-3-030-30530-7_3
  • [117] T. Vogelsang, “Understanding the Energy Consumption of Dynamic Random Access Memories,” in MICRO, 2010.
  • [118] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang et al., “BigDataBench: A Big Data Benchmark Suite from Internet Services,” in HPCA, 2014.
  • [119] Y. Wang, L. Orosa, X. Peng, Y. Guo, S. Ghose, M. Patel, J. S. Kim, J. G. Luna, M. Sadrosadati, N. M. Ghiasi et al., “FIGARO: Improving System Performance via Fine-Grained In-DRAM Data Relocation and Caching,” in MICRO, 2020.
  • [120] N. H. Weste and D. Harris, CMOS VLSI design: a circuits and systems perspective.   Pearson Education India, 2015.
  • [121] K. Wu, “Fastbit: an efficient indexing technology for accelerating data-intensive science,” in Journal of Physics: Conference Series, vol. 16, no. 1.   IOP Publishing, 2005, p. 556.
  • [122] L. Wu, R. Sharifi, A. Venkat, and K. Skadron, “DRAM-CAM: General-Purpose Bit-Serial Exact Pattern Matching,” in CAL, 2022.
  • [123] M.-C. Wu and A. P. Buchmann, “Encoded bitmap indexing for data warehouses,” in Proceedings 14th International Conference on Data Engineering.   IEEE, 1998, pp. 220–230.
  • [124] W. A. Wulf and S. A. McKee, “Hitting the Memory Wall: Implications of the Obvious,” CAN, 1995.
  • [125] L. Xie, H. A. Du Nguyen, M. Taouil, S. Hamdioui, and K. Bertels, “Fast Boolean Logic Mapped on Memristor Crossbar,” in ICCD, 2015.
  • [126] Xilinx, “Adaptable accelerator cards for data center workloads,” 2021. [Online]. Available: https://www.xilinx.com/products/boards-and-kits/alveo.html
  • [127] Xilinx, “FPGAs and 3D ICs,” 2021. [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga.html
  • [128] Xilinx Inc., “Xilinx Alveo U200 FPGA Board,” https://www.xilinx.com/products/boards-and-kits/alveo/u200.html.
  • [129] X. Xin, Y. Zhang, and J. Yang, “ROC: DRAM-Based Processing with Reduced Operation Cycles,” in DAC, 2019.
  • [130] X. Xin, Y. Zhang, and J. Yang, “ELP2IM: Efficient and Low Power Bitwise Operation Processing in DRAM,” in HPCA, 2020.
  • [131] W. Xiong, A. Schaller, N. A. Anagnostopoulos, M. U. Saleem, S. Gabmeyer, S. Katzenbeisser, and J. Szefer, “Run-time Accessible DRAM PUFs in Commodity Devices,” in CHES, 2016.
  • [132] A. G. Yağlıkçı, A. Olgun, M. Patel, H. Luo, H. Hassan, L. Orosa, O. Ergin, and O. Mutlu, “HiRA: Hidden Row Activation for Reducing Refresh Latency of Off-the-Shelf DRAM Chips,” arXiv preprint arXiv:2209.10198, 2022.
  • [133] L. Yang, S. Angizi, and D. Fan, “A Flexible Processing-in-Memory Accelerator for Dynamic Channel-Adaptive Deep Neural Networks,” in ASP-DAC, 2020.
  • [134] S. F. Yitbarek, M. T. Aga, R. Das, and T. Austin, “Cold boot attacks are still hot: Security analysis of memory scramblers in modern processors,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 313–324.
  • [135] J. Yu, H. A. Du Nguyen, L. Xie, M. Taouil, and S. Hamdioui, “Memristive Devices for Computation-in-Memory,” in DATE, 2018.
  • [136] J. T. Zawodny and G. E. Hush, “Apparatuses and Methods to Reverse Data Stored in Memory,” 2018, US Patent 9,959,923.
  • [137] Y. Zha and J. Li, “Hyper-AP: Enhancing Associative Processing through a Full-Stack Optimization,” in ISCA, 2020.
  • [138] H. Zhao, A. Goda, K. K. Parat, A. G. Mauri, H. Liu, T. Tanzawa, S. Yamada, and K. Sakui, “Apparatuses and Methods to Control Body Potential in Memory Operations,” 2017, US Patent 9,536,618.

Appendix A PULSAR System Integration

Power Constraints. We count PULSAR’s row activations and issue them with respect to the four-activation window ($t_{FAW}$) constraint in DDRx DRAM standards [43, 44, 41, 42], which limits the rate of performed activations in a rank to stay under the power budget. Hence, we ensure that the row activations are performed within the power budget of a DRAM rank.
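To illustrate the kind of bookkeeping this constraint implies, the following Python sketch delays row activations so that no window of length $t_{FAW}$ contains more than four of them. It is a minimal model under stated assumptions, not PULSAR's controller logic: the function name `schedule_activations`, the optional `t_rrd_ns` spacing parameter, and the timing values in the usage line are hypothetical, and the sketch assumes every activated row counts as one activation.

```python
from collections import deque


def schedule_activations(request_times_ns, t_faw_ns, t_rrd_ns=0.0):
    """Return issue times for row activations in one rank.

    Assumption: request times are non-decreasing and each activated row
    counts as one activation toward the four-activation window.
    """
    issued = []
    last_four = deque(maxlen=4)  # issue times of the four most recent activations
    for requested in request_times_ns:
        issue = requested
        if issued:
            # Hypothetical minimum spacing between consecutive activations.
            issue = max(issue, issued[-1] + t_rrd_ns)
        if len(last_four) == 4:
            # A fifth activation must wait until t_FAW after the oldest of the last four.
            issue = max(issue, last_four[0] + t_faw_ns)
        last_four.append(issue)
        issued.append(issue)
    return issued


# Six back-to-back requests with illustrative (not datasheet) timings.
example_schedule = schedule_activations([0, 0, 0, 0, 0, 0], t_faw_ns=30.0, t_rrd_ns=5.0)
```

In this example the fifth activation is pushed from 20 ns to 30 ns, because issuing it earlier would place five activations inside one 30 ns window.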

Compatibility with Off-the-Shelf DRAM Chips. We experimentally demonstrate that PULSAR works on 120 off-the-shelf DDR4 DRAM chips from two major DRAM manufacturers. Therefore, PULSAR does not require any modifications to these real DRAM chips.

Compatibility with Different Computing Systems. We discuss PULSAR’s compatibility with three types of computing systems: 1) FPGA-based systems (e.g., PiDRAM [79, 77]), 2) contemporary processors, and 3) systems with programmable memory controllers [37, 12]. First, PULSAR can be easily integrated into all existing FPGA-based systems that use DRAM to store data [79, 126, 127, 77]. We showcase a system integration using DRAM Bender [77] for our performance evaluation as it is widely available and does not require any changes in the processor circuitry (§B). Second, contemporary processors require modifications to their memory controller logic to implement PULSAR. Implementing PULSAR is a design-time decision that requires balancing manufacturing cost with PULSAR’s performance benefits. We show that PULSAR significantly improves system performance (§B), but we leave the analysis of such integration’s hardware complexity for future work. Third, systems that employ programmable controllers [37, 12] can be relatively easily modified to implement PULSAR by programming PULSAR’s operations using the ISA of programmable memory controllers [37, 12].

Appendix B Effect on Real-World Kernels

We present a comprehensive evaluation to provide insights into PULSAR’s performance benefits on nine real-world kernels over a real CPU, a GPU and state-of-the-art commodity DRAM-based PuM techniques.

B.1 Experimental Methodology

Experimental Setup. We evaluate PULSAR using a real end-to-end system that consists of two components: 1) a contemporary computer that hosts the workloads we evaluate (host machine) and 2) an FPGA board that implements DRAM Bender [77] and is connected to the host machine through a PCIe bus. We extend the DRAM Bender C++ API to support the tight scheduling of DRAM commands needed to perform PULSAR (i.e., Multi/RowInit, MAJ3, MAJ5, and MAJ7) and FracDRAM (i.e., RowClone and MAJ3) operations.
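As a purely functional illustration of the MAJ3, MAJ5, and MAJ7 operations named above, the Python sketch below computes a column-wise majority over an odd number of rows, with each row modeled as an integer bit vector. It captures only the logical result, not the analog behavior of the DRAM array or the DRAM Bender API; the helper names `maj`, `and2`, and `or2` are hypothetical. The AND and OR helpers use the well-known identity that a three-input majority with one all-zeros or all-ones operand reduces to AND or OR.

```python
def maj(rows, width):
    """Column-wise majority of an odd number of bit-vector rows (ints)."""
    if len(rows) % 2 == 0:
        raise ValueError("majority needs an odd number of rows")
    threshold = len(rows) // 2 + 1
    result = 0
    for col in range(width):
        ones = sum((row >> col) & 1 for row in rows)
        if ones >= threshold:
            result |= 1 << col
    return result


def and2(a, b, width):
    """AND as MAJ3 with an all-zeros third operand."""
    return maj([a, b, 0], width)


def or2(a, b, width):
    """OR as MAJ3 with an all-ones third operand."""
    return maj([a, b, (1 << width) - 1], width)


maj3_example = maj([0b1100, 0b1010, 0b0110], width=4)
```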

Algorithm for evaluating PULSAR/FracDRAM. We evaluate PULSAR and FracDRAM in three steps: First, the host machine computes the input operands’ negated values, and both the original and negated data are then stored in the FPGA board’s DRAM module in a vertical data layout (§2.4). Second, we create a DRAM Bender program that implements the workload we test using PULSAR’s new computation primitives (e.g., MAJ5), and we offload the program to the FPGA board to perform PuM operations. Third, the DRAM module performs PuM operations, and the results of the PuM operations are read back from the DRAM module to the application running on the host machine. We repeat this process for each workload, capturing the execution time of each workload by taking the PCIe latency into account.
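The first step can be sketched in Python as follows. This is a hedged illustration of a generic bit-serial vertical layout, in which row i holds bit i of every operand and operand j occupies column j; the exact layout the paper uses is defined in §2.4, and the function names `to_vertical`, `negate_rows`, and `from_vertical` are hypothetical.

```python
def to_vertical(values, bits):
    """Transpose operands so that row i holds bit i of every value."""
    rows = []
    for i in range(bits):
        row = 0
        for column, value in enumerate(values):
            row |= ((value >> i) & 1) << column
        rows.append(row)
    return rows


def negate_rows(rows, n_columns):
    """Bitwise complement of each row, restricted to the used columns."""
    mask = (1 << n_columns) - 1
    return [~row & mask for row in rows]


def from_vertical(rows, n_columns):
    """Inverse of to_vertical: rebuild the operands from the bit rows."""
    values = []
    for column in range(n_columns):
        value = 0
        for i, row in enumerate(rows):
            value |= ((row >> column) & 1) << i
        values.append(value)
    return values


operands = [5, 3, 6]
original_rows = to_vertical(operands, bits=3)
negated_rows = negate_rows(original_rows, n_columns=len(operands))
```

Storing both `original_rows` and `negated_rows` mirrors the step in which the host writes the original and negated data to the DRAM module before any PuM operation runs.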

B.2 Results

We analyze the performance benefits of PULSAR on real-world applications and compare PULSAR against CPU, GPU, and FracDRAM. We use a real multicore CPU (Intel Skylake [39]) and optimize our workloads to leverage AVX-512 instructions [22]. We measure performance on a real high-end GPU (Nvidia Titan V [76]). We capture GPU kernel execution time that excludes data initialization/transfer time. We report the average of five runs for each CPU/GPU data point, each with a warmup phase, to avoid cold cache effects. We capture the execution time of each workload on CPU and GPU.

We conduct evaluations on nine real-world applications that heavily rely on the evaluated microbenchmarks. We explain these applications under five categories. 1) Convolutional Neural Networks (CNNs): We use XNOR-NET implementations [92] of VGG-13, VGG-16, and LeNet-5 provided by [34], which perform convolutions using a series of bitcount, addition, and XNOR operations. We evaluate the inferences of VGG-13 and VGG-16 using the CIFAR-10 [56] dataset and LeNet-5 using the MNIST [17] dataset. 2) k-Nearest Neighbor Classifier (kNN): We apply the kNN classifier to solve the handwritten digits recognition problem using the MNIST dataset. We implement a quantized 8-bit version of the Euclidean distance algorithm entirely in DRAM using PULSAR. 3) Database: We evaluate two workloads: BitMap Indices (BMI) [14] and BitWeaving (BW) [62]. BMI provides high space efficiency and high performance for many queries (e.g., join and scan) in databases compared to traditional B-tree indices. Our BMI workload runs the query: “How many users were active every day for the past month?” on a database that tracks the login activities of 8 million users. Our BW workload evaluates a simple table scan query: select count(*) from T where c1 <= val <= c2. 4) Graph Processing: We evaluate two graph processing workloads: k-Clique Star (KCS) [10, 11] and Triangle Counting (TC) [10, 11]. KCS aims to find all k-clique stars in a given graph. A k-clique star consists of a k-clique (a set of k fully connected vertices) and additional vertices connected to all k-clique members. Using a set-centric approach [10], we represent vertices and k-cliques in the form of bit-vectors encoding their adjacency to others, which enables us to perform this operation with a set of bitwise operations, similar to [83, 10]. TC involves calculating the total number of 3-cycles (triangles) in a graph, and it can be done with a set of bitwise operations, similar to [10]. 5) Image Processing: We evaluate image segmentation (IMS), an image processing application that aims to break an image into multiple regions depending on a given set of colors. In IMS, each image consists of 800×600 pixels with four colors. We adapt our implementation from prior PuM works [83, 24].
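To make the bitwise structure of two of these workloads concrete, the Python sketch below expresses the BMI query and triangle counting with AND and bitcount over integer bit vectors. It is an illustrative CPU-side model with hypothetical function names and toy inputs, not the evaluated implementation: the BMI function assumes one bitmap per day with bit u set when user u logged in, and the triangle counter assumes an undirected graph given as adjacency bit vectors.

```python
from functools import reduce


def users_active_every_day(daily_bitmaps):
    """Count users whose bit is set in every daily login bitmap."""
    active_all_days = reduce(lambda a, b: a & b, daily_bitmaps)
    return bin(active_all_days).count("1")


def count_triangles(adjacency):
    """Count triangles in an undirected graph.

    adjacency[u] is an int whose bit v is set when u and v share an edge.
    For each edge (u, v), the common neighbors are adjacency[u] & adjacency[v];
    every triangle is found once per edge, hence the division by three.
    """
    total = 0
    for u, row in enumerate(adjacency):
        for v in range(u + 1, len(adjacency)):
            if (row >> v) & 1:
                total += bin(row & adjacency[v]).count("1")
    return total // 3


# Toy inputs: three days of logins for four users, and a four-vertex graph
# with one triangle (0, 1, 2) plus an extra edge (0, 3).
toy_logins = [0b1011, 0b1110, 0b1010]
toy_graph = [0b1110, 0b0101, 0b0011, 0b0001]
```

In both functions the AND is the bulk-bitwise step that maps naturally onto in-DRAM operations, while the bitcount is the reduction over its result.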

We evaluate two configurations each of PULSAR and FracDRAM, which use 1 (PULSAR:1 and FracDRAM:1) and 16 (PULSAR:16 and FracDRAM:16) of the banks in the DRAM module. The 16-bank configurations leverage bank-level parallelism to maximize DRAM throughput [52, 54, 55, 57, 71].

Figure 20: Normalized speedup of real-world applications. PULSAR:X and FracDRAM:X use X DRAM banks for computation.

Fig. 20 shows the performance of PULSAR and our baseline configurations for each application, normalized to that of the multicore CPU. We make three key observations. First, PULSAR:16 greatly outperforms the CPU and GPU baselines, providing 43.38× and 2.65× the performance of the CPU and GPU, respectively, on average across all nine applications. Second, PULSAR:16 (PULSAR:1) provides 1.59× (1.55×) the performance of FracDRAM:16 (FracDRAM:1), on average, across all nine applications, with a maximum of 2.01× (1.90×) the performance of FracDRAM:16 (FracDRAM:1) for the BW application. Third, even with a single DRAM bank (i.e., PULSAR:1), PULSAR always outperforms the CPU baseline, providing 2.71× the performance of the CPU on average across all applications. This speedup is a direct result of leveraging the high in-DRAM bandwidth in PULSAR to avoid the memory bottleneck in the CPU caused by the large amounts of intermediate data generated in such applications. We conclude that PULSAR is an effective and efficient off-the-shelf DRAM-based technique to accelerate many commonly-used real-world applications.