Optimal Almost-Balanced Sequences

Authors: Daniella Bar-Lev, Adir Kobovich, Orian Leitersdorf, Eitan Yaakobi

Abstract: This paper presents a novel approach to address the constrained coding challenge of generating almost-balanced sequences. While strictly balanced sequences have been well studied in the past, the problem of designing efficient algorithms with small redundancy, preferably constant or even a single bit, for almost-balanced sequences has remained unsolved. A sequence is $\varepsilon(n)$-almost balanced if its Hamming weight lies within $0.5n\pm \varepsilon(n)$. It is known that for any algorithm with a constant number of bits, $\varepsilon(n)$ has to be in the order of $\Theta(\sqrt{n})$, with $O(n)$ average time complexity. However, prior solutions with a single redundancy bit required $\varepsilon(n)$ to be a linear shift from $n/2$. Employing an iterative method and arithmetic coding, our emphasis lies in constructing almost-balanced codes with a single redundancy bit. Notably, our method surpasses previous approaches by achieving the optimal balanced order of $\Theta(\sqrt{n})$. Additionally, we extend our method to the non-binary case considering $q$-ary almost polarity-balanced sequences for even $q$, and almost symbol-balanced for $q=4$. Our work marks the first asymptotically optimal solutions for almost-balanced sequences, for both binary and non-binary alphabets.

Submitted 14 May, 2024; originally announced May 2024.

Comments: Accepted to The IEEE International Symposium on Information Theory (ISIT) 2024
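Illustration (not taken from the paper): the $\varepsilon(n)$-almost-balanced condition stated in the abstract can be checked directly from the Hamming weight. The sketch below only tests the constraint; it is not the paper's encoder. The function names, the inclusive bounds, and the constant `c` in $\varepsilon(n) = c\sqrt{n}$ are assumptions made for this example.

```python
import math


def hamming_weight(bits):
    """Number of ones in a binary sequence given as a list of 0/1 ints."""
    return sum(bits)


def is_almost_balanced(bits, eps):
    """True if the Hamming weight lies within n/2 +/- eps (bounds assumed inclusive)."""
    n = len(bits)
    return abs(hamming_weight(bits) - n / 2) <= eps


def sqrt_eps(n, c=1.0):
    """Hypothetical tolerance eps(n) = c * sqrt(n); the constant c is an assumption."""
    return c * math.sqrt(n)


if __name__ == "__main__":
    word = [1, 1, 1, 0]
    # Weight 3, n/2 = 2, tolerance sqrt(4) = 2, so the word satisfies the constraint.
    print(is_almost_balanced(word, sqrt_eps(len(word))))
```

With `eps = 0` the check reduces to the strictly balanced case that the abstract contrasts against.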

arXiv:2310.06989

doi 10.1109/TCAD.2023.3322351

TDPP: Two-Dimensional Permutation-Based Protection of Memristive Deep Neural Networks

Authors: Minhui Zou, Zhenhua Zhu, Tzofnat Greenberg-Toledo, Orian Leitersdorf, Jiang Li, Junlong Zhou, Yu Wang, Nan Du, Shahar Kvatinsky

Abstract: The execution of deep neural network (DNN) algorithms suffers from significant bottlenecks due to the separation of the processing and memory units in traditional computer systems. Emerging memristive computing systems introduce an in situ approach that overcomes this bottleneck. The non-volatility of memristive devices, however, may expose the DNN weights stored in memristive crossbars to potential theft attacks. Therefore, this paper proposes a two-dimensional permutation-based protection (TDPP) method that thwarts such attacks. We first introduce the underlying concept that motivates the TDPP method: permuting both the rows and columns of the DNN weight matrices. This contrasts with previous methods, which focused solely on permuting a single dimension of the weight matrices, either the rows or columns. While it's possible for an adversary to access the matrix values, the original arrangement of rows and columns in the matrices remains concealed. As a result, the extracted DNN model from the accessed matrix values would fail to operate correctly. We consider two different memristive computing systems (designed for layer-by-layer and layer-parallel processing, respectively) and demonstrate the design of the TDPP method that could be embedded into the two systems. Finally, we present a security analysis. Our experiments demonstrate that TDPP can achieve comparable effectiveness to prior approaches, with a high level of security when appropriately parameterized. In addition, TDPP is more scalable than previous methods and results in reduced area and power overheads. The area and power are reduced by, respectively, 1218$\times$ and 2815$\times$ for the layer-by-layer system and by 178$\times$ and 203$\times$ for the layer-parallel system compared to prior works.

Submitted 10 October, 2023; originally announced October 2023.

Comments: 14 pages, 11 figures
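Illustration (not taken from the paper): the core idea named in the abstract, permuting both the rows and the columns of a weight matrix, can be sketched in software as follows. This toy example only shows why a matrix-vector product is recoverable with both permutations and scrambled without them; it does not model the paper's hardware design or its security analysis. All function names and the key representation (index lists) are assumptions.

```python
def permute_matrix(w, row_perm, col_perm):
    """Stored matrix: stored[i][j] = w[row_perm[i]][col_perm[j]]."""
    return [[w[r][c] for c in col_perm] for r in row_perm]


def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def matvec_with_keys(stored, x, row_perm, col_perm):
    """Recover y = W x from the permuted matrix when both permutations are known."""
    x_perm = [x[c] for c in col_perm]   # reorder inputs to match the permuted columns
    y_perm = matvec(stored, x_perm)     # y_perm[i] equals y[row_perm[i]]
    y = [0] * len(y_perm)
    for i, r in enumerate(row_perm):
        y[r] = y_perm[i]                # undo the row permutation
    return y


if __name__ == "__main__":
    w = [[1, 2, 3], [4, 5, 6]]
    x = [1, 0, 2]
    row_perm, col_perm = [1, 0], [2, 0, 1]  # hypothetical secret keys
    stored = permute_matrix(w, row_perm, col_perm)
    print(matvec(w, x))                                   # reference result
    print(matvec_with_keys(stored, x, row_perm, col_perm))  # same result, using the keys
    print(matvec(stored, x))                              # using the stored matrix directly
```

An adversary who reads only `stored` sees every weight value but computes a different product, which mirrors the abstract's claim that the extracted model fails to operate correctly.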

arXiv:2308.14007

CUDA-PIM: End-to-End Integration of Digital Processing-in-Memory from High-Level C++ to Microarchitectural Design

Authors: Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: Digital processing-in-memory (PIM) architectures mitigate the memory wall problem by facilitating parallel bitwise operations directly within memory. Recent works have demonstrated their algorithmic potential for accelerating data-intensive applications; however, there remains a significant gap in the programming model and microarchitectural design. This is further exacerbated by the emerging model of partitions, which significantly complicates control and periphery. Therefore, inspired by NVIDIA CUDA, this paper provides an end-to-end architectural integration of digital memristive PIM from an abstract high-level C++ programming interface for vector operations to the low-level microarchitecture. We begin by proposing an efficient microarchitecture and instruction set architecture (ISA) that bridge the gap between the low-level control periphery and an abstraction of PIM parallelism into warps and threads. We subsequently propose a PIM compilation library that converts high-level C++ to ISA instructions, and a PIM driver that translates ISA instructions into PIM micro-operations. This drastically simplifies the development of PIM applications and enables PIM integration within larger existing C++ CPU/GPU programs for heterogeneous computing with significant ease. Lastly, we present an efficient GPU-accelerated simulator for the proposed PIM microarchitecture. Although slower than a theoretical PIM chip, this simulator provides an accessible platform for developers to start executing and debugging PIM algorithms. To validate our approach, we implement state-of-the-art matrix operations and FFT PIM-based algorithms as case studies. These examples demonstrate drastically simplified development without compromising performance, showing the potential and significance of CUDA-PIM.

Submitted 27 August, 2023; originally announced August 2023.

arXiv:2305.04122

ConvPIM: Evaluating Digital Processing-in-Memory through Convolutional Neural Network Acceleration

Authors: Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: Processing-in-memory (PIM) architectures are emerging to reduce data movement in data-intensive applications. These architectures seek to exploit the same physical devices for both information storage and logic, thereby dwarfing the required data transfer and utilizing the full internal memory bandwidth. Whereas analog PIM utilizes the inherent connectivity of crossbar arrays for approximate matrix-vector multiplication in the analog domain, digital PIM architectures enable bitwise logic operations with massive parallelism across columns of data within memory arrays. Several recent works have extended the computational capabilities of digital PIM architectures towards the full-precision (single-precision floating-point) acceleration of convolutional neural networks (CNNs); yet, they lack a comprehensive comparison to GPUs. In this paper, we examine the potential of digital PIM for CNN acceleration through an updated quantitative comparison with GPUs, supplemented with an analysis of the overall limitations of digital PIM. We begin by investigating the different PIM architectures from a theoretical perspective to understand the underlying performance limitations and improvements compared to state-of-the-art hardware. We then uncover the tradeoffs between the different strategies through a series of benchmarks ranging from memory-bound vectored arithmetic to CNN acceleration. We conclude with insights into the general performance of digital PIM architectures for different data-intensive applications.

Submitted 6 May, 2023; originally announced May 2023.

arXiv:2304.02336

doi 10.1016/j.memori.2023.100034

FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication

Authors: Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar Kvatinsky

Abstract: The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time complexity from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities as memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication - a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly-available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5-15x throughput and 4-13x energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.

Submitted 5 April, 2023; originally announced April 2023.

Journal ref: Memories - Materials, Devices, Circuits and Systems, Volume 4, 2023
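Illustration (not taken from the paper): the abstract relies on the convolution theorem to turn FFT into polynomial multiplication. The sketch below shows that reduction with a conventional sequential radix-2 FFT in O(n log n) time; it is not the paper's O(log n) in-memory algorithm. The function names are assumptions, and integer coefficients are assumed so that the result can be rounded.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is unnormalized."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_multiply(p, q):
    """Multiply integer-coefficient polynomials (lowest degree first) via the convolution theorem."""
    result_len = len(p) + len(q) - 1
    size = 1
    while size < result_len:
        size *= 2
    # Zero-pad both inputs to a common power-of-two length.
    fp = fft([complex(x) for x in p] + [0j] * (size - len(p)))
    fq = fft([complex(x) for x in q] + [0j] * (size - len(q)))
    # Pointwise product in the frequency domain equals convolution of coefficients.
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [round((v / size).real) for v in prod[:result_len]]


if __name__ == "__main__":
    # (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
    print(poly_multiply([1, 2], [3, 4]))
```

The three transforms dominate the cost, which is why a faster FFT translates directly into faster polynomial multiplication.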

arXiv:2304.01317

Universal Framework for Parametric Constrained Coding

Authors: Daniella Bar-Lev, Adir Kobovich, Orian Leitersdorf, Eitan Yaakobi

Abstract: Constrained coding is a fundamental field in coding theory that tackles efficient communication through constrained channels. While channels with fixed constraints have a general optimal solution, there is increasing demand for parametric constraints that are dependent on the message length. Several works have tackled such parametric constraints through iterative algorithms, yet they require complex constructions specific to each constraint to guarantee convergence through monotonic progression. In this paper, we propose a universal framework for tackling any parametric constrained-channel problem through a novel simple iterative algorithm. By reducing an execution of this iterative algorithm to an acyclic graph traversal, we prove a surprising result that guarantees convergence with efficient average time complexity even without requiring any monotonic progression. We demonstrate the effectiveness of this universal framework by applying it to a variety of both local and global channel constraints. We begin by exploring the local constraints involving illegal substrings of variable length, where the universal construction essentially iteratively replaces forbidden windows. We apply this local algorithm to the minimal periodicity, minimal Hamming weight, local almost-balanced Hamming weight and the previously-unsolved minimal palindrome constraints. We then continue by exploring global constraints, and demonstrate the effectiveness of the proposed construction on the repeat-free encoding, reverse-complement encoding, and the open problem of global almost-balanced encoding. For reverse-complement, we also tackle a previously-unsolved version of the constraint that addresses overlapping windows. Overall, the proposed framework generates state-of-the-art constructions with significant ease while also enabling the simultaneous integration of multiple constraints for the first time.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2302.08284

ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory

Authors: Marcel Khalifa, Barak Hoffer, Orian Leitersdorf, Robert Hanhan, Ben Perach, Leonid Yavits, Shahar Kvatinsky

Abstract: DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.

Submitted 5 November, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

arXiv:2206.15165

MatPIM: Accelerating Matrix Operations with Memristive Stateful Logic

Authors: Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: The emerging memristive Memory Processing Unit (mMPU) overcomes the memory wall through memristive devices that unite storage and logic for real processing-in-memory (PIM) systems. At the core of the mMPU is stateful logic, which is accelerated with memristive partitions to enable logic with massive inherent parallelism within crossbar arrays. This paper vastly accelerates the fundamental operations of matrix-vector multiplication and convolution in the mMPU, with either full-precision or binary elements. These proposed algorithms establish an efficient foundation for large-scale mMPU applications such as neural networks, image processing, and numerical methods. We overcome the inherent asymmetry limitation in the previous in-memory full-precision matrix-vector multiplication solutions by utilizing techniques from block matrix multiplication and reduction. We present the first fast in-memory binary matrix-vector multiplication algorithm by utilizing memristive partitions with a tree-based popcount reduction (39x faster than previous work). For convolution, we present a novel in-memory input-parallel concept which we utilize for a full-precision algorithm that overcomes the asymmetry limitation in convolution, while also improving latency (2x faster than previous work), and the first fast binary algorithm (12x faster than previous work).

Submitted 30 June, 2022; originally announced June 2022.

arXiv:2206.04218

AritPIM: High-Throughput In-Memory Arithmetic

Authors: Orian Leitersdorf, Dean Leitersdorf, Jonathan Gal, Mor Dahan, Ronny Ronen, Shahar Kvatinsky

Abstract: Digital processing-in-memory (PIM) architectures are rapidly emerging to overcome the memory-wall bottleneck by integrating logic within memory elements. Such architectures provide vast computational power within the memory itself in the form of parallel bitwise logic operations. We develop novel algorithmic techniques for PIM that, combined with new perspectives on computer arithmetic, extend this bitwise parallelism to the four fundamental arithmetic operations (addition, subtraction, multiplication, and division), for both fixed-point and floating-point numbers, and using both bit-serial and bit-parallel approaches. We propose a state-of-the-art suite of arithmetic algorithms, demonstrating the first algorithm in the literature of digital PIM for a majority of cases - including cases previously considered impossible for digital PIM, such as floating-point addition. Through a case study on memristive PIM, we compare the proposed algorithms to an NVIDIA RTX 3070 GPU and demonstrate significant throughput and energy improvements.

Submitted 15 April, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: Accepted to IEEE Transactions on Emerging Topics in Computing (TETC)

arXiv:2206.04200

PartitionPIM: Practical Memristive Partitions for Fast Processing-in-Memory

Authors: Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: Digital memristive processing-in-memory overcomes the memory wall through a fundamental storage device capable of stateful logic within crossbar arrays. Dynamically dividing the crossbar arrays by adding memristive partitions further increases parallelism, thereby overcoming an inherent trade-off in memristive processing-in-memory. The algorithmic topology of partitions is highly unique, and was recently exploited to accelerate multiplication (11x with 32 partitions) and sorting (14x with 16 partitions). Yet, the physical implementation of memristive partitions, such as the peripheral decoders and the control message, has never been considered and may lead to vast impracticality. This paper overcomes that challenge with several novel techniques, presenting efficient practical designs of memristive partitions. We begin by formalizing the algorithmic properties of memristive partitions into serial, parallel, and semi-parallel operations. Peripheral overhead is addressed via a novel technique of half-gates that enables efficient decoding with negligible overhead. Control overhead is addressed by carefully reducing the operation set of memristive partitions, while resulting in negligible performance impact, by utilizing techniques such as shared indices and pattern generators. Ultimately, these efficient practical solutions, combined with the vast algorithmic potential, may revolutionize digital memristive processing-in-memory.

Submitted 8 June, 2022; originally announced June 2022.

arXiv:2205.13559

HashPIM: High-Throughput SHA-3 via Memristive Digital Processing-in-Memory

Authors: Batel Oved, Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: Recent research has sought to accelerate cryptographic hash functions as they are at the core of modern cryptography. Traditional designs, however, suffer from the von Neumann bottleneck that originates from the separation of processing and memory units. An emerging solution to overcome this bottleneck is processing-in-memory (PIM): performing logic within the same devices responsible for memory to eliminate data-transfer and simultaneously provide massive computational parallelism. In this paper, we seek to vastly accelerate the state-of-the-art SHA-3 cryptographic function using the memristive memory processing unit (mMPU), a general-purpose memristive PIM architecture. To that end, we propose a novel in-memory algorithm for variable rotation, and utilize an efficient mapping of the SHA-3 state vector for memristive crossbar arrays to efficiently exploit PIM parallelism. We demonstrate a massive energy efficiency of 1,422 Gbps/W, improving a state-of-the-art memristive SHA-3 accelerator (SHINE-2) by 4.6x.

Submitted 1 June, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted to International Conference on Modern Circuits and Systems Technologies (MOCAST) 2022

arXiv:2205.03911

Codes for Constrained Periodicity

Authors: Adir Kobovich, Orian Leitersdorf, Daniella Bar-Lev, Eitan Yaakobi

Abstract: Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows, yet have either only provided existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and with low redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows which contain periodicity until all the windows are valid. Intuitively, such algorithms should not converge as there is no monotonic progression; yet, we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we both provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.

Submitted 25 August, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

Comments: Accepted to The International Symposium on Information Theory and Its Applications (ISITA) 2022

arXiv:2109.09687

Making Memristive Processing-in-Memory Reliable

Authors: Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: Processing-in-memory (PIM) solutions vastly accelerate systems by reducing data transfer between computation and memory. Memristors possess a unique property that enables storage and logic within the same device, which is exploited in the memristive Memory Processing Unit (mMPU). The mMPU expands fundamental stateful logic techniques, such as IMPLY, MAGIC and FELIX, to high-throughput parallel logic and arithmetic operations within the memory. Unfortunately, memristive processing-in-memory is highly vulnerable to soft errors and this massive parallelism is not compatible with traditional reliability techniques, such as error-correcting-code (ECC). In this paper, we discuss reliability techniques that efficiently support the mMPU by utilizing the same principles as the mMPU computation. We detail ECC techniques that are based on the unique properties of the mMPU to efficiently utilize the massive parallelism. Furthermore, we present novel solutions for efficiently implementing triple modular redundancy (TMR). The short-term and long-term reliability of large-scale applications, such as neural-network acceleration, are evaluated. The analysis clearly demonstrates the importance of high-throughput reliability mechanisms for memristive processing-in-memory.

Submitted 20 September, 2021; originally announced September 2021.

Comments: Accepted to 28th International Conference on Electronics Circuits and Systems (ICECS) 2021

arXiv:2108.13378

MultPIM: Fast Stateful Multiplication for Processing-in-Memory

Authors: Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

Abstract: Processing-in-memory (PIM) seeks to eliminate computation/memory data transfer using devices that support both storage and logic. Stateful logic techniques such as IMPLY, MAGIC and FELIX can perform logic gates within memristive crossbar arrays with massive parallelism. Multiplication via stateful logic is an active field of research due to the wide implications. Recently, RIME has become the state-of-the-art algorithm for stateful single-row multiplication by using memristive partitions, reducing the latency of the previous state-of-the-art by 5.1x. In this paper, we begin by proposing novel partition-based computation techniques for broadcasting and shifting data. Then, we design an in-memory multiplication algorithm based on the carry-save add-shift (CSAS) technique. Finally, we develop a novel stateful full-adder that significantly improves the state-of-the-art (FELIX) design. These contributions constitute MultPIM, a multiplier that reduces state-of-the-art time complexity from quadratic to linear-log. For 32-bit numbers, MultPIM improves latency by an additional 4.2x over RIME, while even slightly reducing area overhead. Furthermore, we optimize MultPIM for full-precision matrix-vector multiplication and improve latency by 25.5x over FloatPIM matrix-vector multiplication.

Submitted 20 September, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: Accepted to IEEE Transactions On Circuits And Systems-II (TCAS-II)

arXiv:2107.10308

The Bitlet Model: A Parameterized Analytical Model to Compare PIM and CPU Systems

Authors: Ronny Ronen, Adi Eliahu, Orian Leitersdorf, Natan Peled, Kunal Korgaonkar, Anupam Chattopadhyay, Ben Perach, Shahar Kvatinsky

Abstract: Nowadays, data-intensive applications are gaining popularity and, together with this trend, processing-in-memory (PIM)-based systems are being given more attention and have become more relevant. This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing. The tool uncovers interesting tradeoffs between, mainly, the PIM computation complexity (cycles required to perform a computation through PIM), the amount of memory used for PIM, the system memory bandwidth, and the data transfer size. Despite its simplicity, the model reveals new insights when applied to real-life examples. The model is demonstrated for several synthetic examples and then applied to explore the influence of different parameters on two systems - IMAGING and FloatPIM. Based on the demonstrations, insights about PIM and its combination with CPU are drawn.

Submitted 21 July, 2021; originally announced July 2021.

Comments: Accepted to ACM JETC

arXiv:2105.04212

Efficient Error-Correcting-Code Mechanism for High-Throughput Memristive Processing-in-Memory

Authors: Orian Leitersdorf, Ben Perach, Ronny Ronen, Shahar Kvatinsky

Abstract: Inefficient data transfer between computation and memory inspired emerging processing-in-memory (PIM) technologies. Many PIM solutions enable storage and processing using memristors in a crossbar-array structure, with techniques such as memristor-aided logic (MAGIC) used for computation. This approach provides highly-paralleled logic computation with minimal data movement. However, memristors are vulnerable to soft errors and standard error-correcting-code (ECC) techniques are difficult to implement without moving data outside the memory. We propose a novel technique for efficient ECC implementation along diagonals to support reliable computation inside the memory without explicitly reading the data. Our evaluation demonstrates an improvement of over eight orders of magnitude in reliability (mean time to failure) for an increase of about 26% in computation latency.

Submitted 10 May, 2021; originally announced May 2021.

Comments: Accepted to 58th Design Automation Conference (DAC) 2021

Showing 1–16 of 16 results for author: Leitersdorf, O