Search | arXiv e-print repository

Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory

Authors: Jisung Park, Roknoddin Azizi, Geraldo F. Oliveira, Mohammad Sadrosadati, Rakesh Nadig, David Novo, Juan Gómez-Luna, Myungsuk Kim, Onur Mutlu

Abstract: Bulk bitwise operations, i.e., bitwise operations on large bit vectors, are prevalent in a wide range of important application domains, including databases, graph processing, genome analysis, cryptography, and hyper-dimensional computing. In conventional systems, the performance and energy efficiency of bulk bitwise operations are bottlenecked by data movement between the compute units and the memory hierarchy. In-flash processing (i.e., processing data inside NAND flash chips) has a high potential to accelerate bulk bitwise operations by fundamentally reducing data movement through the entire memory hierarchy. We identify two key limitations of the state-of-the-art in-flash processing technique for bulk bitwise operations: (i) it falls short of maximally exploiting the bit-level parallelism of bulk bitwise operations; (ii) it is unreliable because it does not consider the highly error-prone nature of NAND flash memory. We propose Flash-Cosmos (Flash Computation with One-Shot Multi-Operand Sensing), a new in-flash processing technique that significantly increases the performance and energy efficiency of bulk bitwise operations while providing high reliability. Flash-Cosmos introduces two key mechanisms that can be easily supported in modern NAND flash chips: (i) Multi-Wordline Sensing (MWS), which enables bulk bitwise operations on a large number of operands with a single sensing operation, and (ii) Enhanced SLC-mode Programming (ESP), which enables reliable computation inside NAND flash memory.
We demonstrate the feasibility of performing bulk bitwise operations with high reliability in Flash-Cosmos by testing 160 real 3D NAND flash chips. Our evaluation shows that Flash-Cosmos improves average performance and energy efficiency by 3.5x/32x and 3.3x/95x, respectively, over the state-of-the-art in-flash/outside-storage processing techniques across three real-world applications.

Submitted 12 September, 2022; originally announced September 2022.

Comments: To appear in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022

arXiv:2102.05981

doi 10.1109/HPCA51647.2021.00037

BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows

Authors: Abdullah Giray Yağlıkçı, Minesh Patel, Jeremie S. Kim, Roknoddin Azizi, Ataberk Olgun, Lois Orosa, Hasan Hassan, Jisung Park, Konstantinos Kanellopoulos, Taha Shahroodi, Saugata Ghose, Onur Mutlu

Abstract: Aggressive memory density scaling causes modern DRAM devices to suffer from RowHammer, a phenomenon where rapidly activating a DRAM row can cause bit-flips in physically-nearby rows. Recent studies demonstrate that modern DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older chips. Many works show that attackers can exploit RowHammer bit-flips to reliably mount system-level attacks to escalate privilege and leak private data. Therefore, it is critical to ensure RowHammer-safe operation on all DRAM-based systems. Unfortunately, state-of-the-art RowHammer mitigation mechanisms face two major challenges. First, they incur increasingly higher performance and/or area overheads when applied to more vulnerable DRAM chips. Second, they require either proprietary information about or modifications to the DRAM chip design. In this paper, we show that it is possible to efficiently and scalably prevent RowHammer bit-flips without knowledge of or modification to DRAM internals. We introduce BlockHammer, a low-cost, effective, and easy-to-adopt RowHammer mitigation mechanism that overcomes the two key challenges by selectively throttling memory accesses that could otherwise cause RowHammer bit-flips. The key idea of BlockHammer is to (1) track row activation rates using area-efficient Bloom filters and (2) use the tracking data to ensure that no row is ever activated rapidly enough to induce RowHammer bit-flips.
By doing so, BlockHammer (1) makes it impossible for a RowHammer bit-flip to occur and (2) greatly reduces a RowHammer attack's impact on the performance of co-running benign applications. Compared to state-of-the-art RowHammer mitigation mechanisms, BlockHammer provides competitive performance and energy when the system is not under a RowHammer attack and significantly better performance and energy when the system is under attack.
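
The two-step idea in the abstract above (track activation rates in a Bloom filter, then throttle rows that are activated too often) can be illustrated with a small sketch. This is not the paper's design: the class names, the counter count, the number of hash functions, and the threshold are all hypothetical, and the sketch omits the periodic reset or time windowing that any real rate tracker would need.

```python
import hashlib


class CountingBloomFilter:
    """Approximate per-row activation counter (hypothetical sizes)."""

    def __init__(self, num_counters=1024, num_hashes=4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, row_id):
        # Derive several counter positions from the row id; a set avoids
        # incrementing the same counter twice for one activation.
        positions = set()
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{row_id}".encode()).digest()
            positions.add(int.from_bytes(digest[:8], "big") % len(self.counters))
        return positions

    def insert(self, row_id):
        for i in self._indexes(row_id):
            self.counters[i] += 1

    def estimate(self, row_id):
        # The minimum over the row's counters never underestimates its count.
        return min(self.counters[i] for i in self._indexes(row_id))


class ActivationThrottler:
    """Delays activations of rows whose estimated count reaches a threshold."""

    def __init__(self, blacklist_threshold):
        self.tracker = CountingBloomFilter()
        self.blacklist_threshold = blacklist_threshold

    def request_activation(self, row_id):
        # Return True if the activation may proceed, False if it is throttled.
        if self.tracker.estimate(row_id) >= self.blacklist_threshold:
            return False
        self.tracker.insert(row_id)
        return True
```

Because hash collisions can only inflate a counter, this kind of tracker may throttle a benign row by mistake but never misses a row that truly exceeded the threshold, which is the property a safety mechanism needs.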

Submitted 29 July, 2022; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: A shorter version of this work is to appear at the 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27), 2021

arXiv:2005.13121

Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques

Authors: Jeremie S. Kim, Minesh Patel, A. Giray Yaglikci, Hasan Hassan, Roknoddin Azizi, Lois Orosa, Onur Mutlu

Abstract: In order to shed more light on how RowHammer affects modern and future devices at the circuit level, we first present an experimental characterization of RowHammer on 1580 DRAM chips (408x DDR3, 652x DDR4, and 520x LPDDR4) from 300 DRAM modules (60x DDR3, 110x DDR4, and 130x LPDDR4) with RowHammer protection mechanisms disabled, spanning multiple different technology nodes from across each of the three major DRAM manufacturers. Our studies definitively show that newer DRAM chips are more vulnerable to RowHammer: as device feature size reduces, the number of activations needed to induce a RowHammer bit flip also reduces, to as few as 9.6k (4.8k to two rows each) in the most vulnerable chip we tested. We evaluate five state-of-the-art RowHammer mitigation mechanisms using cycle-accurate simulation in the context of real data taken from our chips to study how the mitigation mechanisms scale with chip vulnerability. We find that existing mechanisms either are not scalable or suffer from prohibitively large performance overheads in projected future devices given our observed trends of RowHammer vulnerability. Thus, it is critical to research more effective solutions to RowHammer.

Submitted 29 May, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

arXiv:1910.10776

doi 10.1145/3352460.3358286

SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations

Authors: Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

Abstract: Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruction overhead and expensive pointer-chasing operations to discover the positions of the non-zero elements. In this paper, we identify the discovery of the positions (i.e., indexing) of non-zero elements as a key bottleneck in sparse matrix-based workloads, which greatly reduces the benefits of compression. We propose SMASH, a hardware-software cooperative mechanism that enables highly efficient indexing and storage of sparse matrices. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit sparsity in data. To this end, we devise a novel software encoding based on a hierarchy of bitmaps. This encoding can be used to efficiently compress any sparse matrix, regardless of the extent and structure of sparsity. At the same time, the bitmap encoding can be directly interpreted by the hardware. We design a lightweight hardware unit, the Bitmap Management Unit (BMU), that buffers and scans the bitmap hierarchy to perform highly efficient indexing of sparse matrices. SMASH exposes an expressive and rich ISA to communicate with the BMU, which enables its use in accelerating any sparse matrix computation.
We demonstrate the benefits of SMASH on four use cases that include sparse matrix kernels and graph analytics applications.
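
To make the phrase "hierarchy of bitmaps" in the abstract above concrete, here is a minimal two-level sketch in software. It is an illustration only, not the SMASH encoding itself: the function names, the flattened one-dimensional input, and the `block_size` and `group_size` parameters are assumptions chosen for the example.

```python
def build_bitmap_hierarchy(values, block_size=2, group_size=4):
    """Compress a flattened matrix into two bitmap levels plus non-zero blocks.

    level0 has one bit per block of `block_size` elements (1 = block has a
    non-zero). level1 has one bit per `group_size` level0 bits (1 = at least
    one of those bits is set). Only blocks with a set level0 bit are stored.
    """
    level0 = [
        int(any(v != 0 for v in values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]
    level1 = [
        int(any(level0[i:i + group_size]))
        for i in range(0, len(level0), group_size)
    ]
    stored_blocks = [
        values[i * block_size:(i + 1) * block_size]
        for i, bit in enumerate(level0)
        if bit
    ]
    return level1, level0, stored_blocks


def nonzero_block_indices(level1, level0, group_size=4):
    """Yield indices of non-zero blocks, skipping whole groups that are empty."""
    for group, group_bit in enumerate(level1):
        if not group_bit:
            continue  # one zero bit at the top level skips group_size blocks
        start = group * group_size
        for block in range(start, min(start + group_size, len(level0))):
            if level0[block]:
                yield block
```

For the 16-element input `[0]*8 + [5, 0, 0, 0, 0, 7, 0, 0]`, the first half of the array is skipped after inspecting a single top-level bit, and only two blocks are stored.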

Submitted 23 October, 2019; originally announced October 2019.

arXiv:1910.05340

EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM

Authors: Skanda Koppula, Lois Orosa, Abdullah Giray Yağlıkçı, Roknoddin Azizi, Taha Shahroodi, Konstantinos Kanellopoulos, Onur Mutlu

Abstract: The effectiveness of deep neural networks (DNN) in vision, speech, and language processing has prompted a tremendous demand for energy-efficient high-performance DNN inference systems. Due to the increasing memory intensity of most DNN workloads, main memory can dominate the system's energy consumption and stall time. One effective way to reduce the energy consumption and increase the performance of DNN inference systems is by using approximate memory, which operates with reduced supply voltage and reduced access latency parameters that violate standard specifications. Using approximate memory reduces reliability, leading to higher bit error rates. Fortunately, neural networks have an intrinsic capacity to tolerate increased bit errors. This can enable energy-efficient and high-performance neural network inference using approximate DRAM devices. Based on this observation, we propose EDEN, a general framework that reduces DNN energy consumption and DNN evaluation latency by using approximate DRAM devices, while strictly meeting a user-specified target DNN accuracy. EDEN relies on two key ideas: 1) retraining the DNN for a target approximate DRAM device to increase the DNN's error tolerance, and 2) efficient mapping of the error tolerance of each individual DNN data type to a corresponding approximate DRAM partition in a way that meets the user-specified DNN accuracy requirements. We evaluate EDEN on multi-core CPUs, GPUs, and DNN accelerators with error models obtained from real approximate DRAM devices.
For a target accuracy within 1% of the original DNN, our results show that EDEN enables 1) an average DRAM energy reduction of 21%, 37%, 31%, and 32% in CPU, GPU, and two DNN accelerator architectures, respectively, across a variety of DNNs, and 2) an average (maximum) speedup of 8% (17%) and 2.7% (5.5%) in CPU and GPU architectures, respectively, when evaluating latency-bound DNNs.

Submitted 11 October, 2019; originally announced October 2019.

Comments: This work is to appear at MICRO 2019

arXiv:1610.01499

Nonautonomous Riccati difference equation with real $k$-periodic ($k\geq 1$) coefficients

Authors: Raouf Azizi

Abstract: We study the non-autonomous Riccati difference equation \[x_{n+1}=\frac{a_nx_n+b_n}{c_nx_n+d_n}, \ n=0,1,2,\cdots\] where $(a_n)_{n\geq0}, \ (b_n)_{n\geq0}, \ (c_n)_{n\geq0}, \ \text{and} \ (d_n)_{n\geq0}$ are $k$-periodic sequences, $k\geq 1$, with initial value $x_0 \in \mathbb{R}$. More precisely, we give a detailed analysis of the forbidden set and the character of the solutions.
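
The recurrence in the abstract above can be iterated directly, which also shows what membership in the forbidden set means in practice: the orbit reaches a step where the denominator $c_nx_n+d_n$ vanishes. The sketch below is only an illustration with exact rational arithmetic; the function name `riccati_orbit` and the sample coefficients are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def riccati_orbit(x0, coeffs, steps):
    """Iterate x_{n+1} = (a_n x_n + b_n) / (c_n x_n + d_n).

    `coeffs` is a list of k tuples (a, b, c, d); the coefficients repeat with
    period k. Returns the list [x_0, ..., x_steps], or None if a denominator
    becomes zero within `steps` iterations (x0 lies in the forbidden set).
    """
    x = Fraction(x0)
    orbit = [x]
    k = len(coeffs)
    for n in range(steps):
        a, b, c, d = coeffs[n % k]
        denominator = c * x + d
        if denominator == 0:
            return None  # the next term is undefined
        x = (a * x + b) / denominator
        orbit.append(x)
    return orbit
```

With the 2-periodic coefficients `[(1, 1, 0, 1), (0, 1, 1, 0)]`, the map alternates between $x \mapsto x+1$ and $x \mapsto 1/x$. Starting from $x_0=1$ gives $1, 2, 1/2, 3/2, 2/3$, while $x_0=-1$ reaches $x_1=0$ and the next step is undefined, so $-1$ belongs to the forbidden set of this example.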

Submitted 4 September, 2016; originally announced October 2016.

arXiv:1506.00099

A Novel Energy Aware Node Clustering Algorithm for Wireless Sensor Networks Using a Modified Artificial Fish Swarm Algorithm

Authors: Reza Azizi, Hasan Sedghi, Hamid Shoja, Alireza Sepas-Moghaddam

Abstract: Clustering problems are considered among the prominent challenges in statistics and computational science. Clustering of nodes in wireless sensor networks, which is used to prolong network lifetime, is one of the more difficult clustering tasks. To cluster the nodes, a number of nodes are designated as cluster heads and each remaining node joins one of these heads based on some criterion, e.g., Euclidean distance. Different approaches have been proposed for this process so far, including swarm and evolutionary algorithms. In this study, a novel algorithm based on the Artificial Fish Swarm Algorithm (AFSA) is proposed for the clustering procedure. In the proposed method, the performance of the standard AFSA is improved by increasing the balance between local and global searches. Furthermore, a new mechanism has been added to the base algorithm to improve convergence speed in clustering problems. The performance of the proposed technique is compared with a number of state-of-the-art techniques in this field, and the outcomes indicate the superiority of the proposed technique.

Submitted 30 May, 2015; originally announced June 2015.

Comments: 13 pages, 5 figures, 2 tables, International Journal of Computer Networks & Communications (IJCNC) Vol.7, No.3, May 2015

arXiv:1405.4138

Empirical Study of Artificial Fish Swarm Algorithm

Authors: Reza Azizi

Abstract: The artificial fish swarm algorithm (AFSA) is a swarm intelligence optimization algorithm based on population and stochastic search. To achieve acceptable results, many parameters need to be adjusted in AFSA. Among these parameters, visual and step are very significant because artificial fish basically move based on these parameters. In standard AFSA, these two parameters remain constant until the algorithm terminates. Large values of these parameters increase the algorithm's global search capability, while small values improve its local search ability. In this paper, we empirically study the performance of AFSA and test different approaches to balancing local and global exploration, based on the adaptive modification of visual and step during algorithm execution. The proposed approaches have been evaluated on four well-known benchmark functions. Experimental results show a considerable positive impact on the performance of AFSA.

Submitted 16 May, 2014; originally announced May 2014.

Journal ref: International Journal of Computing, Communications and Networking (IJCCN), Volume 3, No.1, Pages 01-07, March 2014

arXiv:1307.6976

doi 10.5121/ijwmn.2013.5302

Performance study and simulation of an anycast protocol for wireless mobile ad hoc networks

Authors: Reza Azizi

Abstract: This paper conducts a detailed simulation study of stateless anycast routing in a mobile wireless ad hoc network. The model covers all the fundamental aspects of such networks with a routing mechanism using a scheme of orientation-dependent inter-node communication links. The simulation system Winsim, which explicitly represents parallelism of events and processes in the network, is used. The purpose of these simulations is to investigate the effect of the nodes' maximum speed and of different TTL values on network performance under two different scenarios. The simulation study investigates five practically important performance metrics of a wireless mobile ad hoc network and shows the dependence of these metrics on the transmission radius, link availability, and maximal possible node speed.

Submitted 26 July, 2013; originally announced July 2013.

Comments: 15 pages, 20 figures, 1 table

Journal ref: International Journal of Wireless & Mobile Networks (IJWMN) Vol. 5, No. 3, June 2013

Showing 1–9 of 9 results for author: Azizi, R