-
Error-Free and Current-Driven Synthetic Antiferromagnetic Domain Wall Memory Enabled by Channel Meandering
Authors:
Pengxiang Zhang,
Wilfried Haensch,
Charudatta M. Phatak,
Supratik Guha
Abstract:
We propose a new type of multi-bit and energy-efficient magnetic memory based on current-driven, field-free, and highly controlled domain wall motion. A meandering domain wall channel with precisely interspersed pinning regions provides the multi-bit capability of a magnetic tunnel junction. The magnetic free layer of the memory device has perpendicular magnetic anisotropy and interfacial Dzyalosh…
▽ More
We propose a new type of multi-bit and energy-efficient magnetic memory based on current-driven, field-free, and highly controlled domain wall motion. A meandering domain wall channel with precisely interspersed pinning regions provides the multi-bit capability of a magnetic tunnel junction. The magnetic free layer of the memory device has perpendicular magnetic anisotropy and interfacial Dzyaloshinskii-Moriya interaction, so that spin-orbit torques induce efficient domain wall motion. Using micromagnetic simulations, we find two pinning mechanisms that lead to different cell designs: two-way switching and four-way switching. The memory cell design choices and the physics behind these pinning mechanisms are discussed in detail. Furthermore, we show that switching reliability and speed may be significantly improved by replacing the ferromagnetic free layer with a synthetic antiferromagnetic layer. Switching behavior and material choices will be discussed for the two implementations.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Multi-Function Multi-Way Analog Technology for Sustainable Machine Intelligence Computation
Authors:
Vassilis Kalantzis,
Mark S. Squillante,
Shashanka Ubaru,
Tayfun Gokmen,
Chai Wah Wu,
Anshul Gupta,
Haim Avron,
Tomasz Nowicki,
Malte Rasch,
Murat Onen,
Vanessa Lopez Marrero,
Effendi Leobandung,
Yasuteru Kohda,
Wilfried Haensch,
Lior Horesh
Abstract:
Numerical computation is essential to many areas of artificial intelligence (AI), whose computing demands continue to grow dramatically, yet their continued scaling is jeopardized by the slowdown in Moore's law. Multi-function multi-way analog (MFMWA) technology, a computing architecture comprising arrays of memristors supporting in-memory computation of matrix operations, can offer tremendous imp…
▽ More
Numerical computation is essential to many areas of artificial intelligence (AI), whose computing demands continue to grow dramatically, yet their continued scaling is jeopardized by the slowdown in Moore's law. Multi-function multi-way analog (MFMWA) technology, a computing architecture comprising arrays of memristors supporting in-memory computation of matrix operations, can offer tremendous improvements in computation and energy, but at the expense of inherent unpredictability and noise. We devise novel randomized algorithms tailored to MFMWA architectures that mitigate the detrimental impact of imperfect analog computations while realizing their potential benefits across various areas of AI, such as applications in computer vision. Through analysis, measurements from analog devices, and simulations of larger systems, we demonstrate orders of magnitude reduction in both computation and energy with accuracy similar to digital computers.
△ Less
Submitted 24 January, 2024;
originally announced January 2024.
-
LRMP: Layer Replication with Mixed Precision for Spatial In-memory DNN Accelerators
Authors:
Abinand Nallathambi,
Christin David Bose,
Wilfried Haensch,
Anand Raghunathan
Abstract:
In-memory computing (IMC) with non-volatile memories (NVMs) has emerged as a promising approach to address the rapidly growing computational demands of Deep Neural Networks (DNNs). Map** DNN layers spatially onto NVM-based IMC accelerators achieves high degrees of parallelism. However, two challenges that arise in this approach are the highly non-uniform distribution of layer processing times an…
▽ More
In-memory computing (IMC) with non-volatile memories (NVMs) has emerged as a promising approach to address the rapidly growing computational demands of Deep Neural Networks (DNNs). Map** DNN layers spatially onto NVM-based IMC accelerators achieves high degrees of parallelism. However, two challenges that arise in this approach are the highly non-uniform distribution of layer processing times and high area requirements. We propose LRMP, a method to jointly apply layer replication and mixed precision quantization to improve the performance of DNNs when mapped to area-constrained NVM-based IMC accelerators. LRMP uses a combination of reinforcement learning and integer linear programming to search the replication-quantization design space using a model that is closely informed by the target hardware architecture. Across five DNN benchmarks, LRMP achieves 2.8-9$\times$ latency and 11.8-19$\times$ throughput improvement at iso-accuracy.
△ Less
Submitted 5 December, 2023;
originally announced December 2023.
-
A Co-design view of Compute in-Memory with Non-Volatile Elements for Neural Networks
Authors:
Wilfried Haensch,
Anand Raghunathan,
Kaushik Roy,
Bhaswar Chakrabarti,
Charudatta M. Phatak,
Cheng Wang,
Supratik Guha
Abstract:
Deep Learning neural networks are pervasive, but traditional computer architectures are reaching the limits of being able to efficiently execute them for the large workloads of today. They are limited by the von Neumann bottleneck: the high cost in energy and latency incurred in moving data between memory and the compute engine. Today, special CMOS designs address this bottleneck. The next generat…
▽ More
Deep Learning neural networks are pervasive, but traditional computer architectures are reaching the limits of being able to efficiently execute them for the large workloads of today. They are limited by the von Neumann bottleneck: the high cost in energy and latency incurred in moving data between memory and the compute engine. Today, special CMOS designs address this bottleneck. The next generation of computing hardware will need to eliminate or dramatically mitigate this bottleneck. We discuss how compute-in-memory can play an important part in this development. Here, a non-volatile memory based cross-bar architecture forms the heart of an engine that uses an analog process to parallelize the matrix vector multiplication operation, repeatedly used in all neural network workloads. The cross-bar architecture, at times referred to as a neuromorphic approach, can be a key hardware element in future computing machines. In the first part of this review we take a co-design view of the design constraints and the demands it places on the new materials and memory devices that anchor the cross-bar architecture. In the second part, we review what is knows about the different new non-volatile memory materials and devices suited for compute in-memory, and discuss the outlook and challenges.
△ Less
Submitted 3 June, 2022;
originally announced June 2022.
-
Neural Network Training with Asymmetric Crosspoint Elements
Authors:
Murat Onen,
Tayfun Gokmen,
Teodor K. Todorov,
Tomasz Nowicki,
Jesus A. del Alamo,
John Rozen,
Wilfried Haensch,
Seyoung Kim
Abstract:
Analog crossbar arrays comprising programmable nonvolatile resistors are under intense investigation for acceleration of deep neural network training. However, the ubiquitous asymmetric conductance modulation of practical resistive devices critically degrades the classification performance of networks trained with conventional algorithms. Here, we describe and experimentally demonstrate an alterna…
▽ More
Analog crossbar arrays comprising programmable nonvolatile resistors are under intense investigation for acceleration of deep neural network training. However, the ubiquitous asymmetric conductance modulation of practical resistive devices critically degrades the classification performance of networks trained with conventional algorithms. Here, we describe and experimentally demonstrate an alternative fully-parallel training algorithm: Stochastic Hamiltonian Descent. Instead of conventionally tuning weights in the direction of the error function gradient, this method programs the network parameters to successfully minimize the total energy (Hamiltonian) of the system that incorporates the effects of device asymmetry. We provide critical intuition on why device asymmetry is fundamentally incompatible with conventional training algorithms and how the new approach exploits it as a useful feature instead. Our technique enables immediate realization of analog deep learning accelerators based on readily available device technologies.
△ Less
Submitted 31 January, 2022;
originally announced January 2022.
-
Algorithm for Training Neural Networks on Resistive Device Arrays
Authors:
Tayfun Gokmen,
Wilfried Haensch
Abstract:
Hardware architectures composed of resistive cross-point device arrays can provide significant power and speed benefits for deep neural network training workloads using stochastic gradient descent (SGD) and backpropagation (BP) algorithm. The training accuracy on this imminent analog hardware however strongly depends on the switching characteristics of the cross-point elements. One of the key requ…
▽ More
Hardware architectures composed of resistive cross-point device arrays can provide significant power and speed benefits for deep neural network training workloads using stochastic gradient descent (SGD) and backpropagation (BP) algorithm. The training accuracy on this imminent analog hardware however strongly depends on the switching characteristics of the cross-point elements. One of the key requirements is that these resistive devices must change conductance in a symmetrical fashion when subjected to positive or negative pulse stimuli. Here, we present a new training algorithm, so-called the "Tiki-Taka" algorithm, that eliminates this stringent symmetry requirement. We show that device asymmetry introduces an unintentional implicit cost term into the SGD algorithm, whereas in the "Tiki-Taka" algorithm a coupled dynamical system simultaneously minimizes the original objective function of the neural network and the unintentional cost term due to device asymmetry in a self-consistent fashion. We tested the validity of this new algorithm on a range of network architectures such as fully connected, convolutional and LSTM networks. Simulation results on these various networks show that whatever accuracy is achieved using the conventional SGD algorithm with symmetric (ideal) device switching characteristics the same accuracy is also achieved using the "Tiki-Taka" algorithm with non-symmetric (non-ideal) device switching characteristics. Moreover, all the operations performed on the arrays are still parallel and therefore the implementation cost of this new algorithm on array architectures is minimal; and it maintains the aforementioned power and speed benefits. These algorithmic improvements are crucial to relax the material specification and to realize technologically viable resistive crossbar arrays that outperform digital accelerators for similar training tasks.
△ Less
Submitted 17 September, 2019;
originally announced September 2019.
-
Training large-scale ANNs on simulated resistive crossbar arrays
Authors:
Malte J. Rasch,
Tayfun Gokmen,
Wilfried Haensch
Abstract:
Accelerating training of artificial neural networks (ANN) with analog resistive crossbar arrays is a promising idea. While the concept has been verified on very small ANNs and toy data sets (such as MNIST), more realistically sized ANNs and datasets have not yet been tackled. However, it is to be expected that device materials and hardware design constraints, such as noisy computations, finite num…
▽ More
Accelerating training of artificial neural networks (ANN) with analog resistive crossbar arrays is a promising idea. While the concept has been verified on very small ANNs and toy data sets (such as MNIST), more realistically sized ANNs and datasets have not yet been tackled. However, it is to be expected that device materials and hardware design constraints, such as noisy computations, finite number of resistive states of the device materials, saturating weight and activation ranges, and limited precision of analog-to-digital converters, will cause significant challenges to the successful training of state-of-the-art ANNs. By using analog hardware aware ANN training simulations, we here explore a number of simple algorithmic compensatory measures to cope with analog noise and limited weight and output ranges and resolutions, that dramatically improve the simulated training performances on RPU arrays on intermediately to large-scale ANNs.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
Efficient ConvNets for Analog Arrays
Authors:
Malte J. Rasch,
Tayfun Gokmen,
Mattia Rigotti,
Wilfried Haensch
Abstract:
Analog arrays are a promising upcoming hardware technology with the potential to drastically speed up deep learning. Their main advantage is that they compute matrix-vector products in constant time, irrespective of the size of the matrix. However, early convolution layers in ConvNets map very unfavorably onto analog arrays, because kernel matrices are typically small and the constant time operati…
▽ More
Analog arrays are a promising upcoming hardware technology with the potential to drastically speed up deep learning. Their main advantage is that they compute matrix-vector products in constant time, irrespective of the size of the matrix. However, early convolution layers in ConvNets map very unfavorably onto analog arrays, because kernel matrices are typically small and the constant time operation needs to be sequentially iterated a large number of times, reducing the speed up advantage for ConvNets. Here, we propose to replicate the kernel matrix of a convolution layer on distinct analog arrays, and randomly divide parts of the compute among them, so that multiple kernel matrices are trained in parallel. With this modification, analog arrays execute ConvNets with an acceleration factor that is proportional to the number of kernel matrices used per layer (here tested 16-128). Despite having more free parameters, we show analytically and in numerical experiments that this convolution architecture is self-regularizing and implicitly learns similar filters across arrays. We also report superior performance on a number of datasets and increased robustness to adversarial attacks. Our investigation suggests to revise the notion that mixed analog-digital hardware is not suitable for ConvNets.
△ Less
Submitted 3 July, 2018;
originally announced July 2018.
-
Training LSTM Networks with Resistive Cross-Point Devices
Authors:
Tayfun Gokmen,
Malte Rasch,
Wilfried Haensch
Abstract:
In our previous work we have shown that resistive cross point devices, so called Resistive Processing Unit (RPU) devices, can provide significant power and speed benefits when training deep fully connected networks as well as convolutional neural networks. In this work, we further extend the RPU concept for training recurrent neural networks (RNNs) namely LSTMs. We show that the map** of recurre…
▽ More
In our previous work we have shown that resistive cross point devices, so called Resistive Processing Unit (RPU) devices, can provide significant power and speed benefits when training deep fully connected networks as well as convolutional neural networks. In this work, we further extend the RPU concept for training recurrent neural networks (RNNs) namely LSTMs. We show that the map** of recurrent layers is very similar to the map** of fully connected layers and therefore the RPU concept can potentially provide large acceleration factors for RNNs as well. In addition, we study the effect of various device imperfections and system parameters on training performance. Symmetry of updates becomes even more crucial for RNNs; already a few percent asymmetry results in an increase in the test error compared to the ideal case trained with floating point numbers. Furthermore, the input signal resolution to device arrays needs to be at least 7 bits for successful training. However, we show that a stochastic rounding scheme can reduce the input signal resolution back to 5 bits. Further, we find that RPU device variations and hardware noise are enough to mitigate overfitting, so that there is less need for using dropout. We note that the models trained here are roughly 1500 times larger than the fully connected network trained on MNIST dataset in terms of the total number of multiplication and summation operations performed per epoch. Thus, here we attempt to study the validity of the RPU approach for large scale networks.
△ Less
Submitted 31 May, 2018;
originally announced June 2018.
-
Analog CMOS-based Resistive Processing Unit for Deep Neural Network Training
Authors:
Seyoung Kim,
Tayfun Gokmen,
Hyung-Min Lee,
Wilfried E. Haensch
Abstract:
Recently we have shown that an architecture based on resistive processing unit (RPU) devices has potential to achieve significant acceleration in deep neural network (DNN) training compared to today's software-based DNN implementations running on CPU/GPU. However, currently available device candidates based on non-volatile memory technologies do not satisfy all the requirements to realize the RPU…
▽ More
Recently we have shown that an architecture based on resistive processing unit (RPU) devices has potential to achieve significant acceleration in deep neural network (DNN) training compared to today's software-based DNN implementations running on CPU/GPU. However, currently available device candidates based on non-volatile memory technologies do not satisfy all the requirements to realize the RPU concept. Here, we propose an analog CMOS-based RPU design (CMOS RPU) which can store and process data locally and can be operated in a massively parallel manner. We analyze various properties of the CMOS RPU to evaluate the functionality and feasibility for acceleration of DNN training.
△ Less
Submitted 20 June, 2017;
originally announced June 2017.
-
Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices
Authors:
Tayfun Gokmen,
O. Murat Onen,
Wilfried Haensch
Abstract:
In a previous work we have detailed the requirements to obtain a maximal performance benefit by implementing fully connected deep neural networks (DNN) in form of arrays of resistive devices for deep learning. This concept of Resistive Processing Unit (RPU) devices we extend here towards convolutional neural networks (CNNs). We show how to map the convolutional layers to RPU arrays such that the p…
▽ More
In a previous work we have detailed the requirements to obtain a maximal performance benefit by implementing fully connected deep neural networks (DNN) in form of arrays of resistive devices for deep learning. This concept of Resistive Processing Unit (RPU) devices we extend here towards convolutional neural networks (CNNs). We show how to map the convolutional layers to RPU arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm. We find that the noise and bound limitations imposed due to analog nature of the computations performed on the arrays effect the training accuracy of the CNNs. Noise and bound management techniques are presented that mitigate these problems without introducing any additional complexity in the analog circuits and can be addressed by the digital circuits. In addition, we discuss digitally programmable update management and device variability reduction techniques that can be used selectively for some of the layers in a CNN. We show that combination of all those techniques enables a successful application of the RPU concept for training CNNs. The techniques discussed here are more general and can be applied beyond CNN architectures and therefore enables applicability of RPU approach for large class of neural network architectures.
△ Less
Submitted 22 May, 2017;
originally announced May 2017.