-
Post-Training Sparsity-Aware Quantization
Authors:
Gil Shomron,
Freddy Gabbay,
Samer Kurzum,
Uri Weiser
Abstract:
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable, due to the increase in quantization noise. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities. 4-bit quantization, for example, is employed by dynamically examining the bits of 8-bit values and choosing a window of 4 bits, while first skipping zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we focus on pairs of 8-bit activations and examine whether one of the two is equal to zero. If one is equal to zero, the second can opportunistically use the other's 4-bit budget; if both do not equal zero, then each is dynamically quantized to 4 bits, as described. SPARQ achieves minor accuracy degradation and a practical hardware implementation. The code is available at https://github.com/gilshm/sparq.
Submitted 28 October, 2021; v1 submitted 23 May, 2021;
originally announced May 2021.
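The window selection and pair-wise budget sharing described in the SPARQ abstract above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes unsigned 8-bit activations, allows the 4-bit window to start at any bit position, and truncates the bits below the window instead of rounding. The function names window4 and quantize_pair are hypothetical.

```python
def window4(x: int) -> int:
    """Keep a 4-bit window starting at the most significant set bit.

    Assumption: x is an unsigned 8-bit activation, and the bits below
    the window are truncated (no rounding).
    """
    if not 0 <= x <= 255:
        raise ValueError("expected an unsigned 8-bit value")
    # Skip leading zero bits: bit_length() is the position of the top set bit + 1.
    shift = max(x.bit_length() - 4, 0)
    # Drop the bits below the window, then restore the magnitude.
    return (x >> shift) << shift


def quantize_pair(a: int, b: int) -> tuple[int, int]:
    """Quantize a pair of 8-bit activations under a shared 8-bit budget.

    If either value is zero, the other keeps its full 8 bits;
    otherwise each value is reduced to a 4-bit window.
    """
    if a == 0 or b == 0:
        return a, b
    return window4(a), window4(b)


if __name__ == "__main__":
    print(quantize_pair(0, 201))   # one value is zero, so 201 is kept exactly
    print(quantize_pair(27, 201))  # both non-zero, so each is windowed
```

In this sketch a value such as 27 (binary 00011011) loses only its lowest bit, because the window starts at the first set bit and does not spend bits on the three leading zeros.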
-
Post-Training BatchNorm Recalibration
Authors:
Gil Shomron,
Uri Weiser
Abstract:
We revisit non-blocking simultaneous multithreading (NB-SMT) introduced previously by Shomron and Weiser (2020). NB-SMT trades accuracy for performance by occasionally "squeezing" more than one thread into a shared multiply-and-accumulate (MAC) unit. However, the method of accommodating more than one thread in a shared MAC unit may contribute noise to the computations, thereby changing the internal statistics of the model. We show that substantial model performance can be recouped by post-training recalibration of the batch normalization layers' running mean and running variance statistics, given the presence of NB-SMT.
Submitted 12 October, 2020;
originally announced October 2020.
-
Semantic prefetching using forecast slices
Authors:
Leeor Peled,
Uri Weiser,
Yoav Etsion
Abstract:
Modern prefetchers identify memory access patterns in order to predict future accesses. However, many applications exhibit irregular access patterns that do not manifest spatio-temporal locality in the memory address space. Such applications usually do not fall under the scope of existing prefetching techniques, which observe only the stream of addresses dispatched by the memory unit but not the code flows that produce them. Similarly, temporal correlation prefetchers detect recurring relations between accesses, but do not track the chain of causality in program code that manifested the memory locality. Conversely, techniques that are code-aware are limited to the basic program functionality and are bounded by the machine depth. In this paper we show that contextual analysis of the code flows that generate memory accesses can detect recurring code patterns and expose their underlying semantics even for irregular access patterns. Moreover, program locality artifacts can be used to enhance the memory traversal code and predict future accesses. We present the semantic prefetcher that analyzes programs at run-time and learns their memory dependency chains and address calculation flows. The prefetcher then constructs forecast slices and injects them at key points to trigger timely prefetching of future contextually-related iterations. We show how this approach takes the best of both worlds, augmenting code injection with forecast functionality and relying on context-based temporal correlation of code slices. This combination allows us to overcome critical memory latencies that are currently not covered by any other prefetcher. Our evaluation of the semantic prefetcher using an industrial-grade, cycle-accurate x86 simulator shows that it improves performance by 24% on average over SPEC 2006 (outliers up to 3.7x), and 16% on average over SPEC 2017 (outliers up to 1.85x), using only ~6KB.
Submitted 12 May, 2020;
originally announced May 2020.
-
Non-Blocking Simultaneous Multithreading: Embracing the Resiliency of Deep Neural Networks
Authors:
Gil Shomron,
Uri Weiser
Abstract:
Deep neural networks (DNNs) are known for their inability to utilize underlying hardware resources due to hardware susceptibility to sparse activations and weights. Even in finer granularities, many of the non-zero values hold a portion of zero-valued bits that may cause inefficiencies when executed on hardware. Inspired by conventional CPU simultaneous multithreading (SMT) that increases computer resource utilization by sharing them across several threads, we propose non-blocking SMT (NB-SMT) designated for DNN accelerators. Like conventional SMT, NB-SMT shares hardware resources among several execution flows. Yet, unlike SMT, NB-SMT is non-blocking, as it handles structural hazards by exploiting the algorithmic resiliency of DNNs. Instead of opportunistically dispatching instructions while they wait in a reservation station for available hardware, NB-SMT temporarily reduces the computation precision to accommodate all threads at once, enabling a non-blocking operation. We demonstrate NB-SMT applicability using SySMT, an NB-SMT-enabled output-stationary systolic array (OS-SA). Compared with a conventional OS-SA, a 2-threaded SySMT consumes 1.4x the area and delivers 2x speedup with 33% energy savings and less than 1% accuracy degradation of state-of-the-art CNNs with ImageNet. A 4-threaded SySMT consumes 2.5x the area and delivers, for example, 3.4x speedup and 39% energy savings with 1% accuracy degradation of 40%-pruned ResNet-18.
Submitted 17 September, 2020; v1 submitted 17 April, 2020;
originally announced April 2020.
-
Robust Quantization: One Model to Rule Them All
Authors:
Moran Shkolnik,
Brian Chmiel,
Ron Banner,
Gil Shomron,
Yury Nahshan,
Alex Bronstein,
Uri Weiser
Abstract:
Neural network quantization methods often involve simulating the quantization process during training, making the trained model highly dependent on the target bit-width and precise way quantization is performed. Robust quantization offers an alternative approach with improved tolerance to different classes of data-types and quantization policies. It opens up new exciting applications where the quantization process is not static and can vary to meet different circumstances and implementations. To address this issue, we propose a method that provides intrinsic robustness to the model against a broad range of quantization processes. Our method is motivated by theoretical arguments and enables us to store a single generic model capable of operating at various bit-widths and quantization policies. We validate our method's effectiveness on different ImageNet models.
Submitted 22 October, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Thanks for Nothing: Predicting Zero-Valued Activations with Lightweight Convolutional Neural Networks
Authors:
Gil Shomron,
Ron Banner,
Moran Shkolnik,
Uri Weiser
Abstract:
Convolutional neural networks (CNNs) introduce state-of-the-art results for various tasks with the price of high computational demands. Inspired by the observation that spatial correlation exists in CNN output feature maps (ofms), we propose a method to dynamically predict whether ofm activations are zero-valued or not according to their neighboring activation values, thereby avoiding zero-valued activations and reducing the number of convolution operations. We implement the zero activation predictor (ZAP) with a lightweight CNN, which imposes negligible overheads and is easy to deploy on existing models. ZAPs are trained by mimicking hidden layer outputs, thereby enabling parallel and label-free training. Furthermore, without retraining, each ZAP can be tuned to a different operating point trading accuracy for MAC reduction.
Submitted 13 July, 2020; v1 submitted 17 September, 2019;
originally announced September 2019.
-
Spatial Correlation and Value Prediction in Convolutional Neural Networks
Authors:
Gil Shomron,
Uri Weiser
Abstract:
Convolutional neural networks (CNNs) are a widely used form of deep neural networks, introducing state-of-the-art results for different problems such as image classification, computer vision tasks, and speech recognition. However, CNNs are compute intensive, requiring billions of multiply-accumulate (MAC) operations per input. To reduce the number of MACs in CNNs, we propose a value prediction method that exploits the spatial correlation of zero-valued activations within the CNN output feature maps, thereby saving convolution operations. Our method reduces the number of MAC operations by 30.4%, averaged on three modern CNNs for ImageNet, with top-1 accuracy degradation of 1.7%, and top-5 accuracy degradation of 1.1%.
Submitted 1 January, 2019; v1 submitted 21 July, 2018;
originally announced July 2018.
-
A neural network memory prefetcher using semantic locality
Authors:
Leeor Peled,
Uri Weiser,
Yoav Etsion
Abstract:
Accurate memory prefetching is paramount for processor performance, and modern processors employ various techniques to identify and prefetch different memory access patterns. While most modern prefetchers target spatio-temporal patterns by matching memory addresses that are accessed in close proximity (either in space or time), the recently proposed concept of semantic locality views locality as an artifact of the algorithmic level and searches for correlations between memory accesses and program state. While this approach was shown to be effective, capturing semantic locality requires significant associative learning capabilities. In this paper we utilize neural networks for this task. We leverage recent advances in machine learning to propose a neural network prefetcher. We show that by observing program context, this prefetcher can learn distinct memory access patterns that cannot be covered by other state-of-the-art prefetchers. We evaluate the neural network prefetcher over SPEC2006, Graph500, and several microbenchmarks. We show that the prefetcher can deliver an average speedup of 30% for SPEC2006 (up to 2.7x) and up to 4.6x over kernels. We also present a high-level design of our prefetcher, explore the power, energy and area limitations, and propose several optimizations for feasibility. We believe that this line of research can further improve the efficiency of such neural networks and allow harnessing them for additional micro-architectural predictions.
Submitted 26 July, 2018; v1 submitted 19 March, 2018;
originally announced April 2018.
-
MultiAmdahl: Optimal Resource Allocation in Heterogeneous Architectures
Authors:
Leonid Yavits,
Amir Morad,
Uri Weiser,
Ran Ginosar
Abstract:
Future multiprocessor chips will integrate many different units, each tailored to a specific computation. When designing such a system, the chip architect must decide how to distribute limited system resources such as area, power, and energy among the computational units. We extend MultiAmdahl, an analytical optimization technique for resource allocation in heterogeneous architectures, for energy optimality under a variety of constant system power scenarios. We conclude that reduction in constant system power should be met by reallocating resources from general-purpose computing to heterogeneous accelerator-dominated computing, to keep the overall energy consumption at a minimum. We extend this conclusion to offer an intuition regarding energy-optimal resource allocation in data center computing.
Submitted 19 May, 2017;
originally announced May 2017.
-
A Resistive CAM Processing-in-Storage Architecture for DNA Sequence Alignment
Authors:
Roman Kaplan,
Leonid Yavits,
Ran Ginosar,
Uri Weiser
Abstract:
A novel processing-in-storage (PRinS) architecture based on Resistive CAM (ReCAM) is described and proposed for Smith-Waterman (S-W) sequence alignment. The ReCAM massively-parallel compare operation finds matching base-pairs in a fixed number of cycles, regardless of sequence length. The ReCAM PRinS S-W algorithm is simulated and compared to FPGA, Xeon Phi and GPU-based implementations, showing at least 4.7x higher throughput and at least 15x lower power dissipation.
Submitted 11 June, 2017; v1 submitted 17 January, 2017;
originally announced January 2017.
-
Convex Optimization of Real Time SoC
Authors:
L. Yavits,
A. Morad,
R. Ginosar,
U. Weiser
Abstract:
Convex optimization methods are employed to optimize a real-time (RT) system-on-chip (SoC) under a variety of physical resource-driven constraints, demonstrated on an industry MPEG2 encoder SoC. The power optimization is compared to a conventional performance-optimization framework, showing a factor of two and a half saving in power. Convex optimization is shown to be very efficient in high-level, early-stage design exploration, guiding computer architects as to the choice of area, voltage, and frequency of the individual components of the Chip Multiprocessor (CMP).
Submitted 19 May, 2017; v1 submitted 28 January, 2016;
originally announced January 2016.
-
Multi-Amdahl: Optimal Resource Sharing with Multiple Program Execution Segments
Authors:
Tsahee Zidenberg,
Isaac Keslassy,
Uri Weiser
Abstract:
This paper presents Multi-Amdahl, a resource allocation analytical tool for heterogeneous systems. Our model includes multiple program execution segments, where each one is accelerated by a specific hardware unit. The acceleration speedup of the specific hardware unit is a function of a limited resource, such as the unit area, power, or energy. Using the Lagrange theorem we discover the optimal resource distribution between all specific units. We then illustrate this general Multi-Amdahl technique using several examples of area and power allocation among several cores and accelerators.
Submitted 15 May, 2011;
originally announced May 2011.
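The Lagrange step mentioned in the Multi-Amdahl abstract above can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: t_i is the execution time of segment i on a reference unit, a_i is the resource (for example, area) given to the unit that accelerates segment i, f_i(a_i) is the resulting speedup, and A is the total resource budget.

```latex
% Minimize total execution time under a fixed resource budget (assumed notation).
\begin{align}
  \min_{a_1,\dots,a_n} \; T &= \sum_{i=1}^{n} \frac{t_i}{f_i(a_i)}
  \quad \text{subject to} \quad \sum_{i=1}^{n} a_i = A, \\
  \mathcal{L} &= \sum_{i=1}^{n} \frac{t_i}{f_i(a_i)}
  + \lambda \left( \sum_{i=1}^{n} a_i - A \right), \\
  \frac{\partial \mathcal{L}}{\partial a_i} &= -\,\frac{t_i \, f_i'(a_i)}{f_i(a_i)^2} + \lambda = 0
  \quad \Longrightarrow \quad
  \frac{t_i \, f_i'(a_i)}{f_i(a_i)^2} = \lambda \quad \text{for all } i.
\end{align}
```

Under these assumptions, the stationarity condition says that at the optimum the marginal reduction in execution time per unit of resource is the same for every unit; otherwise moving resource from a unit with a smaller marginal gain to one with a larger gain would reduce the total time.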