-
Post-Training Sparsity-Aware Quantization
Authors:
Gil Shomron,
Freddy Gabbay,
Samer Kurzum,
Uri Weiser
Abstract:
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Map** FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradatio…
▽ More
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Map** FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable, due to the increase in quantization noise. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities. 4-bit quantization, for example, is employed by dynamically examining the bits of 8-bit values and choosing a window of 4 bits, while first skip** zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we focus on pairs of 8-bit activations and examine whether one of the two is equal to zero. If one is equal to zero, the second can opportunistically use the other's 4-bit budget; if both do not equal zero, then each is dynamically quantized to 4 bits, as described. SPARQ achieves minor accuracy degradation and a practical hardware implementation. The code is available at https://github.com/gilshm/sparq.
△ Less
Submitted 28 October, 2021; v1 submitted 23 May, 2021;
originally announced May 2021.
-
Asymmetric Aging Effect on Modern Microprocessors
Authors:
Freddy Gabbay,
Avi Mendelson
Abstract:
Reliability is a crucial requirement in any modern microprocessor to assure correct execution over its lifetime. As mission critical components are becoming common in commodity systems; e.g., control of autonomous cars, the demand for reliable processing has even further heightened. Latest process technologies even worsened the situation; thus, microprocessors design has become highly susceptible…
▽ More
Reliability is a crucial requirement in any modern microprocessor to assure correct execution over its lifetime. As mission critical components are becoming common in commodity systems; e.g., control of autonomous cars, the demand for reliable processing has even further heightened. Latest process technologies even worsened the situation; thus, microprocessors design has become highly susceptible to reliability concerns. This paper examines asymmetric aging phenomenon, which is a major reliability concern in advanced process nodes. In this phenomenon, logical elements and memory cells suffer from unequal timing degradation over time and consequently introduce reliability concerns. So far, most studies approached asymmetric aging from circuit or physical design viewpoint, but these solutions were quite limited and suboptimal. In this paper we introduce an asymmetric aging aware micro-architecture that aims at reducing its impact. The study is mainly focused on the following subsystems: execution units, register files and the memory hierarchy. Our experiments indicate that the proposed solutions incur minimal overhead while significantly mitigating the asymmetric aging stress.
△ Less
Submitted 8 September, 2020;
originally announced September 2020.
-
Electromigration-Aware Architecture for Modern Microprocessors
Authors:
Freddy Gabbay,
Avi Mendelson
Abstract:
Reliability is a fundamental requirement in any microprocessor to guarantee correct execution over its lifetime. The design rules related to reliability depend on the process technology being used and the expected operating conditions of the device. To meet reliability requirements, advanced process technologies (28 nm and below) impose highly challenging design rules. Such design-for-reliability…
▽ More
Reliability is a fundamental requirement in any microprocessor to guarantee correct execution over its lifetime. The design rules related to reliability depend on the process technology being used and the expected operating conditions of the device. To meet reliability requirements, advanced process technologies (28 nm and below) impose highly challenging design rules. Such design-for-reliability rules have become a major burden on the flow of VLSI implementation because of the severe physical constraints they impose.
This paper focuses on electromigration (EM), which is one of the major critical factors affecting semiconductor reliability. EM is the aging process of on-die wires and vias and is induced by excessive current flow that can damage wires and may also significantly impact the integrated-circuit clock frequency. EM exerts a comprehensive global effect on devices because it impacts wires that may reside inside the standard or custom logical cells, between logical cells, inside memory elements, and within wires that interconnect functional blocks.
The design-implementation flow (synthesis and place-and-route) currently detects violations of EM-reliability rules and attempts to solve them. In contrast, this paper proposes a new approach to enhance these flows by using EM-aware architecture. Our results show that the proposed solution can relax EM design efforts in microprocessors and more than double microprocessor lifetime. This work demonstrates this proposed approach for modern microprocessors, although the principals and ideas can be adapted to other cases as well.
△ Less
Submitted 13 September, 2021; v1 submitted 4 May, 2020;
originally announced May 2020.
-
HCM: Hardware-Aware Complexity Metric for Neural Network Architectures
Authors:
Alex Karbachevsky,
Chaim Baskin,
Evgenii Zheltonozhskii,
Yevgeny Yermolin,
Freddy Gabbay,
Alex M. Bronstein,
Avi Mendelson
Abstract:
Convolutional Neural Networks (CNNs) have become common in many fields including computer vision, speech recognition, and natural language processing. Although CNN hardware accelerators are already included as part of many SoC architectures, the task of achieving high accuracy on resource-restricted devices is still considered challenging, mainly due to the vast number of design parameters that ne…
▽ More
Convolutional Neural Networks (CNNs) have become common in many fields including computer vision, speech recognition, and natural language processing. Although CNN hardware accelerators are already included as part of many SoC architectures, the task of achieving high accuracy on resource-restricted devices is still considered challenging, mainly due to the vast number of design parameters that need to be balanced to achieve an efficient solution. Quantization techniques, when applied to the network parameters, lead to a reduction of power and area and may also change the ratio between communication and computation. As a result, some algorithmic solutions may suffer from lack of memory bandwidth or computational resources and fail to achieve the expected performance due to hardware constraints. Thus, the system designer and the micro-architect need to understand at early development stages the impact of their high-level decisions (e.g., the architecture of the CNN and the amount of bits used to represent its parameters) on the final product (e.g., the expected power saving, area, and accuracy). Unfortunately, existing tools fall short of supporting such decisions.
This paper introduces a hardware-aware complexity metric that aims to assist the system designer of the neural network architectures, through the entire project lifetime (especially at its early stages) by predicting the impact of architectural and micro-architectural decisions on the final product. We demonstrate how the proposed metric can help evaluate different design alternatives of neural network models on resource-restricted devices such as real-time embedded systems, and to avoid making design mistakes at early stages.
△ Less
Submitted 26 April, 2020; v1 submitted 19 April, 2020;
originally announced April 2020.