-
ZOBNN: Zero-Overhead Dependable Design of Binary Neural Networks with Deliberately Quantized Parameters
Authors:
Behnam Ghavami,
Mohammad Shahidzadeh,
Lesley Shannon,
Steve Wilton
Abstract:
Low-precision weights and activations in deep neural networks (DNNs) outperform their full-precision counterparts in terms of hardware efficiency. When implemented with low-precision operations, specifically in the extreme case where network parameters are binarized (i.e. BNNs), the two most frequently mentioned benefits of quantization are reduced memory consumption and a faster inference process…
▽ More
Low-precision weights and activations in deep neural networks (DNNs) outperform their full-precision counterparts in terms of hardware efficiency. When implemented with low-precision operations, specifically in the extreme case where network parameters are binarized (i.e. BNNs), the two most frequently mentioned benefits of quantization are reduced memory consumption and a faster inference process. In this paper, we introduce a third advantage of very low-precision neural networks: improved fault-tolerance attribute. We investigate the impact of memory faults on state-of-the-art binary neural networks (BNNs) through comprehensive analysis. Despite the inclusion of floating-point parameters in BNN architectures to improve accuracy, our findings reveal that BNNs are highly sensitive to deviations in these parameters caused by memory faults. In light of this crucial finding, we propose a technique to improve BNN dependability by restricting the range of float parameters through a novel deliberately uniform quantization. The introduced quantization technique results in a reduction in the proportion of floating-point parameters utilized in the BNN, without incurring any additional computational overheads during the inference stage. The extensive experimental fault simulation on the proposed BNN architecture (i.e. ZOBNN) reveal a remarkable 5X enhancement in robustness compared to conventional floating-point DNN. Notably, this improvement is achieved without incurring any computational overhead. Crucially, this enhancement comes without computational overhead. \ToolName~excels in critical edge applications characterized by limited computational resources, prioritizing both dependability and real-time performance.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
QUTE: Quantifying Uncertainty in TinyML models with Early-exit-assisted ensembles
Authors:
Nikhil P Ghanathe,
Steve Wilton
Abstract:
Existing methods for uncertainty quantification incur massive memory and compute overhead, often requiring multiple models/inferences. Hence they are impractical on ultra-low-power KB-sized TinyML devices. To reduce overhead, prior works have proposed the use of early-exit networks as ensembles to quantify uncertainty in a single forward-pass. However, they still have a prohibitive cost for tinyML…
▽ More
Existing methods for uncertainty quantification incur massive memory and compute overhead, often requiring multiple models/inferences. Hence they are impractical on ultra-low-power KB-sized TinyML devices. To reduce overhead, prior works have proposed the use of early-exit networks as ensembles to quantify uncertainty in a single forward-pass. However, they still have a prohibitive cost for tinyML. To address these challenges, we propose QUTE, a novel resource-efficient early-exit-assisted ensemble architecture optimized for tinyML models. QUTE adds additional output blocks at the final exit of the base network and distills the knowledge of early-exits into these blocks to create a diverse and lightweight ensemble architecture. Our results show that QUTE outperforms popular prior works, and improves the quality of uncertainty estimates by 6% with 3.1x lower model size on average compared to the most relevant prior work. Furthermore, we demonstrate that QUTE is also effective in detecting co-variate shifted and out-of-distribution inputs, and shows competitive performance relative to G-ODIN, a state-of-the-art generalized OOD detector.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization
Authors:
Behnam Ghavami,
Amin Kamjoo,
Lesley Shannon,
Steve Wilton
Abstract:
The imperative to deploy Deep Neural Network (DNN) models on resource-constrained edge devices, spurred by privacy concerns, has become increasingly apparent. To facilitate the transition from cloud to edge computing, this paper introduces a technique that effectively reduces the memory footprint of DNNs, accommodating the limitations of resource-constrained edge devices while preserving model acc…
▽ More
The imperative to deploy Deep Neural Network (DNN) models on resource-constrained edge devices, spurred by privacy concerns, has become increasingly apparent. To facilitate the transition from cloud to edge computing, this paper introduces a technique that effectively reduces the memory footprint of DNNs, accommodating the limitations of resource-constrained edge devices while preserving model accuracy. Our proposed technique, named Post-Training Intra-Layer Multi-Precision Quantization (PTILMPQ), employs a post-training quantization approach, eliminating the need for extensive training data. By estimating the importance of layers and channels within the network, the proposed method enables precise bit allocation throughout the quantization process. Experimental results demonstrate that PTILMPQ offers a promising solution for deploying DNNs on edge devices with restricted memory resources. For instance, in the case of ResNet50, it achieves an accuracy of 74.57\% with a memory footprint of 9.5 MB, representing a 25.49\% reduction compared to previous similar methods, with only a minor 1.08\% decrease in accuracy.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
T-RECX: Tiny-Resource Efficient Convolutional neural networks with early-eXit
Authors:
Nikhil P Ghanathe,
Steve Wilton
Abstract:
Deploying Machine learning (ML) on milliwatt-scale edge devices (tinyML) is gaining popularity due to recent breakthroughs in ML and Internet of Things (IoT). Most tinyML research focuses on model compression techniques that trade accuracy (and model capacity) for compact models to fit into the KB-sized tiny-edge devices. In this paper, we show how such models can be enhanced by the addition of an…
▽ More
Deploying Machine learning (ML) on milliwatt-scale edge devices (tinyML) is gaining popularity due to recent breakthroughs in ML and Internet of Things (IoT). Most tinyML research focuses on model compression techniques that trade accuracy (and model capacity) for compact models to fit into the KB-sized tiny-edge devices. In this paper, we show how such models can be enhanced by the addition of an early exit intermediate classifier. If the intermediate classifier exhibits sufficient confidence in its prediction, the network exits early thereby, resulting in considerable savings in time. Although early exit classifiers have been proposed in previous work, these previous proposals focus on large networks, making their techniques suboptimal/impractical for tinyML applications. Our technique is optimized specifically for tiny-CNN sized models. In addition, we present a method to alleviate the effect of network overthinking by leveraging the representations learned by the early exit. We evaluate T-RecX on three CNNs from the MLPerf tiny benchmark suite for image classification, keyword spotting and visual wake word detection tasks. Our results show that T-RecX 1) improves the accuracy of baseline network, 2) achieves 31.58% average reduction in FLOPS in exchange for one percent accuracy across all evaluated models. Furthermore, we show that our methods consistently outperform popular prior works on the tiny-CNNs we evaluate.
△ Less
Submitted 26 April, 2023; v1 submitted 13 July, 2022;
originally announced July 2022.
-
MAFIA: Machine Learning Acceleration on FPGAs for IoT Applications
Authors:
Nikhil Pratap Ghanathe,
Vivek Seshadri,
Rahul Sharma,
Steve Wilton,
Aayan Kumar
Abstract:
Recent breakthroughs in ML have produced new classes of models that allow ML inference to run directly on milliwatt-powered IoT devices. On one hand, existing ML-to-FPGA compilers are designed for deep neural-networks on large FPGAs. On the other hand, general-purpose HLS tools fail to exploit properties specific to ML inference, thereby resulting in suboptimal performance. We propose MAFIA, a too…
▽ More
Recent breakthroughs in ML have produced new classes of models that allow ML inference to run directly on milliwatt-powered IoT devices. On one hand, existing ML-to-FPGA compilers are designed for deep neural-networks on large FPGAs. On the other hand, general-purpose HLS tools fail to exploit properties specific to ML inference, thereby resulting in suboptimal performance. We propose MAFIA, a tool to compile ML inference on small form-factor FPGAs for IoT applications. MAFIA provides native support for linear algebra operations and can express a variety of ML algorithms, including state-of-the-art models. We show that MAFIA-generated programs outperform best-performing variant of a commercial HLS compiler by 2.5x on average.
△ Less
Submitted 8 July, 2021;
originally announced July 2021.
-
LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks
Authors:
Daniel H. Noronha,
Bahar Salehpour,
Steven J. E. Wilton
Abstract:
Recent work has shown that Field-Programmable Gate Arrays (FPGAs) play an important role in the acceleration of Machine Learning applications. Initial specification of machine learning applications are often done using a high-level Python-oriented framework such as Tensorflow, followed by a manual translation to either C or RTL for synthesis using vendor tools. This manual translation step is time…
▽ More
Recent work has shown that Field-Programmable Gate Arrays (FPGAs) play an important role in the acceleration of Machine Learning applications. Initial specification of machine learning applications are often done using a high-level Python-oriented framework such as Tensorflow, followed by a manual translation to either C or RTL for synthesis using vendor tools. This manual translation step is time-consuming and requires expertise that limit the applicability of FPGAs in this important domain. In this paper, we present an open-source tool-flow that maps numerical computation models written in Tensorflow to synthesizable hardware. Unlike other tools, which are often constrained by a small number of inflexible templates, our flow uses Google's XLA compiler which emits LLVM code directly from a Tensorflow specification. This LLVM code can then be used with a high-level synthesis tool to automatically generate hardware. We show that our flow allows users to generate Deep Neural Networks with very few lines of Python code.
△ Less
Submitted 13 July, 2018;
originally announced July 2018.
-
Algorithms for Embedding Quantum-Dot Cellular Automata Networks onto a Quantum Annealing Processor
Authors:
Jacob Retallick,
Michael Babcock,
Miguel Aroca-Ouellette,
Shane McNamara,
Steve Wilton,
Aidan Roy,
Mark Johnson,
Konrad Walus
Abstract:
Advancements in computing based on qubit networks, and in particular the flux-qubit processor architecture developed by D-Wave System's Inc., have enabled the physical simulation of quantum-dot cellular automata (QCA) networks beyond the limit of classical methods. However, the embedding of QCA networks onto the available processor architecture is a key challenge in preparing such simulations. In…
▽ More
Advancements in computing based on qubit networks, and in particular the flux-qubit processor architecture developed by D-Wave System's Inc., have enabled the physical simulation of quantum-dot cellular automata (QCA) networks beyond the limit of classical methods. However, the embedding of QCA networks onto the available processor architecture is a key challenge in preparing such simulations. In this work, two approaches to embedding QCA circuits are characterized: a dense placement algorithm that uses a routing method based on negotiated congestion; and a heuristic method implemented in D-Wave's Solver API package. A set of benchmark QCA networks is used to characterise the algorithms and a stochastic circuit generator is employed to investigate the performance for different processor sizes and active flux-qubit yields.
△ Less
Submitted 14 September, 2017;
originally announced September 2017.
-
Enabling Effective FPGA Debug using Overlays: Opportunities and Challenges
Authors:
Fatemeh Eslami,
Eddie Hung,
Steven J. E. Wilton
Abstract:
FPGAs are going mainstream. Major companies that were not traditionally FPGA-focused are now seeking ways to exploit the benefits of reconfigurable technology and provide it to their customers. In order to do so, a debug ecosystem that provides for effective visibility into a working design and quick debug turn-around times is essential. Overlays have the opportunity to play a key role in this eco…
▽ More
FPGAs are going mainstream. Major companies that were not traditionally FPGA-focused are now seeking ways to exploit the benefits of reconfigurable technology and provide it to their customers. In order to do so, a debug ecosystem that provides for effective visibility into a working design and quick debug turn-around times is essential. Overlays have the opportunity to play a key role in this ecosystem. In this overview paper, we discuss how an overlay fabric that allows the user to rapidly add debug instrumentation to a design can be created and exploited. We discuss the requirements of such an overlay and some of the research challenges and opportunities that need to be addressed. To make our exposition concrete, we use two previously-published examples of overlays that have been developed to implement debug instrumentation.
△ Less
Submitted 21 June, 2016;
originally announced June 2016.
-
Allowing Software Developers to Debug HLS Hardware
Authors:
Jeffrey Goeders,
Steven J. E. Wilton
Abstract:
High-Level Synthesis (HLS) is emerging as a mainstream design methodology, allowing software designers to enjoy the benefits of a hardware implementation. Significant work has led to effective compilers that produce high-quality hardware designs from software specifications. However, in order to fully benefit from the promise of HLS, a complete ecosystem that provides the ability to analyze, debug…
▽ More
High-Level Synthesis (HLS) is emerging as a mainstream design methodology, allowing software designers to enjoy the benefits of a hardware implementation. Significant work has led to effective compilers that produce high-quality hardware designs from software specifications. However, in order to fully benefit from the promise of HLS, a complete ecosystem that provides the ability to analyze, debug, and optimize designs is essential. This ecosystem has to be accessible to software designers. This is challenging, since software developers view their designs very differently than how they are physically implemented on-chip. Rather than individual sequential lines of code, the implementation consists of gates operating in parallel across multiple clock cycles. In this paper, we report on our efforts to create an ecosystem that allows software designers to debug HLS-generated circuits in a familiar manner. We have implemented our ideas in a debug framework that will be included in the next release of the popular LegUp high-level synthesis tool.
△ Less
Submitted 27 August, 2015;
originally announced August 2015.