Search | arXiv e-print repository

Exploring FPGA designs for MX and beyond

Authors: Ebby Samson, Naveen Mellempudi, Wayne Luk, George A. Constantinides

Abstract: A number of companies recently worked together to release the new Open Compute Project MX standard for low-precision computation, aimed at efficient neural network implementation. In this paper, we describe and evaluate the first open-source FPGA implementation of the arithmetic defined in the standard. Our designs fully support all the standard's concrete formats for conversion into and out of MX… ▽ More A number of companies recently worked together to release the new Open Compute Project MX standard for low-precision computation, aimed at efficient neural network implementation. In this paper, we describe and evaluate the first open-source FPGA implementation of the arithmetic defined in the standard. Our designs fully support all the standard's concrete formats for conversion into and out of MX formats and for the standard-defined arithmetic operations, as well as arbitrary fixed-point and floating-point formats. Certain elements of the standard are left as implementation-defined, and we present the first concrete FPGA-inspired choices for these elements, which we outline in the paper. Our library of optimized hardware components is available open source, and can be used to build larger systems. For this purpose, we also describe and release an open-source Pytorch library for quantization into the new standard, integrated with the Brevitas library so that the community can develop novel neural network designs quantized with MX formats in mind. We demonstrate the usability and efficacy of our libraries via the implementation of example neural networks such as ResNet-18 on the ImageNet ILSVRC12 dataset. Our testing shows that MX is very effective for formats such as INT5 or FP6 which are not natively supported on GPUs. This gives FPGAs an advantage as they have the flexibility to implement a custom datapath and take advantage of the smaller area footprints offered by these formats. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 8 pages, 4 figures

arXiv:2406.14963 [pdf, other]

Optimised Grouped-Query Attention Mechanism for Transformers

Authors: Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

Abstract: Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grou** an MHA to a GQA for better model performance. Our Asy… ▽ More Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grou** an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grou**. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted at ICML2024 ES-FoMo-II Workshop

arXiv:2406.14956 [pdf, other]

Unlocking the Global Synergies in Low-Rank Adapters

Authors: Zixi Zhang, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

Abstract: Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the ef… ▽ More Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted at ICML2024 ES-FoMo-II Workshop

arXiv:2406.12421 [pdf, other]

ROVER: RTL Optimization via Verified E-Graph Rewriting

Authors: Samuel Coward, Theo Drane, George A. Constantinides

Abstract: Manual RTL design and optimization remains prevalent across the semiconductor industry because commercial logic and high-level synthesis tools are unable to match human designs. Our experience in industrial datapath design demonstrates that manual optimization can typically be decomposed into a sequence of local equivalence preserving transformations. By formulating datapath optimization as a grap… ▽ More Manual RTL design and optimization remains prevalent across the semiconductor industry because commercial logic and high-level synthesis tools are unable to match human designs. Our experience in industrial datapath design demonstrates that manual optimization can typically be decomposed into a sequence of local equivalence preserving transformations. By formulating datapath optimization as a graph rewriting problem we automate design space exploration in a tool we call ROVER. We develop a set of mixed precision RTL rewrite rules inspired by designers at Intel and an accompanying automated validation framework. A particular challenge in datapath design is to determine a productive order in which to apply transformations as this can be design dependent. ROVER resolves this problem by building upon the e-graph data structure, which compactly represents a design space of equivalent implementations. By applying rewrites to this data structure, ROVER generates a set of efficient and functionally equivalent design options. From the ROVER generated e-graph we select an efficient implementation. To accurately model the circuit area we develop a theoretical cost metric and then an integer linear programming model to extract the optimal implementation. To build trust in the generated design ROVER also produces a back-end verification certificate that can be checked using industrial tools. We apply ROVER to both Intel-provided and open-source benchmarks, and see up to a 63% reduction in circuit area. ROVER is also able to generate a customized library of distinct implementations from a given parameterizable RTL design, improving circuit area across the range of possible instantiations. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.03227 [pdf, other]

Soft GPGPU versus IP cores: Quantifying and Reducing the Performance Gap

Authors: Martin Langhammer, George A. Constantinides

Abstract: eGPU, a recently-reported soft GPGPU for FPGAs, has demonstrated very high clock frequencies (more than 750 MHz) and small footprint. This means that for the first time, commercial soft processors may be competitive for the kind of heavy numerical computations common in FPGA-based digital signal processing. In this paper we take a deep dive into the performance of the eGPU family on FFT computatio… ▽ More eGPU, a recently-reported soft GPGPU for FPGAs, has demonstrated very high clock frequencies (more than 750 MHz) and small footprint. This means that for the first time, commercial soft processors may be competitive for the kind of heavy numerical computations common in FPGA-based digital signal processing. In this paper we take a deep dive into the performance of the eGPU family on FFT computation, in order to quantify the performance gap between state-of-the-art soft processors and commercial IP cores specialized for this task. In the process, we propose two novel architectural features for the eGPU that improve the efficiency of the design by 50\% when executing the FFTs. The end-result is that our modified GPGPU takes only 3 times the performance-area product of a specialized IP core, yet as a programmable processor is able to execute arbitrary software-defined algorithms. Further comparison to Nvidia A100 GPGPUs demonstrates the superior efficiency of eGPU on FFTs of the size studied (256 to 4096-point). △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2404.12336 [pdf, other]

Combining Power and Arithmetic Optimization via Datapath Rewriting

Authors: Samuel Coward, Theo Drane, Emiliano Morini, George Constantinides

Abstract: Industrial datapath designers consider dynamic power consumption to be a key metric. Arithmetic circuits contribute a major component of total chip power consumption and are therefore a common target for power optimization. While arithmetic circuit area and dynamic power consumption are often correlated, there is also a tradeoff to consider, as additional gates can be added to explicitly reduce ar… ▽ More Industrial datapath designers consider dynamic power consumption to be a key metric. Arithmetic circuits contribute a major component of total chip power consumption and are therefore a common target for power optimization. While arithmetic circuit area and dynamic power consumption are often correlated, there is also a tradeoff to consider, as additional gates can be added to explicitly reduce arithmetic circuit activity and hence reduce power consumption. In this work, we consider two forms of power optimization and their interaction: circuit area reduction via arithmetic optimization, and the elimination of redundant computations using both data and clock gating. By encoding both these classes of optimization as local rewrites of expressions, our tool flow can simultaneously explore them, uncovering new opportunities for power saving through arithmetic rewrites using the e-graph data structure. Since power consumption is highly dependent upon the workload performed by the circuit, our tool flow facilitates a data dependent design paradigm, where an implementation is automatically tailored to particular contexts of data activity. We develop an automated RTL to RTL optimization framework, ROVER, that takes circuit input stimuli and generates power-efficient architectures. We evaluate the effectiveness on both open-source arithmetic benchmarks and benchmarks derived from Intel production examples. The tool is able to reduce the total power consumption by up to 33.9%. △ Less

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2403.12285 [pdf, other]

FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications

Authors: Thanos Konstantinidis, Giorgos Iacovides, Mingxue Xu, Tony G. Constantinides, Danilo Mandic

Abstract: There are multiple sources of financial news online which influence market movements and trader's decisions. This highlights the need for accurate sentiment analysis, in addition to having appropriate algorithmic trading techniques, to arrive at better informed trading decisions. Standard lexicon based sentiment approaches have demonstrated their power in aiding financial decisions. However, they… ▽ More There are multiple sources of financial news online which influence market movements and trader's decisions. This highlights the need for accurate sentiment analysis, in addition to having appropriate algorithmic trading techniques, to arrive at better informed trading decisions. Standard lexicon based sentiment approaches have demonstrated their power in aiding financial decisions. However, they are known to suffer from issues related to context sensitivity and word ordering. Large Language Models (LLMs) can also be used in this context, but they are not finance-specific and tend to require significant computational resources. To facilitate a finance specific LLM framework, we introduce a novel approach based on the Llama 2 7B foundational model, in order to benefit from its generative nature and comprehensive language manipulation. This is achieved by fine-tuning the Llama2 7B model on a small portion of supervised financial sentiment analysis data, so as to jointly handle the complexities of financial lexicon and context, and further equip** it with a neural network based decision mechanism. Such a generator-classifier scheme, referred to as FinLlama, is trained not only to classify the sentiment valence but also quantify its strength, thus offering traders a nuanced insight into financial news articles. Complementing this, the implementation of parameter-efficient fine-tuning through LoRA optimises trainable parameters, thus minimising computational and memory requirements, without sacrificing accuracy. Simulation results demonstrate the ability of the proposed FinLlama to provide a framework for enhanced portfolio management decisions and increased market returns. These results underpin the ability of FinLlama to construct high-return portfolios which exhibit enhanced resilience, even during volatile periods and unpredictable market events. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.00849 [pdf, other]

NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions

Authors: Marta Andronic, George A. Constantinides

Abstract: Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed map** neurons with quantized inputs and outpu… ▽ More Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed map** neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and map** entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, but with rigid sparsity and quantization enforced between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific FPGA hardware-inspired sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to up to $4.3\times$ lower latency NNs for the same accuracy. △ Less

Submitted 3 July, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

arXiv:2402.02446 [pdf, other]

LQER: Low-Rank Quantization Error Reconstruction for LLMs

Authors: Cheng Zhang, Jianyi Cheng, George A. Constantinides, Yiren Zhao

Abstract: Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nea… ▽ More Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nearly-lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-base iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer △ Less

Submitted 30 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted at ICML2024

arXiv:2401.04261 [pdf, other]

A Statically and Dynamically Scalable Soft GPGPU

Authors: Martin Langhammer, George A. Constantinides

Abstract: Current soft processor architectures for FPGAs do not utilize the potential of the massive parallelism available. FPGAs now support many thousands of embedded floating point operators, and have similar computational densities to GPGPUs. Several soft GPGPU or SIMT processors have been published, but the reported large areas and modest Fmax makes their widespread use unlikely for commercial designs.… ▽ More Current soft processor architectures for FPGAs do not utilize the potential of the massive parallelism available. FPGAs now support many thousands of embedded floating point operators, and have similar computational densities to GPGPUs. Several soft GPGPU or SIMT processors have been published, but the reported large areas and modest Fmax makes their widespread use unlikely for commercial designs. In this paper we take an alternative approach, building the soft GPU microarchitecture around the FPGA resource mix available. We demonstrate a statically scalable soft GPGPU processor (where both parameters and feature set can be determined at configuration time) that always closes timing at the peak speed of the slowest embedded component in the FPGA (DSP or hard memory), with a completely unconstrained compile into a current Intel Agilex FPGA. We also show dynamic scalability, where a subset of the thread space can be specified on an instruction-by-instruction basis. For one example core type, we show a logic range -- depending on the configuration -- of 4k to 10k ALMs, along with 24 to 32 DSP Blocks, and 50 to 250 M20K memories. All of these instances close timing at 771 MHz, a performance level limited only by the DSP Blocks. We describe our methodology for reliably achieving this clock rate by matching the processor pipeline structure to the physical structure of the FPGA fabric. We also benchmark several algorithms across a range of data sizes, and compare to a commercial soft RISC processor. △ Less

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2312.06004 [pdf, other]

Multiplier Optimization via E-Graph Rewriting

Authors: Andy Wanna, Samuel Coward, Theo Drane, George A. Constantinides, Miloš D. Ercegovac

Abstract: Multiplier circuits account for significant resource usage in datapath-dominated circuit designs, and RTL designers continue to build bespoke hand-crafted multiplication arrays for their particular application. The construction of an optimized multiplier presents trade-offs between pre-processing to generate a smaller array and array reduction. A data structure known as an e-graph has recently bee… ▽ More Multiplier circuits account for significant resource usage in datapath-dominated circuit designs, and RTL designers continue to build bespoke hand-crafted multiplication arrays for their particular application. The construction of an optimized multiplier presents trade-offs between pre-processing to generate a smaller array and array reduction. A data structure known as an e-graph has recently been applied to datapath optimization, where the e-graph's ability to efficiently explore trade-offs has been shown to be crucial. We propose an e-graph based rewriting framework to construct optimized multiplier circuits. Such a framework can express alternative multiplier representations and generate customized circuit designs. We demonstrate that the proposed tool, which we call OptiMult, can reduce the latency of a squarer by up to 46% and reduce the latency of a standard multiplier by up to 9% when compared against logic synthesis instantiated components. △ Less

Submitted 10 December, 2023; originally announced December 2023.

Comments: Preprint for work presented at the 2023 Asilomar Conference on Signals, Systems and Computers

arXiv:2310.05079 [pdf, other]

doi 10.18653/v1/2023.emnlp-main.617

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Authors: Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao

Abstract: The inference of Large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has merged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address thi… ▽ More The inference of Large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has merged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and $5\times$ memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced. △ Less

Submitted 21 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: Accepted by EMNLP2023

arXiv:2309.02334 [pdf, other]

doi 10.1109/ICFPT59805.2023.00012

PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Authors: Marta Andronic, George A. Constantinides

Abstract: Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by t… ▽ More Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset. △ Less

Submitted 6 November, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

Journal ref: 2023 International Conference on Field Programmable Technology (ICFPT), Yokohama, Japan, 2023, pp. 60-68

arXiv:2308.05170 [pdf, other]

FPGA Resource-aware Structured Pruning for Real-Time Neural Networks

Authors: Benjamin Ramhorst, Vladimir Loncar, George A. Constantinides

Abstract: Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning and quantization, have been proposed in literature. Pruning sparsifies a neural network, reducing the number of multipl… ▽ More Neural networks achieve state-of-the-art performance in image classification, speech recognition, scientific analysis and many more application areas. Due to the high computational complexity and memory footprint of neural networks, various compression techniques, such as pruning and quantization, have been proposed in literature. Pruning sparsifies a neural network, reducing the number of multiplications and memory. However, pruning often fails to capture properties of the underlying hardware, causing unstructured sparsity and load-balance inefficiency, thus bottlenecking resource improvements. We propose a hardware-centric formulation of pruning, by formulating it as a knapsack problem with resource-aware tensor structures. Evaluated on a range of tasks, including sub-microsecond particle classification at CERN's Large Hadron Collider and fast image classification, the proposed method achieves reductions ranging between 55% and 92% in the DSP utilization and up to 81% in BRAM utilization. △ Less

Submitted 12 December, 2023; v1 submitted 9 August, 2023; originally announced August 2023.

arXiv:2308.00431 [pdf, other]

Datapath Verification via Word-Level E-Graph Rewriting

Authors: Samuel Coward, Emiliano Morini, Bryan Tan, Theo Drane, George Constantinides

Abstract: Formal verification of datapath circuits is challenging as they are subject to intense optimization effort in the design phase. Industrial vendors and design companies deploy equivalence checking against a golden or existing reference design to satisfy correctness concerns. State-of-the-art datapath equivalence checking tools deploy a suite of techniques, including rewriting. We propose a rewritin… ▽ More Formal verification of datapath circuits is challenging as they are subject to intense optimization effort in the design phase. Industrial vendors and design companies deploy equivalence checking against a golden or existing reference design to satisfy correctness concerns. State-of-the-art datapath equivalence checking tools deploy a suite of techniques, including rewriting. We propose a rewriting framework deploying bitwidth dependent rewrites based on the e-graph data structure, providing a powerful assistant to existing tools. The e-graph can generate a path of rewrites between the reference and implementation designs that can be checked by a trusted industry tool. We will demonstrate how the intermediate proofs generated by the assistant enable convergence in a state of the art tool, without which the industrial tool runs for 24 hours without making progress. The intermediate proofs automatically introduced by the assistant also reduce the total proof runtime by up to 6x. △ Less

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.15517 [pdf, other]

A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats

Authors: Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao

Abstract: Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require la… ▽ More Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require larger dynamic ranges. Current hardware can expedite computation for LLMs using compact numerical formats such as low-bitwidth integers or floating-point numbers. Each has advantages: integer operations simplify circuit design, whereas floating-point calculations can enhance accuracy when a wider dynamic range is required. In this work, we seek an efficient data format that combines the best of both worlds: Microscaling (MX) formats. MX formats are efficient data formats that achieve both large dynamic ranges and high memory density. In this paper, we propose a compiler named MASE for exploring mixed-precision MX formats on dataflow hardware accelerators for LLM inference. Our main contributions are twofold. First, we propose a novel orchestration abstraction to explore both software and hardware optimizations with new data formats. Second, MASE achieves LLM inference at an average precision of 4-bits, with minimal to no accuracy degradation. To our knowledge, MASE represents the first effort to harness fine-grain multi-precision MX formats in the design of LLM hardware accelerators. Over a range of LLMs and datasets, MASE achieves an average improvement of 24% in $Δ$ accuracy with an overhead of only 3% in energy efficiency compared to designs using 8-bit fixed-point numbers. △ Less

Submitted 19 April, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

arXiv:2307.08378 [pdf, other]

eGPU: A 750 MHz Class Soft GPGPU for FPGA

Authors: Martin Langhammer, George Constantinides

Abstract: This paper introduces the eGPU, a SIMT soft processor designed for FPGAs. Soft processors typically achieve modest operating frequencies, a fraction of the headline performance claimed by modern FPGA families, and obtain correspondingly modest performance results. We propose a GPGPU architecture structured specifically to take advantage of both the soft logic and embedded features of the FPGA. We… ▽ More This paper introduces the eGPU, a SIMT soft processor designed for FPGAs. Soft processors typically achieve modest operating frequencies, a fraction of the headline performance claimed by modern FPGA families, and obtain correspondingly modest performance results. We propose a GPGPU architecture structured specifically to take advantage of both the soft logic and embedded features of the FPGA. We also consider the physical location of the embedded memories and DSP Blocks relative to the location and number of soft logic elements in order to have a design with balanced resources. Our goal is to create a high performance soft processor able to implement complex portions of FPGA system designs, such as the linear solvers commonly used in wireless systems, through push-button compilation from software. The eGPU architecture is a streaming multiprocessor (SM) machine with 512 threads. Each SM contains 16 scalar processors (SP). Both IEEE754 FP32 and INT32 integer arithmetic are supported. We demonstrate a single SM eGPU in an Intel Agilex device, requiring 5600 ALMs and 24 DSP Blocks, which closes timing at over 770 MHz from a completely unconstrained compile. Multiple eGPUs can also be tightly packed together into a single Agilex FPGA logic region, with minimal speed penalty. △ Less

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2304.08400 [pdf, other]

ATHEENA: A Toolflow for Hardware Early-Exit Network Automation

Authors: Benjamin Biggs, Christos-Savvas Bouganis, George A. Constantinides

Abstract: The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we… ▽ More The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we propose to shift the focus to using the varying difficulty of individual data samples to further improve efficiency and reduce average compute for classification. Input-dependent computation allows for the network to make runtime decisions to finish a task early if the result meets a confidence threshold. Early-Exit network architectures have become an increasingly popular way to implement such behaviour in software. We create: A Toolflow for Hardware Early-Exit Network Automation (ATHEENA), an automated FPGA toolflow that leverages the probability of samples exiting early from such networks to scale the resources allocated to different sections of the network. The toolflow uses the data-flow model of fpgaConvNet, extended to support Early-Exit networks as well as Design Space Exploration to optimize the generated streaming architecture hardware with the goal of increasing throughput/reducing area while maintaining accuracy. Experimental results on three different networks demonstrate a throughput increase of $2.00\times$ to $2.78\times$ compared to an optimized baseline network implementation with no early exits. Additionally, the toolflow can achieve a throughput matching the same baseline with as low as $46\%$ of the resources the baseline requires. △ Less

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2303.01839 [pdf, other]

Automating Constraint-Aware Datapath Optimization using E-Graphs

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: Numerical hardware design requires aggressive optimization, where designers exploit branch constraints, creating optimization opportunities that are valid only on a sub-domain of input space. We developed an RTL optimization tool that automatically learns the consequences of conditional branches and exploits that knowledge to enable deep optimization. The tool deploys custom built program analysis… ▽ More Numerical hardware design requires aggressive optimization, where designers exploit branch constraints, creating optimization opportunities that are valid only on a sub-domain of input space. We developed an RTL optimization tool that automatically learns the consequences of conditional branches and exploits that knowledge to enable deep optimization. The tool deploys custom built program analysis based on abstract interpretation theory, which when combined with a data-structure known as an e-graph simplifies complex reasoning about program properties. Our tool fully-automatically discovers known floating-point architectures from the computer arithmetic literature and out-performs baseline EDA tools, generating up to 33% faster and 41% smaller circuits. △ Less

Submitted 3 March, 2023; originally announced March 2023.

arXiv:2205.14989 [pdf, other]

Combining E-Graphs with Abstract Interpretation

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: E-graphs are a data structure that compactly represents equivalent expressions. They are constructed via the repeated application of rewrite rules. Often in practical applications, conditional rewrite rules are crucial, but their application requires the detection - at the time the e-graph is being built - that a condition is valid in the domain of application. Detecting condition validity amounts… ▽ More E-graphs are a data structure that compactly represents equivalent expressions. They are constructed via the repeated application of rewrite rules. Often in practical applications, conditional rewrite rules are crucial, but their application requires the detection - at the time the e-graph is being built - that a condition is valid in the domain of application. Detecting condition validity amounts to proving a property of the program. Abstract interpretation is a general method to learn such properties, traditionally used in static analysis tools. We demonstrate that abstract interpretation and e-graph analysis naturally reinforce each other through a tight integration because (i) the e-graph clustering of equivalent expressions induces natural precision refinement of abstractions and (ii) precise abstractions allow the application of deeper rewrite rules (and hence potentially even greater precision). We develop the theory behind this intuition and present an exemplar interval arithmetic implementation, which we apply to the FPBench suite. △ Less

Submitted 15 August, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

arXiv:2204.11478 [pdf, other]

Automatic Datapath Optimization using E-Graphs

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: Manual optimization of Register Transfer Level (RTL) datapath is commonplace in industry but holds back development as it can be very time consuming. We utilize the fact that a complex transformation of one RTL into another equivalent RTL can be broken down into a sequence of smaller, localized transformations. By representing RTL as a graph and deploying modern graph rewriting techniques we can a… ▽ More Manual optimization of Register Transfer Level (RTL) datapath is commonplace in industry but holds back development as it can be very time consuming. We utilize the fact that a complex transformation of one RTL into another equivalent RTL can be broken down into a sequence of smaller, localized transformations. By representing RTL as a graph and deploying modern graph rewriting techniques we can automate the circuit design space exploration, allowing us to discover functionally equivalent but optimized architectures. We demonstrate that modern rewriting frameworks can adequately capture a wide variety of complex optimizations performed by human designers on bit-vector manipulating code, including significant error-prone subtleties regarding the validity of transformations under complex interactions of bitwidths. The proposed automated optimization approach is able to reproduce the results of typical industrial manual optimization, resulting in a reduction in circuit area by up to 71%. Not only does our tool discover optimized RTL, but also correctly identifies that the optimal architecture to implement a given arithmetic expression can depend on the width of the operands, thus producing a library of optimized designs rather than the single design point typically generated by manual optimization. In addition, we demonstrate that prior academic work on maximally exploiting carry-save representation and on multiple constant multiplication are both generalized and extended, falling out as special cases of this paper. △ Less

Submitted 26 July, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

arXiv:2203.09191 [pdf, other]

Abstract Interpretation on E-Graphs

Authors: Samuel Coward, George A. Constantinides, Theo Drane

Abstract: Recent e-graph applications have typically considered concrete semantics of expressions, where the notion of equivalence stems from concrete interpretation of expressions. However, equivalences that hold over one interpretation may not hold in an alternative interpretation. Such an observation can be exploited. We consider the application of abstract interpretation to e-graphs, and show that withi… ▽ More Recent e-graph applications have typically considered concrete semantics of expressions, where the notion of equivalence stems from concrete interpretation of expressions. However, equivalences that hold over one interpretation may not hold in an alternative interpretation. Such an observation can be exploited. We consider the application of abstract interpretation to e-graphs, and show that within an e-graph, the lattice meet operation associated with the abstract domain has a natural interpretation for an e-class, leading to improved precision in over-approximation. In this extended abstract, we use Interval Arithmetic (IA) to illustrate this point. △ Less

Submitted 17 March, 2022; originally announced March 2022.

arXiv:2201.11522 [pdf, other]

High-level Synthesis using the Julia Language

Authors: Benjamin Biggs, Ian McInerney, Eric C. Kerrigan, George A. Constantinides

Abstract: The growing proliferation of FPGAs and High-level Synthesis (HLS) tools has led to a large interest in designing hardware accelerators for complex operations and algorithms. However, existing HLS toolflows typically require a significant amount of user knowledge or training to be effective in both industrial and research applications. In this paper, we propose using the Julia language as the basis… ▽ More The growing proliferation of FPGAs and High-level Synthesis (HLS) tools has led to a large interest in designing hardware accelerators for complex operations and algorithms. However, existing HLS toolflows typically require a significant amount of user knowledge or training to be effective in both industrial and research applications. In this paper, we propose using the Julia language as the basis for an HLS tool. The Julia HLS tool aims to decrease the barrier to entry for hardware acceleration by taking advantage of the readability of the Julia language and by allowing the use of the existing large library of standard mathematical functions written in Julia. We present a prototype Julia HLS tool, written in Julia, that transforms Julia code to VHDL. We highlight how features of Julia and its compiler simplified the creation of this tool, and we discuss potential directions for future work. △ Less

Submitted 17 February, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Presented at the 2nd Workshop on Languages, Tools, and Techniques for Accelerator Design (LATTE'22)

arXiv:2201.09568 [pdf, other]

Pearl: Parallel Evolutionary and Reinforcement Learning Library

Authors: Rohan Tangri, Danilo P. Mandic, Anthony G. Constantinides

Abstract: Reinforcement learning is increasingly finding success across domains where the problem can be represented as a Markov decision process. Evolutionary computation algorithms have also proven successful in this domain, exhibiting similar performance to the generally more complex reinforcement learning. Whilst there exist many open-source reinforcement learning and evolutionary computation libraries,… ▽ More Reinforcement learning is increasingly finding success across domains where the problem can be represented as a Markov decision process. Evolutionary computation algorithms have also proven successful in this domain, exhibiting similar performance to the generally more complex reinforcement learning. Whilst there exist many open-source reinforcement learning and evolutionary computation libraries, no publicly available library combines the two approaches for enhanced comparison, cooperation, or visualization. To this end, we have created Pearl (https://github.com/LondonNode/Pearl), an open source Python library designed to allow researchers to rapidly and conveniently perform optimized reinforcement learning, evolutionary computation and combinations of the two. The key features within Pearl include: modular and expandable components, opinionated module settings, Tensorboard integration, custom callbacks and comprehensive visualizations. △ Less

Submitted 24 January, 2022; originally announced January 2022.

arXiv:2112.06887 [pdf, other]

doi 10.1002/advs.202105784

Nonideality-Aware Training for Accurate and Robust Low-Power Memristive Neural Networks

Authors: Dovydas Joksas, Erwei Wang, Nikolaos Barmpatsalos, Wing H. Ng, Anthony J. Kenyon, George A. Constantinides, Adnan Mehonic

Abstract: Recent years have seen a rapid rise of artificial neural networks being employed in a number of cognitive tasks. The ever-increasing computing requirements of these structures have contributed to a desire for novel technologies and paradigms, including memristor-based hardware accelerators. Solutions based on memristive crossbars and analog data processing promise to improve the overall energy eff… ▽ More Recent years have seen a rapid rise of artificial neural networks being employed in a number of cognitive tasks. The ever-increasing computing requirements of these structures have contributed to a desire for novel technologies and paradigms, including memristor-based hardware accelerators. Solutions based on memristive crossbars and analog data processing promise to improve the overall energy efficiency. However, memristor nonidealities can lead to the degradation of neural network accuracy, while the attempts to mitigate these negative effects often introduce design trade-offs, such as those between power and reliability. In this work, we design nonideality-aware training of memristor-based neural networks capable of dealing with the most common device nonidealities. We demonstrate the feasibility of using high-resistance devices that exhibit high $I$-$V$ nonlinearity -- by analyzing experimental data and employing nonideality-aware training, we estimate that the energy efficiency of memristive vector-matrix multipliers is improved by three orders of magnitude ($0.715\ \mathrm{TOPs}^{-1}\mathrm{W}^{-1}$ to $381\ \mathrm{TOPs}^{-1}\mathrm{W}^{-1}$) while maintaining similar accuracy. We show that associating the parameters of neural networks with individual memristors allows to bias these devices towards less conductive states through regularization of the corresponding optimization problem, while modifying the validation procedure leads to more reliable estimates of performance. We demonstrate the universality and robustness of our approach when dealing with a wide range of nonidealities. △ Less

Submitted 5 May, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: 32 pages, 20 figures, 4 tables

Journal ref: Adv. Sci. 2022, 2105784

arXiv:2112.02346 [pdf, other]

doi 10.1145/3490422.3502360

Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference

Authors: Erwei Wang, James J. Davis, Georgios-Ilias Stavrou, Peter Y. K. Cheung, George A. Constantinides, Mohamed S. Abdelfattah

Abstract: FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency… ▽ More FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54x and 1.31x, respectively, while matching its accuracy. This implementation also reaches 2.71x the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67x vs LUTNet, allowing for implementation that was previously impossible on today's largest FPGAs. △ Less

Submitted 2 January, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

Comments: Accepted manuscript uploaded 04/12/21. DOA 22/11/21

arXiv:2105.13217 [pdf, other]

Rigorous Roundoff Error Analysis of Probabilistic Floating-Point Computations

Authors: George Constantinides, Fredrik Dahlqvist, Zvonimir Rakamaric, Rocco Salvia

Abstract: We present a detailed study of roundoff errors in probabilistic floating-point computations. We derive closed-form expressions for the distribution of roundoff errors associated with a random variable, and we prove that roundoff errors are generally close to being uncorrelated with their generating distribution. Based on these theoretical advances, we propose a model of IEEE floating-point arithme… ▽ More We present a detailed study of roundoff errors in probabilistic floating-point computations. We derive closed-form expressions for the distribution of roundoff errors associated with a random variable, and we prove that roundoff errors are generally close to being uncorrelated with their generating distribution. Based on these theoretical advances, we propose a model of IEEE floating-point arithmetic for numerical expressions with probabilistic inputs and an algorithm for evaluating this model. Our algorithm provides rigorous bounds to the output and error distributions of arithmetic expressions over random variables, evaluated in the presence of roundoff errors. It keeps track of complex dependencies between random variables using an SMT solver, and is capable of providing sound but tight probabilistic bounds to roundoff errors using symbolic affine arithmetic. We implemented the algorithm in the PAF tool, and evaluated it on FPBench, a standard benchmark suite for the analysis of roundoff errors. Our evaluation shows that PAF computes tighter bounds than current state-of-the-art on almost all benchmarks. △ Less

Submitted 27 May, 2021; originally announced May 2021.

Comments: Long version of the eponymous CAV 2021 paper

arXiv:2105.04991 [pdf, other]

Graph Theory for Metro Traffic Modelling

Authors: Bruno Scalzo Dees, Yao Lei Xu, Anthony G. Constantinides, Danilo P. Mandic

Abstract: A unifying graph theoretic framework for the modelling of metro transportation networks is proposed. This is achieved by first introducing a basic graph framework for the modelling of the London underground system from a diffusion law point of view. This forms a basis for the analysis of both station importance and their vulnerability, whereby the concept of graph vertex centrality plays a key rol… ▽ More A unifying graph theoretic framework for the modelling of metro transportation networks is proposed. This is achieved by first introducing a basic graph framework for the modelling of the London underground system from a diffusion law point of view. This forms a basis for the analysis of both station importance and their vulnerability, whereby the concept of graph vertex centrality plays a key role. We next explore k-edge augmentation of a graph topology, and illustrate its usefulness both for improving the network robustness and as a planning tool. Upon establishing the graph theoretic attributes of the underlying graph topology, we proceed to introduce models for processing data on such a metro graph. Commuter movement is shown to obey the Fick's law of diffusion, where the graph Laplacian provides an analytical model for the diffusion process of commuter population dynamics. Finally, we also explore the application of modern deep learning models, such as graph neural networks and hyper-graph neural networks, as general purpose models for the modelling and forecasting of underground data, especially in the context of the morning and evening rush hours. Comprehensive simulations including the passenger in- and out-flows during the morning rush hour in London demonstrates the advantages of the graph models in metro planning and traffic management, a formal mathematical approach with wide economic implications. △ Less

Submitted 11 May, 2021; originally announced May 2021.

Comments: International Joint Conference on Neural Networks (IJCNN) 2021. arXiv admin note: text overlap with arXiv:1912.05964, arXiv:2001.00426

arXiv:2102.04270 [pdf, other]

Enabling Binary Neural Network Training on the Edge

Authors: Erwei Wang, James J. Davis, Daniele Moro, Piotr Zielinski, Jia Jie Lim, Claudionor Coelho, Satrajit Chatterjee, Peter Y. K. Cheung, George A. Constantinides

Abstract: The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the co… ▽ More The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this article, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions while inducing little to no accuracy loss vs Courbariaux & Bengio's standard approach. These decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees memory requirement reductions of 3--5$\times$, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.78$\times$ memory reduction. Our work is open-source, and includes the Raspberry Pi-targeted prototype we used to verify our modeled memory decreases and capture the associated energy drops. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency, and safeguarding end-user privacy. △ Less

Submitted 24 September, 2023; v1 submitted 8 February, 2021; originally announced February 2021.

arXiv:2010.08572 [pdf, ps, other]

doi 10.1109/TAC.2022.3145657

Horizon-independent Preconditioner Design for Linear Predictive Control

Authors: Ian McInerney, Eric C. Kerrigan, George A. Constantinides

Abstract: First-order optimization solvers, such as the Fast Gradient Method, are increasingly being used to solve Model Predictive Control problems in resource-constrained environments. Unfortunately, the convergence rate of these solvers is significantly affected by the conditioning of the problem data, with ill-conditioned problems requiring a large number of iterations. To reduce the number of iteration… ▽ More First-order optimization solvers, such as the Fast Gradient Method, are increasingly being used to solve Model Predictive Control problems in resource-constrained environments. Unfortunately, the convergence rate of these solvers is significantly affected by the conditioning of the problem data, with ill-conditioned problems requiring a large number of iterations. To reduce the number of iterations required, we present a simple method for computing a horizon-independent preconditioning matrix for the Hessian of the condensed problem. The preconditioner is based on the block Toeplitz structure of the Hessian. Horizon-independence allows one to use only the predicted system and cost matrices to compute the preconditioner, instead of the full Hessian. The proposed preconditioner has equivalent performance to an optimal preconditioner in numerical examples, producing speedups between 2x and 9x for the Fast Gradient Method. Additionally, we derive horizon-independent spectral bounds for the Hessian in terms of the transfer function of the predicted system, and show how these can be used to compute a novel horizon-independent bound on the condition number for the preconditioned Hessian. △ Less

Submitted 6 April, 2022; v1 submitted 16 October, 2020; originally announced October 2020.

Comments: Accepted for publication in IEEE Transactions on Automatic Control

arXiv:2006.09427 [pdf, ps, other]

doi 10.1109/TC.2020.3003529

Digit Stability Inference for Iterative Methods Using Redundant Number Representation

Authors: He Li, Ian McInerney, James J. Davis, George A. Constantinides

Abstract: In our recent work on iterative computation in hardware, we showed that arbitrary-precision solvers can perform more favorably than their traditional arithmetic equivalents when the latter's precisions are either under- or over-budgeted for the solution of the problem at hand. Significant proportions of these performance improvements stem from the ability to infer the existence of identical most-s… ▽ More In our recent work on iterative computation in hardware, we showed that arbitrary-precision solvers can perform more favorably than their traditional arithmetic equivalents when the latter's precisions are either under- or over-budgeted for the solution of the problem at hand. Significant proportions of these performance improvements stem from the ability to infer the existence of identical most-significant digits between iterations. This technique uses properties of algorithms operating on redundantly represented numbers to allow the generation of those digits to be skipped, increasing efficiency. It is unable, however, to guarantee that digits will stabilize, i.e., never change in any future iteration. In this article, we address this shortcoming, using interval and forward error analyses to prove that digits of high significance will become stable when computing the approximants of systems of linear equations using stationary iterative methods. We formalize the relationship between matrix conditioning and the rate of growth in most-significant digit stability, using this information to converge to our desired results more quickly. Versus our previous work, an exemplary hardware realization of this new technique achieves an up-to 2.2x speedup in the solution of a set of variously conditioned systems using the Jacobi method. △ Less

Submitted 16 June, 2020; originally announced June 2020.

arXiv:2001.00426 [pdf, other]

Graph Signal Processing -- Part III: Machine Learning on Graphs, from Graph Topology to Applications

Authors: Ljubisa Stankovic, Danilo Mandic, Milos Dakovic, Milos Brajovic, Bruno Scalzo, Shengxi Li, Anthony G. Constantinides

Abstract: Many modern data analytics applications on graphs operate on domains where graph topology is not known a priori, and hence its determination becomes part of the problem definition, rather than serving as prior knowledge which aids the problem solution. Part III of this monograph starts by addressing ways to learn graph topology, from the case where the physics of the problem already suggest a poss… ▽ More Many modern data analytics applications on graphs operate on domains where graph topology is not known a priori, and hence its determination becomes part of the problem definition, rather than serving as prior knowledge which aids the problem solution. Part III of this monograph starts by addressing ways to learn graph topology, from the case where the physics of the problem already suggest a possible topology, through to most general cases where the graph topology is learned from the data. A particular emphasis is on graph topology definition based on the correlation and precision matrices of the observed data, combined with additional prior knowledge and structural conditions, such as the smoothness or sparsity of graph connections. For learning sparse graphs (with small number of edges), the least absolute shrinkage and selection operator, known as LASSO is employed, along with its graph specific variant, graphical LASSO. For completeness, both variants of LASSO are derived in an intuitive way, and explained. An in-depth elaboration of the graph topology learning paradigm is provided through several examples on physically well defined graphs, such as electric circuits, linear heat transfer, social and computer networks, and spring-mass systems. As many graph neural networks (GNN) and convolutional graph networks (GCN) are emerging, we have also reviewed the main trends in GNNs and GCNs, from the perspective of graph signal filtering. Tensor representation of lattice-structured graphs is next considered, and it is shown that tensors (multidimensional data arrays) are a special class of graph signals, whereby the graph vertices reside on a high-dimensional regular lattice structure. This part of monograph concludes with two emerging applications in financial data processing and underground transportation networks modeling. △ Less

Submitted 2 January, 2020; originally announced January 2020.

Comments: 61 pages, 55 figures, 40 examples

arXiv:1912.05964 [pdf, other]

Graph Theory and Metro Traffic Modelling

Authors: Bruno Scalzo Dees, Anthony G. Constantinides, Danilo P. Mandic

Abstract: In this article we demonstrate how graph theory can be used to identify those stations in the London underground network which have the greatest influence on the functionality of the traffic, and proceed, in an innovative way, to assess the impact of a station closure on service levels across the city. Such underground network vulnerability analysis offers the opportunity to analyse, optimize and… ▽ More In this article we demonstrate how graph theory can be used to identify those stations in the London underground network which have the greatest influence on the functionality of the traffic, and proceed, in an innovative way, to assess the impact of a station closure on service levels across the city. Such underground network vulnerability analysis offers the opportunity to analyse, optimize and enhance the connectivity of the London underground network in a mathematically tractable and physically meaningful manner. △ Less

Submitted 12 December, 2019; originally announced December 2019.

Comments: 4 pages, 6 figures, 2 tables

arXiv:1912.00867 [pdf, other]

A Probabilistic Approach to Floating-Point Arithmetic

Authors: Fredrik Dahlqvist, Rocco Salvia, George A Constantinides

Abstract: Finite-precision floating point arithmetic unavoidably introduces rounding errors which are traditionally bounded using a worst-case analysis. However, worst-case analysis might be overly conservative because worst-case errors can be extremely rare events in practice. Here we develop a probabilistic model of rounding errors with which it becomes possible to estimate the likelihood that the roundin… ▽ More Finite-precision floating point arithmetic unavoidably introduces rounding errors which are traditionally bounded using a worst-case analysis. However, worst-case analysis might be overly conservative because worst-case errors can be extremely rare events in practice. Here we develop a probabilistic model of rounding errors with which it becomes possible to estimate the likelihood that the rounding error of an algorithm lies within a given interval. Given an input distribution, we show how to compute the distribution of rounding errors. We do this exactly for low precision arithmetic, for high precision arithmetic we derive a simple approximation. The model is then entirely compositional: given a numerical program written in a simple imperative programming language we can recursively compute the distribution of rounding errors at each step of the computation and propagate it through each program instruction. This is done by applying a formalism originally developed by Kozen to formalize the semantics of probabilistic programs. We then discuss an implementation of the model and use it to perform probabilistic range analyses on some benchmarks. △ Less

Submitted 10 December, 2019; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: 9 pages, 6 figures

arXiv:1910.12625 [pdf, other]

LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference

Authors: Erwei Wang, James J. Davis, Peter Y. K. Cheung, George A. Constantinides

Abstract: Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, i… ▽ More Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We describe the realization of both unrolled and tiled LUTNet architectures, with the latter facilitating smaller, less power-hungry deployment over the former while sacrificing area and energy efficiency along with throughput. For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarized neural network implementation, we achieve up to twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable. △ Less

Submitted 2 March, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1904.00938. Accepted manuscript uploaded 02/03/20. DOA 01/03/20

arXiv:1910.10075 [pdf, other]

Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs

Authors: Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, Robert Mullins, Peter Y. K. Cheung, George Constantinides, Cheng-Zhong Xu

Abstract: Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Usi… ▽ More Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic auto-generation framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits in FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities, and permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories for the first time. Furthermore, we show how Tomato produces implementations of networks with various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform closest FPGA-based competitors by at least 2-4x for lantency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second. △ Less

Submitted 21 October, 2019; originally announced October 2019.

Comments: To be published in International Conference on Field Programmable Technology 2019

arXiv:1910.05561 [pdf, other]

Portfolio Cuts: A Graph-Theoretic Framework to Diversification

Authors: Bruno Scalzo Dees, Ljubisa Stankovic, Anthony G. Constantinides, Danilo P. Mandic

Abstract: Investment returns naturally reside on irregular domains, however, standard multivariate portfolio optimization methods are agnostic to data structure. To this end, we investigate ways for domain knowledge to be conveniently incorporated into the analysis, by means of graphs. Next, to relax the assumption of the completeness of graph topology and to equip the graph model with practically relevant… ▽ More Investment returns naturally reside on irregular domains, however, standard multivariate portfolio optimization methods are agnostic to data structure. To this end, we investigate ways for domain knowledge to be conveniently incorporated into the analysis, by means of graphs. Next, to relax the assumption of the completeness of graph topology and to equip the graph model with practically relevant physical intuition, we introduce the portfolio cut paradigm. Such a graph-theoretic portfolio partitioning technique is shown to allow the investor to devise robust and tractable asset allocation schemes, by virtue of a rigorous graph framework for considering smaller, computationally feasible, and economically meaningful clusters of assets, based on graph cuts. In turn, this makes it possible to fully utilize the asset returns covariance matrix for constructing the portfolio, even without the requirement for its inversion. The advantages of the proposed framework over traditional methods are demonstrated through numerical simulations based on real-world price data. △ Less

Submitted 16 October, 2019; v1 submitted 12 October, 2019; originally announced October 2019.

Comments: 5 pages, 4 figures

arXiv:1910.00271 [pdf, other]

ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient Iterative Compute

Authors: He Li, James J. Davis, John Wickerson, George A. Constantinides

Abstract: Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the… ▽ More Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and approximant index increasing in lockstep. Consequently, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. Use of most-significant digit-first arithmetic additionally allows us to declare certain digits to be stable at runtime, avoiding their recalculation in subsequent iterations and thereby increasing performance and decreasing memory footprints. Versus arbitrary-precision iterative solvers without the optimisations we detail herein, we achieve up-to 16$\times$ performance speedups and 1.9x memory savings for the evaluated benchmarks. △ Less

Submitted 1 October, 2019; originally announced October 2019.

arXiv:1909.10325 [pdf, other]

Graph Signal Processing -- Part II: Processing and Analyzing Signals on Graphs

Authors: Ljubisa Stankovic, Danilo Mandic, Milos Dakovic, Milos Brajovic, Bruno Scalzo, Anthony G. Constantinides

Abstract: The focus of Part I of this monograph has been on both the fundamental properties, graph topologies, and spectral representations of graphs. Part II embarks on these concepts to address the algorithmic and practical issues centered round data/signal processing on graphs, that is, the focus is on the analysis and estimation of both deterministic and random data on graphs. The fundamental ideas rela… ▽ More The focus of Part I of this monograph has been on both the fundamental properties, graph topologies, and spectral representations of graphs. Part II embarks on these concepts to address the algorithmic and practical issues centered round data/signal processing on graphs, that is, the focus is on the analysis and estimation of both deterministic and random data on graphs. The fundamental ideas related to graph signals are introduced through a simple and intuitive, yet illustrative and general enough case study of multisensor temperature field estimation. The concept of systems on graph is defined using graph signal shift operators, which generalize the corresponding principles from traditional learning systems. At the core of the spectral domain representation of graph signals and systems is the Graph Discrete Fourier Transform (GDFT). The spectral domain representations are then used as the basis to introduce graph signal filtering concepts and address their design, including Chebyshev polynomial approximation series. Ideas related to the sampling of graph signals are presented and further linked with compressive sensing. Localized graph signal analysis in the joint vertex-spectral domain is referred to as the vertex-frequency analysis, since it can be considered as an extension of classical time-frequency analysis to the graph domain of a signal. Important topics related to the local graph Fourier transform (LGFT) are covered, together with its various forms including the graph spectral and vertex domain windows and the inversion conditions and relations. A link between the LGFT with spectral varying window and the spectral graph wavelet transform (SGWT) is also established. Realizations of the LGFT and SGWT using polynomial (Chebyshev) approximations of the spectral functions are further considered. Finally, energy versions of the vertex-frequency representations are introduced. △ Less

Submitted 23 September, 2019; originally announced September 2019.

Comments: 60 pages, 50 figures,

arXiv:1909.05767 [pdf, other]

Unitary Shift Operators on a Graph

Authors: Bruno Scalzo Dees, Ljubisa Stankovic, Milos Dakovic, Anthony G. Constantinides, Danilo P. Mandic

Abstract: A unitary shift operator (GSO) for signals on a graph is introduced, which exhibits the desired property of energy preservation over both backward and forward graph shifts. For rigour, the graph differential operator is also derived in an analytical form. The commutativity relation of the shift operator with the Fourier transform is next explored in conjunction with the proposed GSO to introduce a… ▽ More A unitary shift operator (GSO) for signals on a graph is introduced, which exhibits the desired property of energy preservation over both backward and forward graph shifts. For rigour, the graph differential operator is also derived in an analytical form. The commutativity relation of the shift operator with the Fourier transform is next explored in conjunction with the proposed GSO to introduce a graph discrete Fourier transform (GDFT) which, unlike existing approaches, ensures the orthogonality of GDFT bases and admits a natural frequency-domain interpretation. The proposed GDFT is shown to allow for a coherent definition of the graph discrete Hilbert transform (GDHT) and the graph analytic signal. The advantages of the proposed GSO are demonstrated through illustrative examples. △ Less

Submitted 17 September, 2019; v1 submitted 12 September, 2019; originally announced September 2019.

Comments: 5 pages, 3 figures

arXiv:1908.01596 [pdf, other]

A Class of Doubly Stochastic Shift Operators for Random Graph Signals and their Boundedness

Authors: Bruno Scalzo Dees, Ljubisa Stankovic, Milos Dakovic, Anthony G. Constantinides, Danilo P. Mandic

Abstract: A class of doubly stochastic graph shift operators (GSO) is proposed, which is shown to exhibit: (i) lower and upper $L_{2}$-boundedness for locally stationary random graph signals; (ii) $L_{2}$-isometry for \textit{i.i.d.} random graph signals with the asymptotic increase in the incoming neighbourhood size of vertices; and (iii) preservation of the mean of any graph signal. These properties are o… ▽ More A class of doubly stochastic graph shift operators (GSO) is proposed, which is shown to exhibit: (i) lower and upper $L_{2}$-boundedness for locally stationary random graph signals; (ii) $L_{2}$-isometry for \textit{i.i.d.} random graph signals with the asymptotic increase in the incoming neighbourhood size of vertices; and (iii) preservation of the mean of any graph signal. These properties are obtained through a statistical consistency analysis of the graph shift, and by exploiting the dual role of the doubly stochastic GSO as a Markov (diffusion) matrix and as an unbiased expectation operator. Practical utility of the class of doubly stochastic GSOs is demonstrated in a real-world multi-sensor signal filtering setting. △ Less

Submitted 7 February, 2020; v1 submitted 5 August, 2019; originally announced August 2019.

Comments: 5 pages, 1 figure

arXiv:1907.03471 [pdf, other]

Vertex-Frequency Graph Signal Processing: A review

Authors: Ljubisa Stankovic, Danilo P. Mandic, Milos Dakovic, Bruno Scalzo, Milos Brajovic, Ervin Sejdic, Anthony G. Constantinides

Abstract: Graph signal processing deals with signals which are observed on an irregular graph domain. While many approaches have been developed in classical graph theory to cluster vertices and segment large graphs in a signal independent way, signal localization based approaches to the analysis of data on graph represent a new research direction which is also a key to big data analytics on graphs. To this… ▽ More Graph signal processing deals with signals which are observed on an irregular graph domain. While many approaches have been developed in classical graph theory to cluster vertices and segment large graphs in a signal independent way, signal localization based approaches to the analysis of data on graph represent a new research direction which is also a key to big data analytics on graphs. To this end, after an overview of the basic definitions in graphs and graph signals, we present and discuss a localized form of the graph Fourier transform. To establish an analogy with classical signal processing, spectral- and vertex-domain definitions of the localization window are given next. The spectral and vertex localization kernels are then related to the wavelet transform, followed by a study of filtering and inversion of the localized graph Fourier transform. For rigour, the analysis of energy representation and frames in the localized graph Fourier transform is extended to the energy forms of vertex-frequency distributions, which operate even without the need to apply localization windows. Another link with classical signal processing is established through the concept of local smoothness, which is subsequently related to the particular paradigm of signal smoothness on graphs. This all represents a comprehensive account of the relation of general vertex-frequency analysis with classical time-frequency analysis, and important but missing link for more advanced applications of graphs signal processing. The theory is supported by illustrative and practically relevant examples. △ Less

Submitted 26 December, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

Comments: 16 pages, 12 figures

arXiv:1905.02438 [pdf, other]

doi 10.1098/rsta.2019.0051

Rethinking Arithmetic for Deep Neural Networks

Authors: George A. Constantinides

Abstract: We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understandin… ▽ More We consider efficiency in the implementation of deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented as a directed graph describing a Boolean circuit. We make this observation precise, leading naturally to an understanding of practical neural networks as discrete functions, and show that so-called binarised neural networks are functionally complete. In general, our results suggest that it is valuable to consider Boolean circuits as neural networks, leading to the question of which circuit topologies are promising. We argue that continuity is central to generalisation in learning, explore the interaction between data coding, network topology, and node functionality for continuity, and pose some open questions for future research. As a first step to bridging the gap between continuous and Boolean views of neural network accelerators, we present some recent results from our work on LUTNet, a novel Field-Programmable Gate Array inference approach. Finally, we conclude with additional possible fruitful avenues for research bridging the continuous and discrete views of neural networks. △ Less

Submitted 17 September, 2019; v1 submitted 7 May, 2019; originally announced May 2019.

arXiv:1904.00938 [pdf, other]

LUTNet: Rethinking Inference in FPGA Soft Logic

Authors: Erwei Wang, James J. Davis, Peter Y. K. Cheung, George A. Constantinides

Abstract: Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is c… ▽ More Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable. △ Less

Submitted 1 April, 2019; originally announced April 2019.

Comments: Accepted manuscript uploaded 01/04/19. DOA 03/03/19

arXiv:1903.11179 [pdf, other]

An Example-Driven Introduction to Data Analytics on Graphs

Authors: Ljubisa Stankovic, Danilo Mandic, Milos Dakovic, Ilya Kisil, Ervin Sejdic, Anthony G. Constantinides

Abstract: Graphs are irregular structures which naturally account for data integrity, however, traditional approaches have been established outside Signal Processing, and largely focus on analyzing the underlying graphs rather than signals on graphs. Given the rapidly increasing availability of multisensor and multinode measurements, likely recorded on irregular or ad-hoc grids, it would be extremely advant… ▽ More Graphs are irregular structures which naturally account for data integrity, however, traditional approaches have been established outside Signal Processing, and largely focus on analyzing the underlying graphs rather than signals on graphs. Given the rapidly increasing availability of multisensor and multinode measurements, likely recorded on irregular or ad-hoc grids, it would be extremely advantageous to analyze such structured data as graph signals and thus benefit from the ability of graphs to incorporate spatial awareness of the sensing locations, sensor importance, and local versus global sensor association. The aim of this lecture note is therefore to establish a common language between graph signals, defined on irregular signal domains, and some of the most fundamental paradigms in DSP, such as spectral analysis of multichannel signals, system transfer function, digital filter design, parameter estimation, and optimal filter design. This is achieved through a physically meaningful and intuitive real-world example of geographically distributed multisensor temperature estimation. A similar spatial multisensor arrangement is already widely used in Signal Processing curricula to introduce minimum variance estimators and Kalman filters \cite{HM}, and by adopting this framework we facilitate a seamless integration of graph theory into the curriculum of existing DSP courses. By bridging the gap between standard approaches and graph signal processing, we also show that standard methods can be thought of as special cases of their graph counterparts, evaluated on linear graphs. It is hoped that our approach would not only help to demystify graph theoretic approaches in education and research but it would also empower practitioners to explore a whole host of otherwise prohibitive modern applications. △ Less

Submitted 12 May, 2019; v1 submitted 26 March, 2019; originally announced March 2019.

Comments: 10 pages, 3 figures, 5 boxes

arXiv:1902.02221 [pdf, ps, other]

Bounding Computational Complexity under Cost Function Scaling in Predictive Control

Authors: Ian McInerney, Eric C. Kerrigan, George A. Constantinides

Abstract: We present a framework for upper bounding the number of iterations required by first-order optimization algorithms implementing constrained LQR controllers. We derive new bounds for the condition number and extremal eigenvalues of the primal and dual Hessian matrices when the cost function is scaled. These bounds are horizon-independent, allowing for their use with receding, variable and decreasin… ▽ More We present a framework for upper bounding the number of iterations required by first-order optimization algorithms implementing constrained LQR controllers. We derive new bounds for the condition number and extremal eigenvalues of the primal and dual Hessian matrices when the cost function is scaled. These bounds are horizon-independent, allowing for their use with receding, variable and decreasing horizon controllers. We considerably relax prior assumptions on the structure of the weight matrices and assume only that the system is Schur-stable and the primal Hessian of the quadratic program (QP) is positive-definite. Our analysis uses the Toeplitz structure of the QP matrices to relate their spectrum to the transfer function of the system, allowing for the use of system-theoretic techniques to compute the bounds. Using these bounds, we can compute the effect on the computational complexity of trading off the input energy used against the state deviation. An example system shows a three-times increase in algorithm iterations between the two extremes, with the state 2-norm decreased by only 5% despite a greatly increased state deviation penalty. △ Less

Submitted 6 February, 2019; originally announced February 2019.

arXiv:1901.06955 [pdf, other]

doi 10.1145/3309551

Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going

Authors: Erwei Wang, James J. Davis, Ruizhe Zhao, Ho-Cheung Ng, Xinyu Niu, Wayne Luk, Peter Y. K. Cheung, George A. Constantinides

Abstract: Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms… ▽ More Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field. △ Less

Submitted 8 July, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

Comments: Accepted manuscript uploaded 21/01/19. DOA 15/01/19

Journal ref: ACM Comput. Surv. 52, 2, Article 40 (May 2019), 39 pages

arXiv:1807.08720 [pdf, other]

A Data Analytics Perspective of the Clarke and Related Transforms in Power Grid Analysis

Authors: Danilo P. Mandic, Sithan Kanna, Yili Xia, Ahmad Moniri, Anthony G. Constantinides

Abstract: Affordable and reliable electric power is fundamental to modern society and economy, with the Smart Grid becoming an increasingly important factor in power generation and distribution. In order to fully exploit it advantages, the analysis of modern Smart Grid requires close collaboration and convergence between power engineers and signal processing and machine learning experts. Current analysis te… ▽ More Affordable and reliable electric power is fundamental to modern society and economy, with the Smart Grid becoming an increasingly important factor in power generation and distribution. In order to fully exploit it advantages, the analysis of modern Smart Grid requires close collaboration and convergence between power engineers and signal processing and machine learning experts. Current analysis techniques are typically derived from a Circuit Theory perspective; such an approach is adequate for only fully balanced systems operating at nominal conditions and non-obvious for data scientists - this is prohibitive for the analysis of dynamically unbalanced smart grids, where Data Analytics is not only well suited but also necessary. A common language that bridges the gap between Circuit Theory and Data Analytics, and the respective community of experts, would be a natural step forward. To this end, we revisit the Clarke and related transforms from a subspace, latent component, and spatial frequency analysis frameworks, to establish fundamental relationships between the standard three-phase transforms and modern Data Analytics. We show that the Clarke transform admits a physical interpretation as a "spatial dimensionality" reduction technique which is equivalent to Principal Component Analysis (PCA) for balanced systems, but is sub-optimal for dynamically unbalanced systems, such as the Smart Grid, while the related Park transform performs further "temporal" dimensionality reduction. Such a perspective opens numerous new avenues for the use Signal Processing and Machine Learning in power grid research, and paves the way for innovative optimisation, transformation, and analysis techniques that are not accessible to arrive at from the standard Circuit Theory principles, as demonstrated in this work through the possibility of simultaneous frequency estimation and fault detection. △ Less

Submitted 23 July, 2018; originally announced July 2018.

Comments: 20 pages, 11 figures

arXiv:1710.08802 [pdf, ps, other]

doi 10.1109/TCST.2018.2855666

Automatic Software and Computing Hardware Co-design for Predictive Control

Authors: Bulat Khusainov, Eric C. Kerrigan, George A. Constantinides

Abstract: Model Predictive Control (MPC) is a computationally demanding control technique that allows dealing with multiple-input and multiple-output systems, while handling constraints in a systematic way. The necessity of solving an optimization problem at every sampling instant often (i) limits the application scope to slow dynamical systems and/or (ii) results in expensive computational hardware impleme… ▽ More Model Predictive Control (MPC) is a computationally demanding control technique that allows dealing with multiple-input and multiple-output systems, while handling constraints in a systematic way. The necessity of solving an optimization problem at every sampling instant often (i) limits the application scope to slow dynamical systems and/or (ii) results in expensive computational hardware implementations. Traditional MPC design is based on manual tuning of software and computational hardware design parameters, which leads to suboptimal implementations. This paper proposes a framework for automating the MPC software and computational hardware co-design, while achieving the optimal trade-off between computational resource usage and controller performance. The proposed approach is based on using a multi-objective optimization algorithm, namely BiMADS. Two test studies are considered: Central Processing Unit (CPU) and Field-Programmable Gate Array (FPGA) implementations of fast gradient-based MPC. Numerical experiments show that optimization-based design outperforms Latin Hypercube Sampling (LHS), a statistical sampling-based design exploration technique. △ Less

Submitted 24 October, 2017; originally announced October 2017.

Journal ref: IEEE Transactions on Control Systems Technology (Volume: 27, Issue: 5, Sept. 2019)

arXiv:1710.08737 [pdf, ps, other]

doi 10.1016/j.conengprac.2018.06.016

Nonlinear Predictive Control on a Heterogeneous Computing Platform

Authors: Bulat Khusainov, Eric C. Kerrigan, Andrea Suardi, George A. Constantinides

Abstract: We propose an implementation of an interior-point-based nonlinear predictive controller on a heterogeneous processor. The workload can be split between a general-purpose CPU and a field-programmable gate array to trade off the contradicting design objectives of control performance and computational resource usage. A new way of exploiting the structure of the KKT matrix yields significant memory sa… ▽ More We propose an implementation of an interior-point-based nonlinear predictive controller on a heterogeneous processor. The workload can be split between a general-purpose CPU and a field-programmable gate array to trade off the contradicting design objectives of control performance and computational resource usage. A new way of exploiting the structure of the KKT matrix yields significant memory savings. We report an 18x memory saving, compared to existing approaches, and a 36x speedup over a software implementation with an ARM Cortex-A9 processor. We also introduce a new release of Protoip, which abstracts low-level details of heterogeneous programming and allows processor-in-the-loop verification. △ Less

Submitted 24 October, 2017; originally announced October 2017.

Journal ref: Control Engineering Practice, Volume 78, September 2018, Pages 105-115

Showing 1–50 of 53 results for author: Constantinides, G