Showing 1–35 of 35 results for author: Whatmough, P

  1. arXiv:2406.13175  [pdf, other]

    cs.LG cs.AI

    Sparse High Rank Adapters

    Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

    Abstract: Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30%…

    Submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2404.09317  [pdf, other]

    cs.AR cs.AI

    Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator

    Authors: Abhishek Tyagi, Reiley Jeyapaul, Chuteng Zhu, Paul Whatmough, Yuhao Zhu

    Abstract: As Neural Processing Units (NPU) or accelerators are increasingly deployed in a variety of applications including safety critical applications such as autonomous vehicle, and medical imaging, it is critical to understand the fault-tolerance nature of the NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU being utilised in embedded and IoT applications. We pe…

    Submitted 14 April, 2024; originally announced April 2024.

  3. arXiv:2402.15319  [pdf, other]

    cs.LG cs.CL

    GPTVQ: The Blessing of Dimensionality for LLM Quantization

    Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

    Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining un…

    Submitted 23 February, 2024; originally announced February 2024.

  4. arXiv:2301.10999  [pdf, other]

    cs.LG cs.PF

    PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices

    Authors: Yuji Chai, Devashree Tripathy, Chuteng Zhou, Dibakar Gope, Igor Fedorov, Ramon Matas, David Brooks, Gu-Yeon Wei, Paul Whatmough

    Abstract: The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortuna…

    Submitted 26 January, 2023; originally announced January 2023.

  5. arXiv:2212.02649  [pdf, other]

    cs.AR cs.CR cs.LG

    Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators

    Authors: Abhishek Tyagi, Yiming Gan, Shaoshan Liu, Bo Yu, Paul Whatmough, Yuhao Zhu

    Abstract: As Deep Neural Networks (DNNs) are increasingly deployed in safety critical and privacy sensitive applications such as autonomous driving and biometric authentication, it is critical to understand the fault-tolerance nature of DNNs. Prior work primarily focuses on metrics such as Failures In Time (FIT) rate and the Silent Data Corruption (SDC) rate, which quantify how often a device fails. Instead…

    Submitted 7 January, 2024; v1 submitted 5 December, 2022; originally announced December 2022.

  6. arXiv:2208.08562  [pdf, other]

    cs.CV cs.AI stat.ML

    Restructurable Activation Networks

    Authors: Kartikeya Bhardwaj, James Ward, Caleb Tung, Dibakar Gope, Lingchuan Meng, Igor Fedorov, Alex Chalfin, Paul Whatmough, Danny Loh

    Abstract: Is it possible to restructure the non-linear activation functions in a deep network to create hardware-efficient models? To address this question, we propose a new paradigm called Restructurable Activation Networks (RANs) that manipulate the amount of non-linearity in models to improve their hardware-awareness and efficiency. First, we propose RAN-explicit (RAN-e) -- a new hardware-aware search sp…

    Submitted 7 September, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

    Comments: This work was presented at an Arm AI virtual tech talk. Video is available at https://www.youtube.com/watch?v=EUqFNE28Kq4

  7. arXiv:2201.05842  [pdf, other]

    cs.LG

    UDC: Unified DNAS for Compressible TinyML Models

    Authors: Igor Fedorov, Ramon Matas, Hokchhay Tann, Chuteng Zhou, Matthew Mattina, Paul Whatmough

    Abstract: Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware address the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across…

    Submitted 5 January, 2023; v1 submitted 15 January, 2022; originally announced January 2022.

  8. arXiv:2112.14340  [pdf, other]

    eess.IV cs.CV cs.LG

    Super-Efficient Super Resolution for Fast Adversarial Defense at the Edge

    Authors: Kartikeya Bhardwaj, Dibakar Gope, James Ward, Paul Whatmough, Danny Loh

    Abstract: Autonomous systems are highly vulnerable to a variety of adversarial attacks on Deep Neural Networks (DNNs). Training-free model-agnostic defenses have recently gained popularity due to their speed, ease of deployment, and ability to work across many DNNs. To this end, a new technique has emerged for mitigating attacks on image classification DNNs, namely, preprocessing adversarial images using su…

    Submitted 28 December, 2021; originally announced December 2021.

    Comments: This preprint is for personal use only. The official article will appear in proceedings of Design, Automation & Test in Europe (DATE), 2022, as part of the Special Initiative on Autonomous Systems Design (ASD)

  9. arXiv:2111.06503  [pdf, other]

    cs.AR cs.ET cs.LG

    AnalogNets: ML-HW Co-Design of Noise-robust TinyML Models and Always-On Analog Compute-in-Memory Accelerator

    Authors: Chuteng Zhou, Fernando Garcia Redondo, Julian Büchel, Irem Boybat, Xavier Timoneda Comas, S. R. Nandakumar, Shidhartha Das, Abu Sebastian, Manuel Le Gallo, Paul N. Whatmough

    Abstract: Always-on TinyML perception tasks in IoT applications require very high energy efficiency. Analog compute-in-memory (CiM) using non-volatile memory (NVM) promises high efficiency and also provides self-contained on-chip model storage. However, analog CiM introduces new practical considerations, including conductance drift, read/write noise, fixed analog-to-digital (ADC) converter gain, etc. These…

    Submitted 10 November, 2021; originally announced November 2021.

  10. arXiv:2111.04263  [pdf, other]

    cs.LG cs.DC

    Federated Learning Based on Dynamic Regularization

    Authors: Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, Venkatesh Saligrama

    Abstract: We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. We view Federated Learning problem primarily from a communication perspective and allow more device level computations to save transmission costs. We point out a fundamental dilemma, in that the minima o…

    Submitted 9 November, 2021; v1 submitted 7 November, 2021; originally announced November 2021.

    Comments: Slightly extended version of ICLR 2021 Paper

  11. arXiv:2107.07983  [pdf, other]

    cs.AR cs.LG

    S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration

    Authors: Zhi-Gang Liu, Paul N. Whatmough, Yuhao Zhu, Matthew Mattina

    Abstract: Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices. Prior sparse CNN accelerators largely exploit un-structured sparsity and achieve significant speedups. Due to the unbounded, largely unpredictable sparsity patterns, however, exploiting unstructured sparsity requires complicated hardware design with significant energy an…

    Submitted 6 January, 2022; v1 submitted 16 July, 2021; originally announced July 2021.

    Comments: Accepted by HPCA 2022, the 28th IEEE International Symposium on High-Performance Computer Architecture (HPCA-28)

  12. arXiv:2103.08764  [pdf, other]

    cs.CV cs.RO

    Fast and Accurate: Video Enhancement using Sparse Depth

    Authors: Yu Feng, Patrick Hansen, Paul N. Whatmough, Guoyu Lu, Yuhao Zhu

    Abstract: This paper presents a general framework to build fast and accurate algorithms for video enhancement tasks such as super-resolution, deblurring, and denoising. Essential to our framework is the realization that the accuracy, rather than the density, of pixel flows is what is required for high-quality video enhancement. Most of prior works take the opposite approach: they estimate dense (per-pixel)-…

    Submitted 14 September, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

  13. arXiv:2102.07071  [pdf, other]

    cs.LG

    Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices

    Authors: Urmish Thakker, Paul N. Whatmough, Zhigang Liu, Matthew Mattina, Jesse Beu

    Abstract: Structured matrices, such as those derived from Kronecker products (KP), are effective at compressing neural networks, but can lead to unacceptable accuracy loss when applied to large models. In this paper, we propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix. Doping facilitates additional degrees of freedom for a small number of parameters, allowing the…

    Submitted 14 February, 2021; originally announced February 2021.

    Comments: Accepted to be published at MLSys 2021

  14. arXiv:2102.02988  [pdf, other]

    cs.RO cs.AI cs.AR cs.LG

    AutoPilot: Automating SoC Design Space Exploration for SWaP Constrained Autonomous UAVs

    Authors: Srivatsan Krishnan, Zishen Wan, Kshitij Bhardwaj, Paul Whatmough, Aleksandra Faust, Sabrina Neuman, Gu-Yeon Wei, David Brooks, Vijay Janapa Reddi

    Abstract: Building domain-specific accelerators for autonomous unmanned aerial vehicles (UAVs) is challenging due to a lack of systematic methodology for designing onboard compute. Balancing a computing system for a UAV requires considering both the cyber (e.g., sensor rate, compute performance) and physical (e.g., payload weight) characteristics that affect overall performance. Iterating over the many comp…

    Submitted 10 September, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

  15. arXiv:2101.11750  [pdf, other]

    cs.IT cs.AI cs.CC cs.LG

    Information contraction in noisy binary neural networks and its implications

    Authors: Chuteng Zhou, Quntao Zhuang, Matthew Mattina, Paul N. Whatmough

    Abstract: Neural networks have gained importance as the machine learning models that achieve state-of-the-art performance on large-scale image classification, object detection and natural language processing tasks. In this paper, we consider noisy binary neural networks, where each neuron has a non-zero probability of producing an incorrect output. These noisy models may arise from biological, physical and…

    Submitted 1 February, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

    Comments: 14 pages, 8 figures

  16. arXiv:2011.14203  [pdf, other]

    cs.AR cs.CL

    EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

    Authors: Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul N. Whatmough, Alexander M. Rush, David Brooks, Gu-Yeon Wei

    Abstract: Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimi…

    Submitted 5 September, 2021; v1 submitted 28 November, 2020; originally announced November 2020.

    Comments: 12 pages plus references. Paper to appear at the 54th IEEE/ACM International Symposium on Microarchitecture (MICRO 2021)

  17. arXiv:2010.11267  [pdf, other]

    cs.LG

    MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers

    Authors: Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas Navarro, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi, Matthew Mattina, Paul N. Whatmough

    Abstract: Executing machine learning workloads locally on resource constrained microcontrollers (MCUs) promises to drastically expand the application space of IoT. However, so-called TinyML presents severe technical challenges, as deep neural network inference demands a large compute and memory budget. To address this challenge, neural architecture search (NAS) promises to help design accurate ML models tha…

    Submitted 12 April, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: 10 pages, 8 figures, 3 tables

  18. arXiv:2009.02381  [pdf, other]

    cs.AR cs.LG

    Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration

    Authors: Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina

    Abstract: Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. In this paper, we address…

    Submitted 12 October, 2020; v1 submitted 4 September, 2020; originally announced September 2020.

    ACM Class: B.0; I.2

  19. arXiv:2008.06967  [pdf, other]

    cs.CV cs.AR

    Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation

    Authors: Yu Feng, Boyuan Tian, Tiancheng Xu, Paul Whatmough, Yuhao Zhu

    Abstract: Point cloud analytics is poised to become a key workload on battery-powered embedded and mobile platforms in a wide range of emerging application domains, such as autonomous driving, robotics, and augmented reality, where efficiency is paramount. This paper proposes Mesorasi, an algorithm-architecture co-designed system that simultaneously improves the performance and energy efficiency of point cl…

    Submitted 16 August, 2020; originally announced August 2020.

    Journal ref: Proceedings of the 53rd (2020) Annual IEEE/ACM International Symposium on Microarchitecture

  20. arXiv:2005.11138  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids

    Authors: Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, Paul N. Whatmough

    Abstract: Modern speech enhancement algorithms achieve remarkable noise suppression by means of large recurrent neural networks (RNNs). However, large RNNs limit practical deployment in hearing aid hardware (HW) form-factors, which are battery powered and run on resource-constrained microcontroller units (MCUs) with limited memory capacity and compute capability. In this work, we use model compression techn…

    Submitted 20 May, 2020; originally announced May 2020.

    Comments: First four authors contributed equally. For audio samples, see https://github.com/BoseCorp/efficient-neural-speech-enhancement

  21. arXiv:2005.08098  [pdf, other]

    cs.DC cs.AR cs.LG eess.SP

    Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference

    Authors: Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina

    Abstract: Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements (PEs), with very efficient local data movement, well suited to accelerating GEMM, and widely deployed in industry. In this work, we describe two significant improve…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Accepted by IEEE Computer Architecture Letters on 3/4/2020

  22. arXiv:2002.10711  [pdf, other]

    cs.LG cs.CV stat.ML

    Searching for Winograd-aware Quantized Networks

    Authors: Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina

    Abstract: Lightweight architectural designs of Convolutional Neural Networks (CNNs) together with quantization have paved the way for the deployment of demanding computer vision applications on mobile devices. Parallel to this, alternative formulations to the convolution operation such as FFT, Strassen and Winograd, have been adapted for use in CNNs offering further speedups. Winograd convolutions are the f…

    Submitted 25 February, 2020; originally announced February 2020.

    Comments: Published as a conference paper at MLSys 2020

    Journal ref: Proceedings of Machine Learning and Systems (2020), 14-29

  23. arXiv:2001.08896  [pdf, other]

    cs.LG cs.CL stat.ML

    Compressing Language Models using Doped Kronecker Products

    Authors: Urmish Thakker, Paul N. Whatmough, Zhi-Gang Liu, Matthew Mattina, Jesse Beu

    Abstract: Kronecker Products (KP) have been used to compress IoT RNN Applications by 15-38x compression factors, achieving better results than traditional compression methods. However when KP is applied to large Natural Language Processing tasks, it leads to significant accuracy loss (approx 26%). This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks, by allowing a…

    Submitted 17 November, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: Link to Workshop (https://research.fb.com/programs/on-device-intelligence-workshop/)

    Journal ref: Presented at On-device Intelligence Workshop at Third Conference on Machine Learning and Systems (MLSys) 2020

  24. arXiv:2001.04974  [pdf, other]

    cs.LG cs.AI cs.AR stat.ML

    Noisy Machines: Understanding Noisy Neural Networks and Enhancing Robustness to Analog Hardware Errors Using Distillation

    Authors: Chuteng Zhou, Prad Kadambi, Matthew Mattina, Paul N. Whatmough

    Abstract: The success of deep learning has brought forth a wave of interest in computer hardware design to better meet the high demands of neural network inference. In particular, analog computing hardware has been heavily motivated specifically for accelerating neural networks, based on either electronic, optical or photonic devices, which may well achieve lower power consumption than conventional digital…

    Submitted 14 January, 2020; originally announced January 2020.

  25. CHIPKIT: An agile, reusable open-source framework for rapid test chip development

    Authors: Paul Whatmough, Marco Donato, Glenn Ko, Sae-Kyu Lee, David Brooks, Gu-Yeon Wei

    Abstract: The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time consuming and fraught with challenges. Therefore,…

    Submitted 26 May, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

  26. arXiv:1912.04481  [pdf, other]

    cs.LG cs.DC

    SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads

    Authors: Sam Likun Xi, Yuan Yao, Kshitij Bhardwaj, Paul Whatmough, Gu-Yeon Wei, David Brooks

    Abstract: In recent years, there has been tremendous advances in hardware acceleration of deep neural networks. However, most of the research has focused on optimizing accelerator microarchitecture for higher performance and energy efficiency on a per-layer basis. We find that for overall single-batch inference latency, the accelerator may only make up 25-40%, with the rest spent on data movement and in the…

    Submitted 11 December, 2019; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: 14 pages, 20 figures

  27. arXiv:1911.07954  [pdf, ps, other]

    eess.IV cs.CV cs.LG

    ISP4ML: Understanding the Role of Image Signal Processing in Efficient Deep Learning Vision Systems

    Authors: Patrick Hansen, Alexey Vilkin, Yury Khrustalev, James Imber, David Hanwell, Matthew Mattina, Paul N. Whatmough

    Abstract: Convolutional neural networks (CNNs) are now predominant components in a variety of computer vision (CV) systems. These systems typically include an image signal processor (ISP), even though the ISP is traditionally designed to produce images that look appealing to humans. In CV systems, it is not clear what the role of the ISP is, or if it is even required at all for accurate prediction. In this…

    Submitted 17 March, 2021; v1 submitted 18 November, 2019; originally announced November 2019.

    Comments: 13 pages, 11 figures

  28. ASV: Accelerated Stereo Vision System

    Authors: Yu Feng, Paul Whatmough, Yuhao Zhu

    Abstract: Estimating depth from stereo vision cameras, i.e., "depth from stereo", is critical to emerging intelligent applications deployed in energy- and performance-constrained devices, such as augmented reality headsets and mobile autonomous robots. While existing stereo vision systems make trade-offs between accuracy, performance and energy-efficiency, we describe ASV, an accelerated stereo vision syste…

    Submitted 15 November, 2019; originally announced November 2019.

    Comments: MICRO 2019

    Journal ref: In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). ACM, New York, NY, USA, 643-656 (2019)

  29. arXiv:1905.12107  [pdf, ps, other]

    cs.LG cs.CV

    SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers

    Authors: Igor Fedorov, Ryan P. Adams, Matthew Mattina, Paul N. Whatmough

    Abstract: The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware pl…

    Submitted 28 May, 2019; originally announced May 2019.

  30. arXiv:1902.11128  [pdf, other]

    cs.CV cs.AR cs.LG stat.ML

    FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning

    Authors: Paul N. Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo, Matthew Mattina

    Abstract: The computational demands of computer vision tasks based on state-of-the-art Convolutional Neural Network (CNN) image classification far exceed the energy budgets of mobile devices. This paper proposes FixyNN, which consists of a fixed-weight feature extractor that generates ubiquitous CNN features, and a conventional programmable CNN accelerator which processes a dataset-specific CNN. Image class…

    Submitted 26 February, 2019; originally announced February 2019.

    Comments: 10 pages, 8 figures, paper accepted at SysML2019 conference

  31. arXiv:1812.01672  [pdf, other]

    cs.LG stat.ML

    Energy Efficient Hardware for On-Device CNN Inference via Transfer Learning

    Authors: Paul Whatmough, Chuteng Zhou, Patrick Hansen, Matthew Mattina

    Abstract: On-device CNN inference for real-time computer vision applications can result in computational demands that far exceed the energy budgets of mobile devices. This paper proposes FixyNN, a co-designed hardware accelerator platform which splits a CNN model into two parts: a set of layers that are fixed in the hardware platform as a front-end fixed-weight feature extractor, and the remaining layers wh…

    Submitted 26 February, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

    Comments: 4 pages, 2 figures, NeurIPS 2018 on-device ML workshop

  32. arXiv:1811.02883  [pdf, other]

    cs.DC cs.AR

    SCALE-Sim: Systolic CNN Accelerator Simulator

    Authors: Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, Tushar Krishna

    Abstract: Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to insights on both the design trade-offs and efficient mapping strategies for systolic-array based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE…

    Submitted 1 February, 2019; v1 submitted 16 October, 2018; originally announced November 2018.

  33. arXiv:1803.11232  [pdf, other]

    cs.CV

    Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision

    Authors: Yuhao Zhu, Anand Samajdar, Matthew Mattina, Paul Whatmough

    Abstract: Continuous computer vision (CV) tasks increasingly rely on convolutional neural networks (CNN). However, CNNs have massive compute demands that far exceed the performance and energy constraints of mobile devices. In this paper, we propose and develop an algorithm-architecture co-designed system, Euphrates, that simultaneously improves the energy-efficiency and performance of continuous vision task…

    Submitted 29 March, 2018; originally announced March 2018.

  34. arXiv:1801.06274  [pdf, other]

    cs.LG cs.AR cs.NE

    Mobile Machine Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective

    Authors: Yuhao Zhu, Matthew Mattina, Paul Whatmough

    Abstract: Machine learning is playing an increasingly significant role in emerging mobile application domains such as AR/VR, ADAS, etc. Accordingly, hardware architects have designed customized hardware for machine learning algorithms, especially neural networks, to improve compute efficiency. However, machine learning is typically just one processing stage in complex end-to-end applications, involving mult…

    Submitted 1 February, 2018; v1 submitted 18 January, 2018; originally announced January 2018.

  35. arXiv:1411.2860  [pdf, other]

    cs.MM cs.MS

    Precision-Energy-Throughput Scaling Of Generic Matrix Multiplication and Convolution Kernels Via Linear Projections

    Authors: Mohammad Ashraful Anam, Paul N. Whatmough, Yiannis Andreopoulos

    Abstract: Generic matrix multiplication (GEMM) and one-dimensional convolution/cross-correlation (CONV) kernels often constitute the bulk of the compute- and memory-intensive processing within image/audio recognition and matching systems. We propose a novel method to scale the energy and processing throughput of GEMM and CONV kernels for such error-tolerant multimedia applications by adjusting the precision…

    Submitted 11 November, 2014; originally announced November 2014.

    Journal ref: IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 11, pp. 1860-1873, Nov. 2014