Showing 1–18 of 18 results for author: Moshovos, A

Searching in archive cs.
  1. arXiv:2204.13666

    cs.LG cs.AR

    Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training

    Authors: Miloš Nikolić, Enrique Torres Sanchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Kareem Ibrahim, Andreas Moshovos

    Abstract: The transfer of tensors from/to memory during neural network training dominates time and energy. To improve energy efficiency and performance, research has been exploring ways to use narrower data representations. So far, these attempts relied on user-directed trial-and-error to achieve convergence. We present methods that relieve users from this responsibility. Our methods dynamically adjust the…

    Submitted 16 May, 2024; v1 submitted 28 April, 2022; originally announced April 2022.

  2. Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models

    Authors: Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos

    Abstract: Increasingly larger and better Transformer models keep advancing state-of-the-art accuracy and capability for Natural Language Processing applications. These models demand more computational power, storage, and energy. Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-b…

    Submitted 23 March, 2022; originally announced March 2022.

    Comments: Accepted at the 49th IEEE/ACM International Symposium on Computer Architecture (ISCA '22)

  3. arXiv:2201.08830

    cs.AR cs.LG

    APack: Off-Chip, Lossless Data Compression for Efficient Deep Learning Inference

    Authors: Alberto Delmas Lascorz, Mostafa Mahmoud, Andreas Moshovos

    Abstract: Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present APack, a simple and effective, lossless, off-chip memory compression technique for fixed-point quantized models. APack reduces data widths by exploiting the non-uniform value distribution in deep learning applications. APack can be used…

    Submitted 21 January, 2022; originally announced January 2022.

  4. arXiv:2010.08065

    cs.AR cs.AI

    FPRaker: A Processing Element For Accelerating Neural Network Training

    Authors: Omar Mohamed Awad, Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Ciaran Bannon, Anand Jayarajan, Gennady Pekhimenko, Andreas Moshovos

    Abstract: We present FPRaker, a processing element for composing training accelerators. FPRaker processes several floating-point multiply-accumulation operations concurrently and accumulates their result into a higher precision accumulator. FPRaker boosts performance and energy efficiency during training by taking advantage of the values that naturally appear during training. Specifically, it processes the…

    Submitted 15 October, 2020; originally announced October 2020.

  5. TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

    Authors: Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, Andreas Moshovos

    Abstract: TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost, sparse input operand interconnect comprising an 8-input multiplexer per multipli…

    Submitted 1 September, 2020; originally announced September 2020.

  6. GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

    Authors: Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, Andreas Moshovos

    Abstract: Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art…

    Submitted 26 September, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

    Comments: Accepted at the 53rd IEEE/ACM International Symposium on Microarchitecture - MICRO 2020

  7. arXiv:2002.03090

    cs.LG stat.ML

    BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization

    Authors: Miloš Nikolić, Ghouthi Boukli Hacene, Ciaran Bannon, Alberto Delmas Lascorz, Matthieu Courbariaux, Yoshua Bengio, Vincent Gripon, Andreas Moshovos

    Abstract: Neural networks have demonstrably achieved state-of-the-art accuracy using low-bitlength integer quantization, yielding both execution time and energy benefits on existing hardware designs that support short bitlengths. However, the question of finding the minimum bitlength for a desired accuracy remains open. We introduce a training method for minimizing inference bitlength at any granularity whi…

    Submitted 11 August, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

  8. arXiv:1910.06548

    cs.LG cs.CV stat.ML

    Training CNNs faster with Dynamic Input and Kernel Downsampling

    Authors: Zissis Poulos, Ali Nouri, Andreas Moshovos

    Abstract: We reduce training time in convolutional networks (CNNs) with a method that, for some of the mini-batches: a) scales down the resolution of input images via downsampling, and b) reduces the forward pass operations via pooling on the convolution filters. Training is performed in an interleaved fashion; some batches undergo the regular forward and backpropagation passes with original network paramet…

    Submitted 15 October, 2019; originally announced October 2019.

    Comments: 12 pages, 4 figures

  9. arXiv:1805.04513

    cs.NE cs.AR cs.LG

    Laconic Deep Learning Computing

    Authors: Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos

    Abstract: We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models and without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level the amount of work performed during inference for image classification models can be consistently reduced by two orders of magnitude. In the best case studied of a sparse varia…

    Submitted 10 May, 2018; originally announced May 2018.

  10. arXiv:1804.06732

    cs.NE

    DPRed: Making Typical Activation and Weight Values Matter In Deep Learning Computing

    Authors: Alberto Delmas, Sayeh Sharify, Patrick Judd, Kevin Siu, Milos Nikolic, Andreas Moshovos

    Abstract: We show that selecting a single data type (precision) for all values in Deep Neural Networks, even if that data type is different per layer, amounts to worst case design. Much shorter data types can be used if we target the common case by adjusting the precision at a much finer granularity. We propose Dynamic Precision Reduction (DPRed), where we group weights and activations and encode them using…

    Submitted 17 December, 2018; v1 submitted 16 April, 2018; originally announced April 2018.

  11. arXiv:1803.03688

    cs.NE

    Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How

    Authors: Alberto Delmas, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Andreas Moshovos

    Abstract: We show that, during inference with Convolutional Neural Networks (CNNs), more than 2x to 8x ineffectual work can be exposed if instead of targeting those weights and activations that are zero, we target different combinations of value stream properties. We demonstrate a practical application with Bit-Tactical (TCL), a hardware accelerator which exploits weight sparsity, per layer precision varia…

    Submitted 9 March, 2018; originally announced March 2018.

    Comments: An earlier version of this work titled "JaZ: Enabling Innovation Towards Chaff-Free Deep Learning Computing" was submitted for blind review

  12. arXiv:1707.09068

    cs.NE

    Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability

    Authors: Alberto Delmas, Sayeh Sharify, Patrick Judd, Andreas Moshovos

    Abstract: Tartan (TRT), a hardware accelerator for inference with Deep Neural Networks (DNNs), is presented and evaluated on Convolutional Neural Networks. TRT exploits the variable per layer precision requirements of DNNs to deliver execution time that is proportional to the precision p in bits used per layer for convolutional and fully-connected layers. Prior art has demonstrated an accelerator with the s…

    Submitted 27 July, 2017; originally announced July 2017.

  13. arXiv:1706.07853

    cs.DC cs.AR cs.LG

    Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

    Authors: Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, Andreas Moshovos

    Abstract: Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs), is presented. In LM every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers LM's execution time scales inversely proportionally with the precisions of both weights and activations. For fully-connected layers LM's performance scales inversel…

    Submitted 16 May, 2018; v1 submitted 23 June, 2017; originally announced June 2017.

  14. arXiv:1706.00504

    cs.NE cs.LG

    Dynamic Stripes: Exploiting the Dynamic Precision Requirements of Activation Values in Neural Networks

    Authors: Alberto Delmas, Patrick Judd, Sayeh Sharify, Andreas Moshovos

    Abstract: Stripes is a Deep Neural Network (DNN) accelerator that uses bit-serial computation to offer performance that is proportional to the fixed-point precision of the activation values. The fixed-point precisions are determined a priori using profiling and are selected at a per layer granularity. This paper presents Dynamic Stripes, an extension to Stripes that detects precision variance at runtime and…

    Submitted 1 June, 2017; originally announced June 2017.

    Comments: 3 pages, 3 figures

  15. arXiv:1705.00125

    cs.LG

    Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing

    Authors: Patrick Judd, Alberto Delmas, Sayeh Sharify, Andreas Moshovos

    Abstract: We discuss several modifications and extensions over the previously proposed Cnvlutin (CNV) accelerator for convolutional and fully-connected layers of Deep Learning Networks. We first describe different encodings of the activations that are deemed ineffectual. The encodings have different memory overhead and energy characteristics. We propose using a level of indirection when accessing activations f…

    Submitted 28 April, 2017; originally announced May 2017.

    Comments: 6 pages, 5 figures

  16. Memory Controller Design Under Cloud Workloads

    Authors: Mostafa Mahmoud, Andreas Moshovos

    Abstract: This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate: 1) Several recently proposed memory scheduling policies are not a good match for these scale-out workloa…

    Submitted 30 November, 2016; originally announced November 2016.

    Journal ref: 2016 IEEE International Symposium on Workload Characterization (IISWC)

  17. arXiv:1610.06920

    cs.LG cs.AI cs.AR cs.CV

    Bit-pragmatic Deep Neural Network Computing

    Authors: J. Albericio, P. Judd, A. Delmás, S. Sharify, A. Moshovos

    Abstract: We quantify a source of ineffectual computations when processing the multiplications of the convolutional layers in Deep Neural Networks (DNNs) and propose Pragmatic (PRA), an architecture that exploits it improving performance and energy efficiency. The source of these ineffectual computations is best understood in the context of conventional multipliers which generate internally multiple terms,…

    Submitted 20 October, 2016; originally announced October 2016.

  18. arXiv:1511.05236

    cs.LG cs.NE

    Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets

    Authors: Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, Raquel Urtasun, Andreas Moshovos

    Abstract: This work investigates how using reduced precision data in Convolutional Neural Networks (CNNs) affects network accuracy during classification. More specifically, this study considers networks where each layer may use different precision data. Our key result is the observation that the tolerance of CNNs to reduced precision data not only varies across networks, a well established observation, but…

    Submitted 8 January, 2016; v1 submitted 16 November, 2015; originally announced November 2015.

    Comments: Submitted to ICLR 2016, 12 pages, 5 figures